Sunday, August 10, 2014

Decoding Your AWS Bill (Part 3) Loading a Data Warehouse

In the last two posts (part 1, part 2) in this series we used PowerShell to gleam information from our monthly AWS billing report.  While you can use those scripts to learn a great amount of information from about your AWS usage, you will eventually outgrow PowerShell.  In this post I will show you how to load the bill into SQL Server for more detailed analysis.  

In the prior posts we used the monthly reports.  These reports contain a single line item for each resource for the entire month.  In this post we are going to use the hourly report.  This report shows the detailed expenses incurred each hour.  If you shut a machine down at night you will see that reflected in the report.

Creating a Staging Schema

The first step is to create a table in SQL Server to hold the data.  I am calling this a staging table because this post will present a star schema for a data warehouse.  But, you could simply use this one table and run reports directly against it.  

AWS does not have a lot of detail on the schema for the billing reports.  I have been using the following table schema for about a year now and have worked out most of the kinks.  Some of my fields are likely larger than needed, but since my data eventually I end up in a star schema, I prefer to have a little buffer.   Also note that I am using consolidated billing, but I am ignoring the blended rates.  If you wanted to add blended rates, a real should suffice.  

CREATE TABLE [dbo].[StageLineItems](
 [ReportFileName] [nvarchar](256) NOT NULL,
 [InvoiceID] [varchar](16) NOT NULL,
 [PayerAccountId] [bigint] NOT NULL,
 [LinkedAccountId] [bigint] NOT NULL,
 [RecordType] [varchar](16) NOT NULL,
 [RecordID] [decimal](26, 0) NOT NULL,
 [ProductName] [varchar](64) NOT NULL,
 [RateId] [int] NOT NULL,
 [SubscriptionId] [int] NOT NULL,
 [PricingPlanId] [int] NOT NULL,
 [UsageType] [varchar](64) NOT NULL,
 [Operation] [varchar](32) NOT NULL,
 [AvailabilityZone] [varchar](16) NULL,
 [ReservedInstance] [char](1) NOT NULL,
 [ItemDescription] [varchar](256) NOT NULL,
 [UsageStartDate] [datetime] NOT NULL,
 [UsageEndDate] [datetime] NOT NULL,
 [UsageQuantity] [real] NOT NULL,
 [Rate] [real] NOT NULL,
 [Cost] [real] NOT NULL,
 [ResourceId] [varchar](128) NULL,
 [user:Name] [varchar](256) NULL,
 [user:Owner] [varchar](256) NULL,
 [user:Project] [varchar](256) NULL

Loading the Data

Once you have the table created you will need to load the report.  I use an SSIS job for this.  I am loading the "Detailed Line Items with Resources and Tags" report.  This is the most detailed report available.  I created a custom script task in SSIS to download the report from S3.  This package runs every six hours to refresh the data in the warehouse.

My script expects the account id, bucket name, access key, and secret key.  In addition, you can optional specify the month, year and file name.  If you don't specify the optional params, the script will calculate the current month and return it do you can use it later in the package.  See below.

The code for my script task is below.  It calculates the dates and then uses the AWS .Net API to download the file.  Note that you must GAC AWSSDK.dll before SSIS will load it. 

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

using Amazon;
using Amazon.S3;
using Amazon.S3.Model;
using System.IO;

namespace ST_a8746170d9b84f1da9baca053cbdc671.csproj
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
        #region VSTA generated code
        enum ScriptResults
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure

        public void Main()
            string accountId = (string)Dts.Variables["accountId"].Value;
            string accessKey = (string)Dts.Variables["accessKey"].Value;
            string secretKey = (string)Dts.Variables["secretKey"].Value;
            string bucketName = (string)Dts.Variables["bucketName"].Value;

            System.DateTime date = DateTime.Now.AddDays(-5);
            int month = (int)Dts.Variables["Month"].Value;
            if (month == 0) { month = date.Month; }
            int year = (int)Dts.Variables["Year"].Value;
            if (year == 0) { year = date.Year; }

                string keyName = string.Format("{0}-aws-billing-detailed-line-items-with-resources-and-tags-{1:0000}-{2:00}", accountId, year, month);
                Dts.Variables["keyName"].Value = keyName;

                string zipFilePath = Path.Combine(Path.GetTempPath(), keyName);
                Dts.Variables["zipFilePath"].Value = zipFilePath;

                string csvFilePath = zipFilePath.Replace(".zip", "");
                Dts.Variables["csvFilePath"].Value = csvFilePath;

                    AmazonS3Config config = new AmazonS3Config()
                        ServiceURL = ""

                    using (IAmazonS3 client = Amazon.AWSClientFactory.CreateAmazonS3Client(accessKey, secretKey, RegionEndpoint.USEast1))
                        GetObjectRequest request = new GetObjectRequest()
                            BucketName = bucketName,
                            Key = keyName

                        using (GetObjectResponse response = client.GetObject(request))
                            if (File.Exists(zipFilePath)) File.Delete(zipFilePath);
                    Dts.TaskResult = (int)ScriptResults.Success;
            catch (AmazonS3Exception amazonS3Exception)
                Dts.TaskResult = (int)ScriptResults.Failure;

Notice that the report we are downloading is a .zip file.  The detailed report can get very large.  I am simply shelling out to 7Zip from the SSIS package to decompress the report.  Finally, note that the report contains a few summary lines you will likely want to exclude from the report when you load.  I use the following filter.

RecordType == "LineItem" && RateId != 0 && !ISNULL(RecordId)

Star Schema

The final piece of the puzzle is loading the data into a warehouse for reporting.  I'm not going to take you through the details of designing a data warehouse, but I can share the schema I am using.  I analyzed the data a few times using a Data Profiling Task, and ultimately settled on the following dimension tables

I hope this series has been helpful.  

1 comment:

  1. Somewhat on topic, but have you found a way to determine what reservation type a non-reserved instance would need from the detailed billing report? The reason being, there must be a way to determine what type of reservation an instance needs, the 3 pieces needed are:
    1. "Platform" (essentially the OS)
    2. "InstanceType"
    3. "AZ"

    If you have these 3 values you should have no problem knowing exactly what reservation to purchase for the instance in question. AWS Support is of no help and says there is no way but companies like Cloudability, CloudCheckr and others have found a way and I'm fairly sure it's via the detailed billing file (they harvest this). Describe-ami API call was not very accurate when it came to custom AMI's, etc as you have to use the description of the AMI to determine this which is not guaranteed in anyway to have the OS name. Here is the full list of reservation platforms available, which do not exactly match what's shown in the "itemDescription" fields in our detailed billing file but perhaps a mapping table will provide the final linkage for a 1:1 relationship.

    Reservation Platform options:
    Linux/UNIX (Amazon VPC)
    SUSE Linux (Amazon VPC)
    SUSE Linux
    Red Hat Enterprise Linux (Amazon VPC)
    Red Hat Enterprise Linux
    Windows (Amazon VPC)
    Windows with SQL Server Standard (Amazon VPC)
    Windows with SQL Server Standard
    Windows with SQL Server Web (Amazon VPC)
    Windows with SQL Server Web
    Windows with SQL Server Enterprise (Amazon VPC)