Sunday, August 10, 2014

Decoding Your AWS Bill (Part 3) Loading a Data Warehouse

In the last two posts (part 1, part 2) in this series we used PowerShell to glean information from our monthly AWS billing report.  While you can use those scripts to learn a great deal about your AWS usage, you will eventually outgrow PowerShell.  In this post I will show you how to load the bill into SQL Server for more detailed analysis.  

In the prior posts we used the monthly reports.  These reports contain a single line item for each resource for the entire month.  In this post we are going to use the hourly report.  This report shows the detailed expenses incurred each hour.  If you shut a machine down at night you will see that reflected in the report.

Creating a Staging Schema

The first step is to create a table in SQL Server to hold the data.  I am calling this a staging table because this post will present a star schema for a data warehouse.  But, you could simply use this one table and run reports directly against it.  

AWS does not publish a lot of detail on the schema of the billing reports.  I have been using the following table schema for about a year now and have worked out most of the kinks.  Some of my fields are likely larger than needed, but since my data eventually ends up in a star schema, I prefer to have a little buffer.   Also note that I am using consolidated billing, but I am ignoring the blended rates.  If you wanted to add blended rates, a real column should suffice.  

CREATE TABLE [dbo].[StageLineItems](
 [ReportFileName] [nvarchar](256) NOT NULL,
 [InvoiceID] [varchar](16) NOT NULL,
 [PayerAccountId] [bigint] NOT NULL,
 [LinkedAccountId] [bigint] NOT NULL,
 [RecordType] [varchar](16) NOT NULL,
 [RecordID] [decimal](26, 0) NOT NULL,
 [ProductName] [varchar](64) NOT NULL,
 [RateId] [int] NOT NULL,
 [SubscriptionId] [int] NOT NULL,
 [PricingPlanId] [int] NOT NULL,
 [UsageType] [varchar](64) NOT NULL,
 [Operation] [varchar](32) NOT NULL,
 [AvailabilityZone] [varchar](16) NULL,
 [ReservedInstance] [char](1) NOT NULL,
 [ItemDescription] [varchar](256) NOT NULL,
 [UsageStartDate] [datetime] NOT NULL,
 [UsageEndDate] [datetime] NOT NULL,
 [UsageQuantity] [real] NOT NULL,
 [Rate] [real] NOT NULL,
 [Cost] [real] NOT NULL,
 [ResourceId] [varchar](128) NULL,
 [user:Name] [varchar](256) NULL,
 [user:Owner] [varchar](256) NULL,
 [user:Project] [varchar](256) NULL
)
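As noted above, you can run reports directly against this staging table.  A minimal sketch using Invoke-Sqlcmd from the SqlServer module; the server and database names here are placeholders, not from the original setup:

```powershell
# Sum cost per AWS service straight from the staging table.
# 'localhost' and 'AWSBilling' are placeholder assumptions; adjust for your environment.
Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'AWSBilling' -Query @'
SELECT ProductName, SUM(Cost) AS TotalCost
FROM dbo.StageLineItems
GROUP BY ProductName
ORDER BY TotalCost DESC;
'@ | Format-Table ProductName, TotalCost -AutoSize
```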

Loading the Data

Once you have the table created you will need to load the report.  I use an SSIS job for this.  I am loading the "Detailed Line Items with Resources and Tags" report.  This is the most detailed report available.  I created a custom script task in SSIS to download the report from S3.  This package runs every six hours to refresh the data in the warehouse.

My script expects the account id, bucket name, access key, and secret key.  In addition, you can optionally specify the month, year, and file name.  If you don't specify the optional parameters, the script will calculate the current month and return it so you can use it later in the package.  See below.

The code for my script task is below.  It calculates the dates and then uses the AWS .Net API to download the file.  Note that you must GAC AWSSDK.dll before SSIS will load it. 
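As a reminder, registering the assembly might look something like the following; both paths below vary by machine and SDK version, so treat them as placeholders:

```powershell
# Install AWSSDK.dll into the Global Assembly Cache so the SSIS runtime can load it.
# Both paths are placeholder assumptions; adjust for your Windows SDK and AWS SDK versions.
& "C:\Program Files (x86)\Microsoft SDKs\Windows\v8.1A\bin\NETFX 4.5.1 Tools\gacutil.exe" `
    /i "C:\Program Files (x86)\AWS SDK for .NET\bin\Net45\AWSSDK.dll"
```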

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;

using Amazon;
using Amazon.S3;
using Amazon.S3.Model;
using System.IO;

namespace ST_a8746170d9b84f1da9baca053cbdc671.csproj
{
    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {
        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        }
        #endregion

        public void Main()
        {
            string accountId = (string)Dts.Variables["accountId"].Value;
            string accessKey = (string)Dts.Variables["accessKey"].Value;
            string secretKey = (string)Dts.Variables["secretKey"].Value;
            string bucketName = (string)Dts.Variables["bucketName"].Value;

            //Back up a few days; early in the month the current report may not exist yet
            System.DateTime date = DateTime.Now.AddDays(-5);
            int month = (int)Dts.Variables["Month"].Value;
            if (month == 0) { month = date.Month; }
            int year = (int)Dts.Variables["Year"].Value;
            if (year == 0) { year = date.Year; }

            string keyName = string.Format("{0}-aws-billing-detailed-line-items-with-resources-and-tags-{1:0000}-{2:00}.csv.zip", accountId, year, month);
            Dts.Variables["keyName"].Value = keyName;

            string zipFilePath = Path.Combine(Path.GetTempPath(), keyName);
            Dts.Variables["zipFilePath"].Value = zipFilePath;

            string csvFilePath = zipFilePath.Replace(".zip", "");
            Dts.Variables["csvFilePath"].Value = csvFilePath;

            try
            {
                using (IAmazonS3 client = Amazon.AWSClientFactory.CreateAmazonS3Client(accessKey, secretKey, RegionEndpoint.USEast1))
                {
                    GetObjectRequest request = new GetObjectRequest()
                    {
                        BucketName = bucketName,
                        Key = keyName
                    };

                    using (GetObjectResponse response = client.GetObject(request))
                    {
                        //Replace any stale copy with the freshly downloaded report
                        if (File.Exists(zipFilePath)) File.Delete(zipFilePath);
                        response.WriteResponseStreamToFile(zipFilePath);
                    }
                }
                Dts.TaskResult = (int)ScriptResults.Success;
            }
            catch (AmazonS3Exception amazonS3Exception)
            {
                Dts.TaskResult = (int)ScriptResults.Failure;
            }
        }
    }
}

Notice that the report we are downloading is a .zip file; the detailed report can get very large.  I simply shell out to 7-Zip from the SSIS package to decompress the report.  Finally, note that the report contains a few summary lines that you will likely want to exclude when you load it.  I use the following filter.

RecordType == "LineItem" && RateId != 0 && !ISNULL(RecordId)
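If you want to try the decompression and the same filter outside SSIS, a rough PowerShell equivalent might look like this.  It is only a sketch: it assumes the .NET 4.5 ZipFile class is available, and the paths are placeholders:

```powershell
# Unzip the detail report and drop the summary lines, mirroring the SSIS filter above.
# The zip and extract paths are placeholder assumptions.
Add-Type -AssemblyName System.IO.Compression.FileSystem

$zipFilePath = "$env:TEMP\detailed-line-items.csv.zip"
$extractPath = "$env:TEMP\billing"
[System.IO.Compression.ZipFile]::ExtractToDirectory($zipFilePath, $extractPath)

# The zip contains a single .csv with the same base name as the archive
$csvFilePath = Join-Path $extractPath ((Get-Item $zipFilePath).BaseName)

# Keep only real line items: same conditions as the SSIS expression
$lineItems = Import-Csv $csvFilePath | Where-Object {
    ($_.RecordType -eq 'LineItem') -and ($_.RateId -ne '0') -and ($_.RecordId)
}
```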

Star Schema

The final piece of the puzzle is loading the data into a warehouse for reporting.  I'm not going to take you through the details of designing a data warehouse, but I can share the schema I am using.  I analyzed the data a few times using a Data Profiling Task, and ultimately settled on the following dimension tables.

I hope this series has been helpful.  

Saturday, August 9, 2014

Decoding Your AWS Bill (Part 2) Chargeback with Tags

It took 6 months but I finally got time to continue the series on Decoding Your AWS bill.  In the last post, we used PowerShell to download and query the monthly bill.  In this post we use tags to create a cost allocation report.  In the next, and final post in this series, I will show you how to load the hourly detail report into SQL Server.

Let's assume that we have multiple project teams at our company and they all have servers running in the same AWS account.  We want to "charge back" each team for their usage.  We begin by tagging each instance with a project name (see figure below).  Notice that I also include a name and owner.

This is a good start, but we learned in part one that charges are allocated to the instances as well as the volumes and network interfaces that are attached to them.  Therefore, we have to tag those resources as well as the instance itself.  It is probably unrealistic to ask our users to tag all the resources, so let's create a script that copies tags from the instance to any resources attached to it.  This way our users only have to remember to tag their instances.

The script below will read all of the tags from the instance and copy them to each resource.  I have something very similar scheduled to run once a day on each of my accounts.

(Get-EC2Instance).Instances | % {

    $Instance = $_

    #First, get the tags from each instance
    $NameTag = $Instance.Tags | Where-Object { $_.Key -eq 'Name' }
    $OwnerTag = $Instance.Tags | Where-Object { $_.Key -eq 'Owner' }
    $ProjectTag = $Instance.Tags | Where-Object { $_.Key -eq 'Project' }

    $Instance.BlockDeviceMappings | % {
        #Copy the tags to each volume
        If($NameTag -ne $null) {New-EC2Tag -Resources $_.Ebs.VolumeId -Tag $NameTag}
        If($OwnerTag -ne $null) {New-EC2Tag -Resources $_.Ebs.VolumeId -Tag $OwnerTag}
        If($ProjectTag -ne $null) {New-EC2Tag -Resources $_.Ebs.VolumeId -Tag $ProjectTag}
    }

    $Instance.NetworkInterfaces | % {
        #Copy the tags to each NIC
        If($NameTag -ne $null) {New-EC2Tag -Resources $_.NetworkInterfaceId -Tag $NameTag}
        If($OwnerTag -ne $null) {New-EC2Tag -Resources $_.NetworkInterfaceId -Tag $OwnerTag}
        If($ProjectTag -ne $null) {New-EC2Tag -Resources $_.NetworkInterfaceId -Tag $ProjectTag}
    }
}

This is a good start, but it will not scale well.  It makes an API call for every resource every time we run it.  That works for a handful of instances, but as we add more instances the script will take longer and longer to run.  It would be better to cache the tags collection and only update those resources that need to be changed.  Here is a much better version.

$AllTags = Get-EC2Tag

Function Rectify-Tag {
    #This function only updates a tag if the current value is not correct
    Param ($ResourceId, $Tag)
    #Find the current tag in the cached collection
    $OldTag = $AllTags | Where-Object {(($_.ResourceId -eq $ResourceId) -and ($_.Key -eq $Tag.Key))}
    If(($OldTag -eq $null) -or ($OldTag.Value -ne $Tag.Value)) {
        #The current tag is wrong; let's fix it.
        New-EC2Tag -Resources $ResourceId -Tag $Tag
    }
}

(Get-EC2Instance).Instances | % {

    $Instance = $_

    #First, get the tags from each instance
    $NameTag = $Instance.Tags | Where-Object { $_.Key -eq 'Name' }
    $OwnerTag = $Instance.Tags | Where-Object { $_.Key -eq 'Owner' }
    $ProjectTag = $Instance.Tags | Where-Object { $_.Key -eq 'Project' }

    $Instance.BlockDeviceMappings | % {
        #Copy the tags to each volume
        If($NameTag -ne $null) {Rectify-Tag -ResourceId $_.Ebs.VolumeId -Tag $NameTag}
        If($OwnerTag -ne $null) {Rectify-Tag -ResourceId $_.Ebs.VolumeId -Tag $OwnerTag}
        If($ProjectTag -ne $null) {Rectify-Tag -ResourceId $_.Ebs.VolumeId -Tag $ProjectTag}
    }

    $Instance.NetworkInterfaces | % {
        #Copy the tags to each NIC
        If($NameTag -ne $null) {Rectify-Tag -ResourceId $_.NetworkInterfaceId -Tag $NameTag}
        If($OwnerTag -ne $null) {Rectify-Tag -ResourceId $_.NetworkInterfaceId -Tag $OwnerTag}
        If($ProjectTag -ne $null) {Rectify-Tag -ResourceId $_.NetworkInterfaceId -Tag $ProjectTag}
    }
}

Now we have to add the tags we created to our reports.  I assume at this point that you have billing reports enabled.  If not, see my prior blog post.  Log into the web console using your account credentials (not IAM credentials) and click on your name in the top right corner.  From the dropdown, click "Billing and Cost Management."  Choose "Preferences" from the menu down the left side of the screen.  Finally, click the "Manage Report Tags" link toward the end of the screen.

Now, find the tags you want to include in the report (see the figure below).  Make sure you include the project tag.

Now we can download and query the report just like we did in the last post.  The only change is that we are going to use the "$AccountId-aws-cost-allocation-$Year-$Month.csv" report rather than the "$AccountId-aws-billing-csv-$Year-$Month.csv" report we used before.

In addition, note that the custom tags we added will appear in the report as user:tag.  So our Project tag will appear as user:Project.  Therefore, if we wanted to return all the costs associated with the ERP project we would use a PowerShell query like this:

$PayerLineItems |  Where-Object {$_.'user:Project' -eq 'ERP'} | Measure-Object TotalCost -Sum

Now, we have a little problem.  You may notice that if you add up the costs allocated to all projects, they do not sum to the invoice total.  This is expected; there are a few costs we did not capture.  First, we only tagged EC2.  If you want to allocate other services, you will need to develop a strategy similar to the one we used above for EC2.  Second, you may have a support contract that adds 10% to the bill.  Third, there are some EC2 costs, like snapshots, that do not include tags in the report.  There is nothing we can do with these last two but allocate them to the projects as overhead.  The script below does just that.  I'm not going to go into detail, but you can look through the script to understand it.

Set-AWSCredentials LAB

Function Get-CostAllocationReport {
    Param ($BucketName, $AccountId, $Month, $Year)

    #If no BucketName was specified, assume it is the same as the account alias
    If([System.String]::IsNullOrEmpty($BucketName)) {$BucketName = Get-IAMAccountAlias}

    #If no AccountId was specified, use the account of the current user
    If([System.String]::IsNullOrEmpty($AccountId)) {$AccountId = (Get-IAMUser).ARN.Replace('arn:aws:iam::','').Substring(0,12)}

    #If no month and year were specified, use last month
    If([System.String]::IsNullOrEmpty($Month)) {$Month = If((Get-Date).Month -eq 1){12}Else{(Get-Date).Month - 1}} 
    If([System.String]::IsNullOrEmpty($Year)) {$Year = If($Month -eq 12){(Get-Date).Year - 1}Else{(Get-Date).Year}} 
    $Month = "{0:D2}" -f [int]$Month #Pad single digit with 0 

    #Get latest report 
    $Key = "$AccountId-aws-cost-allocation-$Year-$Month.csv" 
    $FileName = "$env:TEMP\$AccountId-aws-cost-allocation-$Year-$Month.csv" 

    #Download the report from S3
    If(Test-Path $FileName) {Remove-Item $FileName}
    $Null = Read-S3Object -BucketName $BucketName -Key $Key -File $FileName 

    #Strip off the first line of the file, which is a warning rather than a header
    #("Don't see your tags in the report? New tags are excluded by default...")
    $Temp = Get-Content -Path $FileName | Select -Skip 1 
    $Temp | Set-Content -Path $FileName

    Import-Csv $FileName 
}

$Report = Get-CostAllocationReport

Write-Host "Statement total =" ($Report | Where-Object {$_.RecordType -eq 'StatementTotal'}).TotalCost
$PayerLineItems = $Report | Where-Object {$_.RecordType -eq 'PayerLineItem'} 

$Summary = $PayerLineItems | Group-Object -Property 'user:Project' | % { 
    New-Object psobject -Property @{ 
        Project = $_.Name.Trim(); 
        TotalCost = ($_.Group | Measure-Object TotalCost -Sum).Sum
    }
}

$AllocatedCost = ($PayerLineItems |  Where-Object {$_.'user:Project' -ne ''} | Measure-Object TotalCost -Sum).Sum
$UnallocatedCost = ($PayerLineItems |  Where-Object {$_.'user:Project' -eq ''} | Measure-Object TotalCost -Sum).Sum

$ProjectAllocation = $PayerLineItems | Where-Object {$_.'user:Project' -ne ''} | Group-Object -Property 'user:Project' | % { 
    $DirectCost = ($_.Group | Measure-Object TotalCost -Sum).Sum
    $AllocationRate = $DirectCost / $AllocatedCost
    $IndirectCost = $UnallocatedCost * $AllocationRate

    New-Object psobject -Property @{ 
        Project = $_.Name.Trim(); 
        DirectCost = $DirectCost;
        Overhead = $IndirectCost;
        TotalCost = $DirectCost + $IndirectCost
    }
}

$ProjectAllocation | Format-Table Project, DirectCost, Overhead, TotalCost -AutoSize

When you run this script, it should output the statement total and a table showing the costs allocated to each project, similar to the following.

Statement total = 2317.048424386140

Project   DirectCost          Overhead        TotalCost
-------   ----------          --------        ---------
ERP       468.191124  25.2253642231116 493.416488223112
CRM      1181.834959  63.6753149817962  1245.5102739818
DotCom    149.640772    8.062397561231 157.703169561231
ITOM      398.925069  21.4934236199978 420.418492619998

That's it for this post.  In the next post we use the hourly report to populate a warehouse in SQL Server.