Cutting our AWS spend in half

by Alex Lawn

It often seems that, by default, our AWS monthly spends go in only one direction - up.

Our AWS cost reduction strategy has evolved over time, through the work we’ve done for our clients and their cloud platforms, and on our own AWS application stacks. There's a massive amount of devil in the detail. We're proud to share that thanks to this work, we've halved our monthly AWS bills over the last eight months. This comes without compromising the high availability stacks that we have engineered, or sacrificing any application performance.

It's been quite a journey, with lots of things being looked at in parallel. Here's how we did it.

S3 File systems

One of our common cloud-hosted applications is Moodle LMS (and its cousin, Totara). These applications usually need a networked, shared file system to store application file assets. We were using triple-AZ redundant GlusterFS servers, each with a dedicated EBS volume storing an entire copy of the application data. On larger sites this meant a significant storage footprint: 3TB x 3 EBS SSD volumes, coming in at $1080 USD per month, with additional costs for non-production environments and backups. We needed a better way.

So, we developed an alternative file storage implementation for Moodle that moves the majority of the data files into S3 object storage. This code is available as a plugin here https://github.com/catalyst/moodle-tool_objectfs

Production site data now costs only $75 per month instead of over $1000. With some clever S3 bucket permissions, it’s also possible to allow non-production application instances access to the production bucket, so we don’t have to duplicate data storage across prod, dev, test, uat and staging. The result is both cost savings and considerable operational convenience.
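As a rough check on the figures above, the arithmetic can be sketched in Python. The per-GB prices here are assumptions, back-calculated from the totals quoted in this post ($1080 for 3 x 3TB of EBS SSD, $75 for 3TB of S3). They are not quoted AWS prices, so check current pricing for your region.

```python
# Assumed per-GB monthly prices, back-calculated from the totals in this post.
EBS_SSD_PER_GB_MONTH = 0.12   # assumption: $1080 / (3 x 3000 GB)
S3_PER_GB_MONTH = 0.025       # assumption: $75 / 3000 GB


def monthly_storage_cost(size_gb, price_per_gb, copies=1):
    """Return the monthly cost in USD of storing `copies` copies of `size_gb`."""
    return round(size_gb * price_per_gb * copies, 2)


gluster_cost = monthly_storage_cost(3000, EBS_SSD_PER_GB_MONTH, copies=3)
s3_cost = monthly_storage_cost(3000, S3_PER_GB_MONTH)
print(f"GlusterFS on EBS: ${gluster_cost:.2f}, S3: ${s3_cost:.2f}")
```

The saving comes from two places: a lower per-GB price, and storing one copy instead of three.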

Bandwidth to S3

After initially migrating most of our file storage into S3, we were surprised to find our costs went up, not down. After some investigation, we saw that 27TB of traffic from our EC2 instances was going via our VPC NAT Gateways at $0.059 per GB. This cost around $1500 USD! We solved this by enabling the VPC endpoint for S3, which made the S3 pathway zero rated and eliminated almost all of our NAT Gateway traffic. Result! And another example of the curly nature of AWS cost optimisation.
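For readers who script their infrastructure, here is a minimal sketch of requesting an S3 gateway endpoint with boto3. It is not how we necessarily did it (the console or CloudFormation work just as well), and the VPC and route table IDs are hypothetical placeholders. The request is built separately from the call so it can be inspected without AWS credentials.

```python
def s3_gateway_endpoint_request(vpc_id, region, route_table_ids):
    """Build the keyword arguments for EC2's CreateVpcEndpoint call."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, region, route_table_ids):
    """Create the endpoint; needs boto3 installed and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_vpc_endpoint(
        **s3_gateway_endpoint_request(vpc_id, region, route_table_ids)
    )


# Hypothetical IDs, for illustration only.
request = s3_gateway_endpoint_request(
    "vpc-0123456789abcdef0", "ap-southeast-2", ["rtb-0123456789abcdef0"]
)
print(request["ServiceName"])
```

The route tables listed should be the ones used by the subnets whose instances talk to S3, otherwise that traffic keeps flowing through the NAT Gateway.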

Storage and EBS Volumes

AWS has introduced several new block volume types over the last year, including cold storage and throughput-optimised volumes, both of which are much cheaper than SSD-backed storage. In February 2017 Elastic Volumes was introduced, which allows us to change an EBS volume type and increase volume size without any fuss.

This lets us provision volumes closer to the size of the contents without requiring pre-provisioning for future growth, as it's so easy to grow volumes. It also lets us convert volumes to cheaper and slower types when we don’t need the disk IO throughput performance.

There are still several pitfalls to be aware of when working with Elastic Volumes. After changing a volume size or type, another change cannot be made for six hours. Cold storage and throughput-optimised volumes have a 500GB minimum size, and EBS volumes cannot be shrunk. Also, the slower sc1 and st1 volumes are prone to exhausting their IO burst credits. If this happens the disk IO will slow dramatically, and you will need to use a faster disk type. This can be a little slow to diagnose if you haven’t seen it before and aren’t looking for it.
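The size and timing pitfalls above are easy to check before submitting a change. The following is a small sketch of such a pre-flight check, using only the limits stated in this post. The function name and inputs are hypothetical, and AWS may have changed the limits since.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=6)           # wait between modifications, per this post
MIN_SIZE_GB = {"sc1": 500, "st1": 500}  # minimum sizes, per this post


def volume_change_problems(current_size_gb, new_size_gb, new_type,
                           last_modified, now):
    """Return a list of reasons the requested change would be rejected."""
    problems = []
    if new_size_gb < current_size_gb:
        problems.append("EBS volumes cannot be shrunk")
    if new_size_gb < MIN_SIZE_GB.get(new_type, 0):
        problems.append(f"{new_type} volumes must be at least "
                        f"{MIN_SIZE_GB[new_type]}GB")
    if last_modified is not None and now - last_modified < COOLDOWN:
        problems.append("previous modification was under six hours ago")
    return problems


now = datetime(2017, 6, 1, 12, 0)
print(volume_change_problems(200, 300, "sc1", now - timedelta(hours=2), now))
```

An empty list means none of these three rules is violated. It says nothing about burst credits, which need monitoring rather than a static check.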

Carefully going through all our EBS volumes to optimise the size and type of each volume has resulted in savings of about $250 per month.

EBS Snapshots

AWS volume snapshots are a powerful tool for backing up production volumes easily. However, it can be difficult to calculate all the costs at a completely granular level, and it’s very easy to take too many snapshots. In the Sydney region we are charged $0.055 per GB/month for unique blocks of data. More frequent snapshots of an EBS volume won’t necessarily blow our bill, as AWS only charges for the block-level changes between each snapshot. In reality, however, we have seen a lot of snapshot strategies where far too much data is being stored, even taking into consideration that we pay only for block deltas.
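To make the block-delta pricing concrete, here is a deliberately simplified estimate. It assumes every changed block is unique and nothing is ever deleted, so it is an upper bound rather than a billing calculator. The volume size and churn figures are hypothetical.

```python
SNAPSHOT_PER_GB_MONTH = 0.055  # Sydney price quoted in this post


def snapshot_storage_cost(first_snapshot_gb, changed_gb_per_snapshot):
    """Estimate monthly cost: the first snapshot plus each later delta."""
    stored_gb = first_snapshot_gb + sum(changed_gb_per_snapshot)
    return round(stored_gb * SNAPSHOT_PER_GB_MONTH, 2)


# Hypothetical volume: 500GB in the first snapshot, then 30 daily snapshots.
low_churn = snapshot_storage_cost(500, [5] * 30)     # 5GB changed per day
high_churn = snapshot_storage_cost(500, [100] * 30)  # 100GB changed per day
print(low_churn, high_churn)
```

The number of snapshots is the same in both cases. What drives the bill is how much data changes between them.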

We found a few old EBS snapshots lying around that were easy to eliminate, saving about $800 per month. We have seen with our clients similar scenarios where 'backup' snapshots get forgotten. Further savings will come from reducing the volume of data in snapshots and the rate of change of our data snapshots.
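Finding forgotten snapshots is mostly a matter of sorting by age. The sketch below filters records shaped like the entries in the EC2 DescribeSnapshots response. The sample records and the 90-day threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def snapshots_older_than(snapshots, max_age_days, now):
    """Return snapshot records whose StartTime is older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [s for s in snapshots if s["StartTime"] < cutoff]


# Hypothetical records, shaped like entries in the DescribeSnapshots response.
now = datetime(2017, 6, 1, tzinfo=timezone.utc)
snapshots = [
    {"SnapshotId": "snap-old", "VolumeSize": 500,
     "StartTime": datetime(2016, 1, 15, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-new", "VolumeSize": 500,
     "StartTime": datetime(2017, 5, 30, tzinfo=timezone.utc)},
]
stale = snapshots_older_than(snapshots, max_age_days=90, now=now)
print([s["SnapshotId"] for s in stale])
```

Treat the output as a list of candidates to review, not a list to delete blindly. Some old snapshots are old on purpose.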

EBS Snapshots to S3

Even with savings from eliminating older EBS snapshots, we were still spending over $2000 per month on EBS snapshots that are historical backups. Most of the data on these snapshots was unique, so there was little saving from shared blocks. S3 pricing is less than half the price of an EBS snapshot per gigabyte. So by using s3-parallel-put https://github.com/mishudark/s3-parallel-put we were able to upload the contents of these snapshots into S3. Once the EBS snapshots were deleted, we saved around $1000 per month after taking the increased S3 storage costs into account. We expect this will improve over time as the bucket policies slowly move objects into Infrequent Access storage and eventually Glacier.
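The tiering into Infrequent Access and Glacier is driven by a lifecycle configuration on the bucket. Here is a minimal sketch of one. The rule name and the 30 and 90 day thresholds are assumptions for illustration, not the values we use.

```python
import json


def archive_lifecycle_rules(ia_after_days=30, glacier_after_days=90):
    """Build an S3 lifecycle configuration that tiers objects down over time."""
    return {
        "Rules": [
            {
                "ID": "archive-old-backups",   # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},      # apply to every object
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = archive_lifecycle_rules()
print(json.dumps(config, indent=2))
```

With boto3, a dictionary of this shape can be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration.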

This is another example of why object storage is often a much more cost-effective way to archive data, but it’s not always a simple like-for-like when moving from block storage snapshot models into object storage.

Cross VPC Bandwidth

Traffic from one AWS VPC to another over VPC peering incurs a cost. In one of our clusters, a set of webserver nodes was using an NFS mount from another VPC, which resulted in a lot of cross-VPC traffic. Consolidating everything into a single VPC has eliminated this traffic entirely. All the more reason to engage experienced cloud network engineers when architecting cloud stacks.

Cross Availability Zone traffic

As part of the High Availability architecture requirements of our application stacks, we build them across AWS Availability Zones (AZs). Traffic from one EC2 instance to another EC2 instance in the same availability zone is free, but when that traffic crosses to another AZ a cost is incurred. When things like application load balancing or constant replication cross AZs, traffic volumes can be high and the AWS spend can rise considerably.

Our primary network fileshare is a replicated GlusterFS volume with a node in each AZ. This is in line with the 'Architect for Failure' policy espoused by AWS Solutions Architects. GlusterFS has an obscure mount option available called read-subvolume-index. This mount option allows you to hint to GlusterFS that it should use the local Availability Zone's node for read operations if it’s available. This single mount option has saved us around $1500 per month in network traffic that no longer needs to leave an Availability Zone.

The full mount option in /etc/fstab is

datanode-1:/sitedata /var/lib/sitedata glusterfs defaults,_netdev,fetch-attempts=6,backupvolfile-server=datanode-2,xlator-option=*.read-subvolume-index=2 0 0

Here the read-subvolume-index needs to be different in each availability zone to match the correct GlusterFS server.
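Because the index differs per availability zone, it can help to generate the fstab entry rather than hand-edit it. The following sketch renders the line above from a template. The AZ-to-index mapping is hypothetical: which index is local to which AZ depends on the brick order of your volume, and we are assuming here that indexes count from 0.

```python
FSTAB_TEMPLATE = (
    "{server}:/sitedata /var/lib/sitedata glusterfs "
    "defaults,_netdev,fetch-attempts=6,backupvolfile-server={backup},"
    "xlator-option=*.read-subvolume-index={index} 0 0"
)

# Assumption: this mapping is hypothetical and must match your brick order.
AZ_TO_INDEX = {"ap-southeast-2a": 0, "ap-southeast-2b": 1, "ap-southeast-2c": 2}


def fstab_line(az, server="datanode-1", backup="datanode-2"):
    """Render the fstab entry for a client in the given availability zone."""
    return FSTAB_TEMPLATE.format(server=server, backup=backup,
                                 index=AZ_TO_INDEX[az])


print(fstab_line("ap-southeast-2c"))
```

In practice this would live in whatever configuration management tool builds the web nodes, keyed on the AZ each instance launches into.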

RDS Snapshots

Over the last six months RDS Snapshot costs have been slowly creeping up as the number and size of our RDS instances grow. RDS Snapshots are important as they are part of the database PITR (Point In Time Recovery) that AWS offers.

To reduce the costs here we removed legacy snapshots from long-dead databases, and set non-production RDS instances to store snapshots for only seven days instead of 31, or disabled their snapshots entirely.
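The retention change is a single setting per instance. As a sketch, the request for the RDS ModifyDBInstance call could be built like this. The instance name is hypothetical, and keeping production at 31 days is our assumption for the example.

```python
# Retention in days; the production value is an assumption for illustration.
RETENTION_DAYS = {"production": 31, "non-production": 7}


def backup_retention_request(db_instance_id, environment):
    """Build keyword arguments for the RDS ModifyDBInstance call."""
    return {
        "DBInstanceIdentifier": db_instance_id,
        "BackupRetentionPeriod": RETENTION_DAYS[environment],
        "ApplyImmediately": True,
    }


# Hypothetical instance name, for illustration only.
request = backup_retention_request("moodle-staging", "non-production")
print(request)
```

With boto3 this would be sent as boto3.client("rds").modify_db_instance(**request). A retention period of 0 disables automated backups, and with them point-in-time recovery.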

The cost savings here were minimal, but every bit counts.

Turning things off

AWS and other cloud infrastructure solutions give us the flexibility to launch infrastructure on demand. Their success is a testimony to this. And it’s a better world than in the past, when we had a long procurement cycle before we could deploy anything into a data centre.

During our month-long crusade to get AWS costs under control we identified the following services that could be turned off. We challenge any organisation to look closely at its own infrastructure and get a bit brutal.

  • 1x t2.medium EC2 instance in a region we don’t generally use, a relic of long-past load testing

  • 2x m4.large RDS instances, spun up for testing purposes and not used in months

  • 2x ElastiCache nodes that could be consolidated into an existing one

  • 1x Elasticsearch endpoint that never indexed a single document

  • Several hundred gigabytes of unused EBS volumes, left behind where an autoscale group didn’t have "terminate EBS volumes on shutdown" set

  • 4x unused ELBs that were getting no traffic and no longer had DNS pointing at them

  • 1x VPN endpoint that was no longer connected

  • 1x test autoscale group with 2x t2.micro machines that was no longer needed

Individually, none of these items was particularly big or expensive, but put together they all add up.
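Some of these leftovers are easy to find mechanically. As one example, unattached EBS volumes show up with a state of "available". The sketch below filters records shaped like entries in the EC2 DescribeVolumes response; the sample data is hypothetical.

```python
def unattached_volumes(volumes):
    """Return volumes that are not attached to any instance."""
    return [v for v in volumes if v["State"] == "available"]


# Hypothetical records, shaped like entries in the DescribeVolumes response.
volumes = [
    {"VolumeId": "vol-web", "Size": 100, "State": "in-use"},
    {"VolumeId": "vol-orphan-1", "Size": 200, "State": "available"},
    {"VolumeId": "vol-orphan-2", "Size": 150, "State": "available"},
]
orphans = unattached_volumes(volumes)
print(sum(v["Size"] for v in orphans), "GB unattached")
```

Other items on the list, such as an idle load balancer or an empty search index, need traffic or usage metrics rather than a simple state check.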

Direct Connect

Given a large portion of the traffic from our AWS account is to our office, a Direct Connect connection could potentially save us several hundred dollars a month. Direct Connect traffic is 1/3 the price of regular outgoing traffic, with the overhead of having to find an ISP with a Direct Connect offering, and paying for the Direct Connect port itself. We did the maths, and the potential savings aren't big enough to justify the time spent, but it's worth looking into.
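The maths in question is a simple break-even calculation. Here is a sketch of it. Every input value below is hypothetical, including the traffic volume, the per-GB price and the port cost, so substitute figures from your own bill and ISP quote. Only the 1/3 ratio comes from this post.

```python
def direct_connect_monthly_saving(egress_gb, internet_price_per_gb,
                                  port_cost, isp_cost):
    """Return the net monthly saving (negative means Direct Connect costs more)."""
    dx_price_per_gb = internet_price_per_gb / 3  # "1/3 the price", per this post
    transfer_saving = egress_gb * (internet_price_per_gb - dx_price_per_gb)
    return round(transfer_saving - port_cost - isp_cost, 2)


# Hypothetical inputs; substitute figures from your own bill and ISP quote.
saving = direct_connect_monthly_saving(
    egress_gb=6000, internet_price_per_gb=0.12, port_cost=220, isp_cost=0
)
print(saving)
```

The calculation leaves out the engineering time to set the link up and maintain it, which is what tipped the decision for us.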

ICE

Ice is a tool that was open-sourced by Netflix. It allows easy visualisation and examination of your detailed AWS billing, and lets you drill into the spending patterns on a daily basis. It uses the AWS detailed billing reports which get uploaded to an S3 bucket. There are a number of other as-a-service offerings, such as CloudCheckr and AWS Trusted Advisor. But remember, these tools only help you decide what to do; they don’t do it for you!

 

Using Reserved Instances

The final avenue for cost savings, once all of our infrastructure is of a suitable capacity, has been to invest in Reserved Instances. This means we're committing to a certain AWS usage, and receiving a discount for this. EC2 usage costs are only 25% of our bill, with the rest coming from storage, traffic and other AWS services. Still, a 30% reduction in price is welcome.
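It is worth working out what that discount means for the bill as a whole. The sketch below assumes, for simplicity, that the 30% discount applies to all of our EC2 usage, which is an upper bound since not every instance is worth reserving.

```python
def overall_bill_reduction(ec2_share, ec2_discount):
    """Fraction of the total bill saved when only the EC2 share is discounted."""
    return ec2_share * ec2_discount


reduction = overall_bill_reduction(ec2_share=0.25, ec2_discount=0.30)
print(f"{reduction:.1%} of the total bill")
```

So Reserved Instances on EC2 alone trim at most about 7.5% off the total, which is why the storage and traffic work described earlier mattered more.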

One thing we have noticed when working with our clients to bring down their AWS bills via the purchase of reserved instances, is that they can have difficulty understanding the model, and they can be hesitant to commit to the large spend for a sizeable Reserved Instance purchase. Believe it or not, Sysadmins and DevOps engineers are not accountants. And there is always an element of risk when committing to a particular AWS service size.

We've discovered that reducing AWS spend is often not any one person's job. We are all acting as the owners of the business and any reduction in our AWS spend is good for our profits.

Google Cloud has just launched a region in Australia, and we think that the Google model for savings is far better. If you run a Google compute instance for an entire month, you incur a 30% saving. No forward planning required, and no risk of a long term reservation going unused should requirements change.

We've found reserving instances for EC2, RDS and ElastiCache initially resulted in a higher one-off charge; however, over the coming months this should save us several thousand dollars.

Conclusion

The most important thing in managing your AWS spend is paying attention. Get the knowledgeable people in a room to review the total spend. This will be hard the first few times, but you will almost certainly identify some potential savings.

There is no silver bullet to solve all your AWS cost problems. It’s most likely going to be a combination of vigilance and some pragmatic AWS usage policy. Also, don’t discount the value in engaging an AWS Partner to review your infrastructure usage.

There are no doubt more strategies than are mentioned in this blog, and we'll be looking at other things we can do to further reduce our bill over time.

Hope this is of use to you out there.

Alex Lawn is one of the founding members of the Team Cloud, Catalyst’s dedicated cloud consultancy initiative.