AutoSpotting now handles complex launch configurations when replacing your EC2 instances with cheaper spot ones, and also got open-sourced.

Later Update: The code is now available on GitHub: https://github.com/cristim/autospotting

Today I finally reached a great milestone: for the first time I was able to get AutoSpotting to provide spot instance replacements for AutoScaling groups of on-demand instances that have full-blown real-life configurations, with things like IAM roles, enhanced monitoring and attached IP addresses.

For those of you who are not familiar with it, have a look at this presentation as well as my previous blog posts, which explain what it does and present it in more detail.

In addition, for those folks who are still running stuff on EC2-Classic environments, the latest build now also supports EC2-Classic security groups, which means that EC2-Classic works as well.

In order to test all this for real, I enabled it on an existing development environment running on EC2-Classic, which from the infrastructure perspective happens to be configured almost identically to those that are serving https://maps.here.com, and I’m happy to say that it worked like a charm:

autospotting

In the image above you can see a screenshot taken while replacing the group’s instances.

Notice how in eu-west-1c we actually got an m1.medium instance. It was chosen in order to spread across multiple instance types, because at that time we already had an m3.medium instance in that Availability Zone, and running the same instance type on too many machines may become risky.

Currently the algorithm prefers the cheapest instance type, but in order to avoid placing all the eggs in the same basket, when we have more than 20% of the total group’s capacity of the same spot instance type within a single Availability Zone, the next cheapest instance type from that zone is chosen in order to reduce the chance of simultaneous failures of too many instances in case of sudden price fluctuations.
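The rule above can be sketched in a few lines of Python. This is only an illustration and not the actual AutoSpotting code: the function name, the input shapes and the prices are all made up, and I am assuming the 20% check is applied to the number of instances already running before the new one is launched, which is what the 15-instance example in the update notes further down suggests.

```python
from collections import Counter

MAX_SHARE_PERCENT = 20  # the threshold quoted in the post


def pick_spot_type(candidates, running, zone, group_capacity):
    """Return the cheapest candidate type that is not over-represented in `zone`.

    candidates: list of (instance_type, spot_price) pairs for this zone
                (hypothetical input shape, prices are invented)
    running:    list of (instance_type, zone) pairs for spot instances
                that are already part of the group
    """
    per_type = Counter(t for t, z in running if z == zone)
    for instance_type, _price in sorted(candidates, key=lambda c: c[1]):
        # Integer arithmetic avoids float rounding at exactly 20%.
        if per_type[instance_type] * 100 < MAX_SHARE_PERCENT * group_capacity:
            return instance_type
    return None  # nothing diverse enough; the caller decides what to do


# Hypothetical prices; an m3.medium already runs in eu-west-1c.
candidates = [("m3.medium", 0.010), ("m1.medium", 0.012)]
running = [("m3.medium", "eu-west-1c")]
print(pick_spot_type(candidates, running, "eu-west-1c", group_capacity=3))
```

With a group of three instances, one m3.medium in the zone already reaches the limit, so the sketch falls back to the next cheapest type, which is the behaviour seen in the screenshot.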

To make things even more interesting, during the replacement process one of the new spot instances failed to be fully configured and didn’t become healthy when its grace period was over (we just happen to have an overkill setup process running at instance startup which sometimes fails to finish during the allotted grace time), so it was terminated by AutoScaling immediately after being added to the group. AutoScaling soon replaced the failed instance with another on-demand instance, later to be replaced by a new cheaper spot one. But eventually the group converged to a fully spot configuration.

Also, because the group’s scaling policy is currently based on CPU usage and has a quite low threshold, in the middle of all this replacement process a high CPU alarm fired due to the high load caused by the bootstrap of one of the new spot instances. As a result another new instance was launched by AutoScaling, only to be replaced by a new spot instance that was later terminated by a subsequent scale-in operation.

Eventually all this churn ended, and a group that would previously cost about $98 on a monthly basis would now cost less than $17 assuming the price remains stable, which is more than 5.5 times cheaper in the long term.
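For the curious, the arithmetic behind those figures is shown below as a tiny Python sketch. It uses the two rounded monthly costs quoted above, with $17 as an upper bound for the spot cost, so the real savings would be slightly better.

```python
def savings(on_demand_monthly, spot_monthly):
    """Return (how many times cheaper, percentage saved)."""
    ratio = on_demand_monthly / spot_monthly
    percent = (1 - spot_monthly / on_demand_monthly) * 100
    return ratio, percent


ratio, percent = savings(98.0, 17.0)
print(f"{ratio:.2f}x cheaper, {percent:.1f}% saved")  # 5.76x cheaper, 82.7% saved
```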

So all in all it looks pretty good and reliable enough for dev environments (but I wouldn’t immediately put it in production) and it allows for huge cost savings. Feel free to give it a try using these instructions and let me know if you have any issues.

Before anyone asks, the software is not yet open sourced, but the review process is advancing fast and some important approvals are already there, so it’s now a matter of just a few more weeks.

Many of the latest improvements were developed with a lot of help from @nmeierpolys. His bug reports, suggestions and patience during multiple rounds of testing were priceless, and I am very thankful for all his contributions.

Known issues:

  • It is currently broken for environments where the instances are set up depending on information stored in their EC2 tags. This is because the instance tags are currently set on the new instances very late, at the same time as the new instance is added to the AutoScaling group. So in case the user_data script depends on information derived from the instance tags, that information would very likely be missing at the time the instance runs the user_data script, and the instance would fail to be configured. I am planning to set the EC2 tags much earlier, but your user_data script shouldn’t rely on them being there when the instance starts.
  • The issue mentioned above was fixed as of July 17. The EC2 tags are now set as soon as the new spot instances are launched.

Automatic replacement of Autoscaling nodes with equivalent spot instances: seeing it in action

Over the last few days since my previous post I’ve had thousands of visitors, dozens of comments were posted and a few brave souls were even audacious enough to give it a try. Many of you provided valuable feedback and bug reports, so thank you all and keep the feedback coming!

I am now quite busy improving the software based on the feedback I’ve got so far and also on some bugs I found on my own. But before I have anything ready to be released, I thought I should post some kind of HOWTO that shows how to install it and set it up, includes a demo of the instance replacement process, and also lists the currently known issues you should expect when using it at this point.

It’s still not ready for production usage, but I’m working on it.

Installation

The initial setup is done using CloudFormation, so you will need to launch a new CloudFormation stack. Since the stack creates a Lambda function, due to a Lambda limitation you can only launch the stack in us-east-1 (Virginia), but the stack can handle resources in all the other regions available to normal AWS accounts. For multiple reasons, at the moment the Beijing and GovCloud regions are unsupported.

Using the AWS console

Follow the normal stack creation process shown in the screenshots below.

The template URL is all you need to set. It should be https://s3.amazonaws.com/cloudprowess/dv/template.json

installationCloudFormation
Stack creation based on my template

Give the stack a name, then you can safely go through the rest of the process. You don’t need to pass any other parameters, just make sure you confirm everything you set so far and acknowledge that the stack may create some IAM resources on your behalf.

installationCloudFormation2
Naming the stack

If everything goes well the stack will start creating resources.

installationCloudFormation4
Creating the stack

And after a few minutes you should be all set.

installationCloudFormation5
The stack is ready

Using the AWS command line tools

If you have already installed the AWS command line tools, you can also launch the stack with the following command:

aws cloudformation create-stack \
--stack-name AutoReplaceWithSpot \
--template-url https://s3.amazonaws.com/cloudprowess/dv/template.json \
--capabilities CAPABILITY_IAM

Configuration for an AutoScaling group

The installation using CloudFormation will create the required infrastructure, but your AutoScaling groups will not be touched unless you explicitly enable this functionality, which has to be done for each and every AutoScaling group which you would like to manage.

The managed AutoScaling groups can be in any other AWS region. The algorithm will run on all the regions in parallel, handling AutoScaling groups if and only if it was enabled for them. It makes no difference if your group is running EC2 Classic or VPC instances, since both are supposed to be supported. If you notice any issues when testing it in your setup, that’s likely a bug and should be reported.

Enabling it on an AutoScaling group is a matter of setting a tag on the group:

Key: spot-enabled
Value: true

This can be configured with the AWS command-line tools using this command:

aws autoscaling create-or-update-tags \
--tags ResourceId=my-auto-scaling-group,ResourceType=auto-scaling-group,Key=spot-enabled,Value=true,PropagateAtLaunch=false
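If you prefer scripting this from Python, the same tag can be set with boto3. The snippet below is a minimal sketch, not part of the tool: the helper names are mine, and it assumes boto3 is installed and your AWS credentials are configured. The tag dictionary mirrors the fields of the command shown above.

```python
def spot_enabled_tag(group_name):
    """Build the tag in the shape boto3's create_or_update_tags expects."""
    return {
        "ResourceId": group_name,
        "ResourceType": "auto-scaling-group",
        "Key": "spot-enabled",
        "Value": "true",
        "PropagateAtLaunch": False,
    }


def enable_spot(group_name, region):
    """Tag the group; needs boto3 and valid AWS credentials (assumed)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("autoscaling", region_name=region)
    client.create_or_update_tags(Tags=[spot_enabled_tag(group_name)])


# Example call (the group name and region are placeholders):
# enable_spot("my-auto-scaling-group", "eu-west-1")
```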

If you use the AWS console, follow the steps that you can see below:

beforeAutoScaling
Initial state of the AutoScaling group
enabling
Tagging the AutoScaling group where it is being enabled

The tag isn’t required to be propagated to the new instances, so that checkbox can remain empty.

The AutoScaling group tags can also be set using CloudFormation. Just insert this snippet into your AutoScaling group’s configuration:

"MyAutoScalingGroup": {
  "Properties": {
    "Tags":[
    {
      "Key": "spot-enabled",
      "Value": "true",
      "PropagateAtLaunch": false
    }
    ]
  }
}

Walkthrough

Going forward I’m going to show what happens after enabling it on an AutoScaling group.

Once it is enabled on an AutoScaling group, the next run will launch a compatible EC2 Spot instance.

Note: the new spot instance is not yet added to any of your AutoScaling groups.

The new instance type is chosen based on multiple criteria. With the current algorithm it may not be the cheapest across all the availability zones (this is a known issue and it may be fixed at some point), but it will definitely be cheaper and at least as powerful as your current on-demand instances.

As you can see below, it launched a bigger m3.medium spot instance in order to replace a t1.micro on-demand instance. This also means that you can get bigger instances, such as c3.large spot instances, as long as their price is the lowest among the instance types compatible with your base instance type.

beforeInstances
Initial state of the EC2 instances
spotInstanceStarted.png
Launching a new Spot instance, for now running outside the group
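To make the selection rule described above more concrete, here is a small Python sketch. It is not the real implementation: I am assuming that "at least as powerful" means at least as many vCPUs and at least as much memory, the spot prices are invented for the example, and unlike the algorithm at the time of writing it always returns the cheapest match from the list it is given.

```python
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    vcpus: int
    memory_gib: float
    spot_price: float  # hypothetical hourly price in USD


def cheapest_compatible(base, on_demand_price, catalog) -> Optional[InstanceType]:
    """Pick the cheapest type that is no weaker than `base` and costs less
    than what the on-demand instance currently costs."""
    compatible = [
        t for t in catalog
        if t.vcpus >= base.vcpus
        and t.memory_gib >= base.memory_gib
        and t.spot_price < on_demand_price
    ]
    return min(compatible, key=lambda t: t.spot_price, default=None)


# Instance sizes are the published ones; all prices are made up.
t1_micro = InstanceType("t1.micro", 1, 0.613, 0.009)
catalog = [
    InstanceType("m3.medium", 1, 3.75, 0.008),
    InstanceType("c3.large", 2, 3.75, 0.015),
    t1_micro,
]
choice = cheapest_compatible(t1_micro, on_demand_price=0.02, catalog=catalog)
print(choice.name)
```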

The new instance’s launch configuration is copied with very small modifications from the one set on your on-demand instances, so the new instances will be configured as closely as possible to the instances previously launched by your AutoScaling group.

Note: We try to copy everything, including the user_data script, EC2 security groups (both VPC and Classic), IAM roles, instance tags, etc. If you notice any gaps, please report those as bugs.

After the spot instance is launched and has been running for longer than its grace period (whatever was set on the AutoScaling group), it will be added to the group, and an existing on-demand instance will be terminated.

The AutoScaling group also adds it automatically to any load balancer configured for the group, so the instance will soon start receiving traffic. In case of instances that start handling requests as soon as their user_data script finished executing, like for example if you are processing data from an SQS queue, that may have already happened a while back, so the instance may already be in use even before being added to the group.

duringInstances
Spot instance added to the group, replacing an on-demand instance

Known bug: At the moment, if the group is at its minimum capacity, the algorithm needs another run and temporarily increases the capacity in order to be able to replace an on-demand instance. This should be more or less harmless, assuming that the AutoScaling rules will eventually bring the capacity back to the previous level. Sometimes this can interfere badly with your scaling policies, in which case you may enter a spinning AutoScaling condition. It can be mitigated by tweaking the AutoScaling scale-down policy to make it less aggressive, for example by setting a longer wait time after scaling down. This bug should be solved in the next release. Update: this bug has since been fixed.

Continuing, in the next run, a second spot instance is launched outside the AutoScaling group:

secondSpotInstance.png
Second spot instance was launched

Then, after the grace period passed, it is added to the AutoScaling group, replacing another on-demand instance that is detached from the group and terminated:

secondSpotInstanceAdded
Second on-demand instance was replaced

This process repeats until you have no on-demand instances left running in the group, and you are only running spot instances.

If AutoScaling takes any scaling actions, like terminating any of the spot instances or launching new on-demand ones, we don’t interfere with it. But later we will attempt to replace any on-demand instances it might have launched in the meantime with spot equivalents, just as explained before.

Currently, due to the bug I mentioned previously, my setup ended up in a spinning state, but I managed to stabilize it by increasing the AutoScaling group’s scale-down cooldown period, and it eventually converged to the state shown below. (Update: this bug has since been fixed.)

result.png
Final state

Once I eventually release that bugfix, the group should converge to that state by itself, without any changes and much faster. Update: this issue has since been fixed, so the instances should now be replaced smoothly.

Conclusions

Many people commented asking how this solution compares with other automated spot bidders, such as the AWS-provided AutoScaling integration and the spot fleet API, as well as other custom/3rd party implementations.

I think the main differentiator is the ease of installation and use, which you can see in this post. There are a few rough edges that will need some attention, but I’m working on it.

Please feel free to give it a try and report any issues you may face.

Please see my initial post for announcements about software updates.

My approach at making AWS EC2 ~80% cheaper: Automatic replacement of Autoscaling nodes with equivalent spot instances

Note

AutoSpotting, the tool described here, has since been open sourced and is now available on GitHub.

This post merely documents the development history of the tool. It is seriously outdated and only kept here for historical reasons. It was written in April 2016 and describes the state of the tool at that moment. Please refer to the GitHub page for more up-to-date information about the current state of the project.

Getting started

Last year, at one of the sessions of the Berlin AWS meetup, which I often attend, @freenerd from Mapbox talked about the spot market during the networking after the event: how much cheaper it is for them to run instances there, but also how, for their use case, the instances were sometimes terminated in the middle of the batch processing job that prepares the map for the entire world.

A few weeks later, at another session of the AWS meetup, I participated in a similar discussion where someone mentioned the possibility of attaching instances to an on-demand AutoScaling group, which was a feature just released by AWS at that time. I don’t remember if spot was mentioned in the same discussion, or if it was all in my mind, but somehow these concepts got connected and I thought this was a nice problem to hack on.

I was thinking about the problem for a while, and after a couple of weeks I came up with an algorithm based on the instance attach/detach mechanism supported by AutoScaling. I tested it manually and I quickly confirmed that AutoScaling happily allows attaching spot instances and detaching on-demand ones in order to keep the capacity constant, but that it often tries to rebalance the availability zones, so in order for it not to interfere with the automation, the trick is to try to keep the group more or less balanced across availability zones, so that AutoScaling won’t try to rebalance it.

I soon started coding a prototype in my spare time, which is actually my first non-trivial program written in a while, and to make it even more interesting, I chose to write it in golang.

Slow progress

After a few weeks of coding, in which I rewrote it at least twice (and even now I’m still nowhere near being happy with how it looks), I realized it’s quite a bit harder and more complex than I initially thought. Other things happened and I kind of lost interest, so I stopped working on it and it all got stuck.

A few months later at the re:invent conference I attended some talks where I met some other folks interested in this problem and I saw other approaches to attacking it, with multiple AutoScaling groups. That was also when I first got in touch with someone from SpotInst who was trying to promote their solution and was sharing business cards.

After re:invent I became a bit more active for a while. I also tried to get some collaborators but failed at it, so I kept working on it in my spare time every now and then, and I got closer to getting it to work. Then I recently had a long vacation, and immediately after I returned I attended the Berlin AWS Summit, where I met the SpotInst folks once again. It seems they now have a full-fledged solution based on pretty much a reimplementation of AutoScaling as a SaaS, and they are apparently successful with it. This motivated me to work even harder on this, since my solution is simpler, cleaner and just as effective as theirs.

Breakthrough

After the Berlin AWS Summit, having my batteries charged, I resumed my work and after a few coding nights I managed to make my prototype work. It took much longer than expected, but at least I got there, yay! 🙂

What I have so far

  • An easy to install CloudFormation template that creates an SNS topic, a Lambda function written in golang (with a small JS wrapper that downloads and runs it) that is subscribed to the topic, and a few IAM settings to make it all work (update: this was largely simplified since)
  • A golang binary, for now closed source (update: not anymore), but I’m going to open it up once I get it in a good enough shape so that I’m not ashamed of it and after I get all the approvals from my corporate overlords, who according to my employment contract need to approve the publishing of such non-trivial code

How does it work

The lambda function is executed by a custom CloudFormation resource when creating the CloudFormation stack from the template, and it subscribes to both your topic and a topic that I run, which fires it every 30 minutes, using a scheduled event.

When my scheduled function runs the lambda function, it will concurrently inspect the AutoScaling groups from all the AWS regions and it will ignore all those that are not tagged with the EC2 tags it expects.
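The filtering step could look roughly like the Python sketch below. It is only an illustration: the function name is mine, and I am assuming the groups are dictionaries shaped like the entries boto3 returns from describe_auto_scaling_groups, where each group carries a list of Key/Value tags. The tag it looks for is the spot-enabled tag described in the getting started section below.

```python
def is_spot_enabled(group):
    """True if the group carries the tag spot-enabled=true."""
    return any(
        tag.get("Key") == "spot-enabled" and tag.get("Value") == "true"
        for tag in group.get("Tags", [])
    )


# Hypothetical sample data, in the shape of the AutoScaling API response.
groups = [
    {"AutoScalingGroupName": "web", "Tags": [{"Key": "spot-enabled", "Value": "true"}]},
    {"AutoScalingGroupName": "db", "Tags": [{"Key": "team", "Value": "data"}]},
    {"AutoScalingGroupName": "batch", "Tags": []},
]
enabled = [g["AutoScalingGroupName"] for g in groups if is_spot_enabled(g)]
print(enabled)  # ['web']
```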

The AutoScaling groups marked with the expected tag will be processed concurrently, on each of them gradually replacing the on-demand instances with compatible spot instances, one at a time. Each run will either launch a single spot instance or attach a launched spot instance to the AutoScaling group, after detaching an on-demand one it is meant to replace. The spot instance is not attached while its uptime is less than the Autoscaling group’s grace period.
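The per-run decision described above can be sketched as a small Python function. This is my simplified reading of the paragraph rather than the actual golang code: the function name, the argument names and the returned action labels are all made up for the example.

```python
from datetime import datetime, timedelta


def next_action(on_demand_count, spot_launched_at, grace_period, now):
    """Decide what a single run does for one enabled group.

    spot_launched_at is the launch time of a spot instance that is running
    outside the group, or None if there is no such instance.
    """
    if on_demand_count == 0:
        return "done"  # nothing left to replace
    if spot_launched_at is None:
        return "launch-spot"  # start a replacement outside the group
    if now - spot_launched_at < grace_period:
        return "wait"  # uptime is still below the group's grace period
    return "swap"  # detach an on-demand instance, attach the spot one


now = datetime(2016, 4, 27, 12, 0)
grace = timedelta(minutes=5)
print(next_action(2, None, grace, now))
print(next_action(2, now - timedelta(minutes=2), grace, now))
print(next_action(2, now - timedelta(minutes=10), grace, now))
```

Each run performs at most one of these actions per group, which is why the replacement is gradual and takes several runs to complete.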

The spot instance bid price matches the price of the on-demand instance it is meant to replace. If your spot request is outbid, AutoScaling will handle it as a regular instance failure, and will immediately replace it with an on-demand instance. That instance will later be replaced by the cheapest available compatible spot instance, likely of a different type and with a different spot price.

In practice the group should converge to the most stable instance pricing, in the long term saving about 80% from the normal on-demand EC2 price.

How to use it/Getting started

All you need to do is set an EC2 tag on the AutoScaling group where you want to test it. Any other AutoScaling groups will be ignored.

The tag should have the following attributes:

Key: “spot-enabled”

Value: “true”

See my next blog post for a full installation and runtime walkthrough where you can see very detailed instructions on how to get started.

Feedback is more than welcome

If you find any bugs or you would like to suggest any improvements, please get in touch on gitter or file an issue on GitHub.

Warning

This is experimental, summarily tested and likely full of bugs, so you should not run it on production, but it should be safe enough for evaluation purposes.

Anyway, use it at your own risk, and don’t hold me responsible for any misuse, bugs or damage this may cause you.

Update: many of these issues were ironed out since and the tool is currently stable enough for production use cases. It is already in use in dozens of companies, where it is already generating considerable savings. Feel free to also give it a try and report your feedback on gitter.

Later Updates

  • 27 Apr 2016
    • if you want to see it in action, please also check out my next blog post which walks you in detail through the installation and setup process, also explaining the currently known issues and their workarounds as of the time of the writing of this post.
  • 5 May 2016
    • bug fixes since the previous update
      • no longer spinning the AutoScaling group when running at minimum capacity
      • increased runtime frequency to once every 5min to make it converge faster
    • currently known issues
      • when first enabling it on a group with multiple on-demand nodes, it sometimes may launch extra spot instances that do not get added to the AutoScaling group(up to as much as the initial size of the group). workaround: terminate them manually from the AWS console. They will not be re-launched
      • spinning condition when the AutoScaling group is set to a fixed size (Min=Max). Workaround: set Max to Min+1 and disable any scaling actions you may have configured, in order to keep the group at the minimum capacity
    • things currently being worked on
      • fixes for the known issues mentioned above
      • major under-the-hood code refactoring in preparation of open sourcing
      • choosing instance types that are unlikely to be terminated in the near future, based on historical stats data sourced from the Spot Bid Advisor
      • mark spot instances as protected from termination by AutoScaling
      • if you have any other suggestions please write a comment below.
  • 10 July 2016
    • new features
      • improved algorithm for picking the new spot instance type
        • always launch a new spot instance in the same zone as an existing on-demand instance. Previously the zone was the one where we had the fewest instances, which often may have been one where we had no running instances and no way . This was causing the bugs about spinning and the launch of additional spot instances that were not added to the group.
        • allow multiple spot instances of a given type for each availability zone, as long as their total number is less than 20% of the total capacity from the group. For example a group of 15 instances using 3 availability zones will allow for 3 identical instances per availability zone, but the fourth instance from a zone will be of a different instance type.
      • Lambda wrapper updates
        • rewrote the Lambda wrapper in Python, which makes it more maintainable, since I’m much better at Python than at JavaScript
        • Implement versioning for the binary blob, by downloading the latest version only if not already there, based on the content SHA hash
      • CloudFormation cleanup
        • remove the SNS topic that was never used
      • support the new Mumbai region(still needs testing)
      • internal code refactoring
        • lots of cleanups that make it more maintainable
        • improved logging
    • bug fixes since the previous update
      • it should no longer start additional spot instances, the final capacity should match the original on-demand capacity, unless there were any AutoScaling actions.
      • fixed spinning condition with fixed-size AutoScaling groups by temporarily increasing the group during the replacement process
      • fix choosing of the cheapest compatible/redundant enough spot instance type, previously any cheaper instance type may have been chosen, not necessarily the cheapest.
      • lots of other small bugfixes for various edge cases
    • things currently being worked on
      • open sourcing process was started, and I already got some of the required approvals
      • figuring out how to implement automated testing
    • backlog
      • choosing instance types that are unlikely to be terminated in the near future, based on historical stats data sourced from the Spot Bid Advisor
      • mark spot instances as protected from termination by AutoScaling
  • 13 July 2016
    • Mostly bugfixes, many thanks to @nmeierpolys for some very valuable bug reports, fast feedback and a lot of patience while testing it.
      • improved conversion of on-demand launch configuration fields into spot launch configuration equivalents. In addition to user_data, SSH keys, EBS volumes and many other mostly trivial to convert fields that were previously handled, the following more complex fields should now also be better handled, which makes it work in many more real-life environments:
        • EC2 Classic security groups
        • detailed instance monitoring
        • associating public IP addresses
      • It was successfully tested on complex EC2-classic and VPC setups where many of these fields were being used.
      • Compatibility notice: In the likely event that you are using IAM roles on your instances, you need to update to the latest version of the CloudFormation template, since the launch of such spot instances would otherwise fail due to missing IAM permissions required to run instances set up with IAM roles. Again thanks to @nmeierpolys for finding this issue and proposing the fix.
  • 18 July 2016
    • Improve handling of storage volumes.
      • Bugfix: Fix panic while copying EBS storage configurations.
      • New feature: Implement a compatibility check for storage volumes based on the number of ephemeral disk volumes present in the launch configuration. For example, an instance whose launch configuration attaches a couple of ephemeral SSD drives of a certain size would only be replaced by instance types that provide at least as many SSD devices, each of at least the same size, in order not to violate the storage expectations of the new instances.
  • This is largely outdated, current information about the project can be seen on GitHub