Automatic replacement of Autoscaling nodes with equivalent spot instances: seeing it in action

In the few days since my previous post I’ve had thousands of visitors, dozens of comments were posted, and a few brave souls were even audacious enough to give it a try. Many of you provided valuable feedback and bug reports, so thank you all, and keep the feedback coming!

I am now quite busy improving the software based on the feedback I’ve got so far and on some bugs I found on my own, but before I have anything ready to be released, I thought I should post some kind of HOWTO that shows how to install and set it up, demos the instance replacement process, and exposes the currently known issues you should expect when using it at this point.

It’s still not ready for production usage, but I’m working on it.

Installation

The initial setup is done using CloudFormation, so you will need to launch a new CloudFormation stack. You can launch the stack in any region where Lambda is supported; for example, us-east-1 (Virginia) is perfectly fine.

Using the AWS console

Follow the normal stack creation process shown in the screenshots below.

The template URL is all you need to set; it should be https://s3.amazonaws.com/cloudprowess/dv/template.json

[Screenshot: Stack creation based on my template]

Give the stack a name, then you can safely go through the rest of the process. You don’t need to pass any other parameters; just make sure you confirm everything you set so far and acknowledge that the stack may create some IAM resources on your behalf.

[Screenshot: Naming the stack]

If everything goes well the stack will start creating resources.

[Screenshot: Creating the stack]

And after a few minutes you should be all set.

[Screenshot: The stack is ready]

Using the AWS command line tools

If you have the AWS command line tools installed, you can also launch the stack from the command line, using the following command:

aws cloudformation create-stack \
  --stack-name AutoReplaceWithSpot \
  --template-url https://s3.amazonaws.com/cloudprowess/dv/template.json \
  --capabilities CAPABILITY_IAM

Configuration for an AutoScaling group

The installation using CloudFormation creates the required infrastructure, but your AutoScaling groups will not be touched unless you explicitly enable this functionality, which has to be done for each and every AutoScaling group you would like it to manage.

The managed AutoScaling groups can be in any AWS region: the algorithm runs against all regions in parallel, but it only handles the AutoScaling groups where it has been explicitly enabled.
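If you are ever unsure which of your groups are opted in, you can check from the shell. This is just an illustrative sketch; it filters on the spot-enabled tag described later in this post:

for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
  # List only the groups that carry the spot-enabled=true tag
  aws autoscaling describe-auto-scaling-groups --region "$region" \
    --query "AutoScalingGroups[?Tags[?Key=='spot-enabled' && Value=='true']].AutoScalingGroupName" \
    --output text
done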

Enabling it on an AutoScaling group is a matter of setting a tag on the group, as you can see below:

[Screenshot: Initial state of the AutoScaling group]
[Screenshot: Tagging the AutoScaling group where it is being enabled]

The AutoScaling group tags can also be set using the AWS command line tools or CloudFormation, as sketched below.
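For reference, the CLI variant is a one-liner; my-asg is a placeholder for your group’s name:

aws autoscaling create-or-update-tags --tags \
  ResourceId=my-asg,ResourceType=auto-scaling-group,Key=spot-enabled,Value=true,PropagateAtLaunch=false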

Walkthrough

Going forward I’m going to show what happens after enabling it on an AutoScaling group.

Once it has been enabled on an AutoScaling group, the next run will launch a compatible EC2 spot instance.

Note: the new spot instance is not yet added to any of your AutoScaling groups.

The new instance type is chosen based on multiple criteria, and as per the current algorithm (this is a known issue and may be fixed at some point) it may not be the cheapest across all the availability zones, but it will definitely be cheaper than, and at least as powerful as, your current on-demand instances.

As you can see below, it launched a bigger m3.medium spot instance in order to replace a t1.micro on-demand instance. This also means that you can get bigger instances, such as c3.large spot instances, as long as their price is the lowest among the instance types compatible with your base instance type.

[Screenshot: Initial state of the EC2 instances]
[Screenshot: Launching a new Spot instance, for now running outside the group]
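If you want to sanity-check the price comparison the algorithm makes, you can pull the current spot prices yourself. A hedged sketch, where the region and instance types are just examples:

aws ec2 describe-spot-price-history --region us-east-1 \
  --instance-types m3.medium c3.large \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[].[AvailabilityZone,InstanceType,SpotPrice]' \
  --output table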

The new instance’s launch configuration is copied, with very small modifications, from the one set on your on-demand instances, so the new instances will be configured as closely as possible to the instances previously launched by your AutoScaling group.

After the spot instance is out of its grace period (whatever was set on the AutoScaling group), it is added to the group, and an existing on-demand instance is terminated. It also gets automatically added to any load balancer configured for your AutoScaling group.

[Screenshot: Spot instance added to the group, replacing an on-demand instance]
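Under the hood this relies on the plain attach/detach mechanism, which you could also drive by hand. A rough sketch with placeholder instance IDs:

# Attach the spot instance once it has outlived the grace period
aws autoscaling attach-instances --auto-scaling-group-name my-asg \
  --instance-ids i-spot1234

# Detach one on-demand instance, keeping the desired capacity constant,
# then terminate it
aws autoscaling detach-instances --auto-scaling-group-name my-asg \
  --instance-ids i-ondemand42 --should-decrement-desired-capacity
aws ec2 terminate-instances --instance-ids i-ondemand42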

Known bug: at the moment, if the group is at its minimum capacity, the algorithm needs another run and temporarily increases the capacity in order to be able to replace an on-demand instance. This should be more or less harmless, assuming that the AutoScaling rules will eventually bring the capacity back to the previous level. Sometimes, though, this can interfere badly with your scaling policies, in which case you may enter a spinning AutoScaling condition. It can be mitigated by making the scale-down policy less aggressive, for example by setting a longer wait time after scaling down. This bug should be solved in the next release.
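One way to set that longer wait time is to raise the group’s cooldown; the 600-second value below is only an example:

aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg \
  --default-cooldown 600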

Continuing, in the next run, a second spot instance is launched outside the AutoScaling group:

[Screenshot: Second spot instance was launched]

Then, after the grace period has passed, it is added to the AutoScaling group, replacing another on-demand instance, which is detached from the group and terminated:

[Screenshot: Second on-demand instance was replaced]

This process repeats until you have no on-demand instances left running in the group, and you are only running spot instances.

If AutoScaling takes any scaling actions, like terminating any of the spot instances or launching new on-demand ones, we don’t interfere with it. But later we will attempt to replace any on-demand instances it may have launched in the meantime with spot equivalents, just as explained before.

Currently, due to the bug I mentioned previously, my setup ended up in a spinning state, but I managed to stabilize it by increasing the AutoScaling group’s scale-down cooldown period, and it eventually converged to this state:

[Screenshot: Final state]

Once I eventually release that bugfix, the group should converge to that state by itself, without any manual changes and much faster.

Conclusions

Many people commented asking how this solution compares with other automated spot bidders, such as the AWS-provided AutoScaling integration and the spot fleet API, as well as other custom/3rd-party implementations.

I think the main differentiator is the ease of installation and use, which you can see in this post. There are a few rough edges that will need some attention, but I’m working on it.

Please feel free to give it a try and report any issues you may face.


My approach at making AWS EC2 affordable: Automatic replacement of Autoscaling nodes with equivalent spot instances

Getting started

Last year, during the networking that followed one of the Berlin AWS meetup sessions I often attend, @freenerd from Mapbox mentioned the spot market: how much cheaper it is for them to run instances there, but also that, for their use case, it sometimes happened that the instances were terminated in the middle of the batch processing job that prepares the map of the entire world.

A few weeks later, at another session of the AWS meetup, I took part in a similar discussion where someone mentioned the possibility of attaching instances to an on-demand AutoScaling group, a feature AWS had just released at the time. I don’t remember if spot was mentioned in the same discussion, or if it was all in my mind, but somehow these concepts got connected and I thought this would be a nice problem to hack on.

I thought about the problem for a while, and after a couple of weeks I came up with an algorithm based on the instance attach/detach mechanism supported by AutoScaling. I tested it manually and quickly confirmed that AutoScaling happily allows attaching spot instances and detaching on-demand ones in order to keep the capacity constant, but that it often tries to rebalance the availability zones. So, in order for it not to interfere with the automation, the trick is to keep the group more or less balanced across availability zones, so that AutoScaling won’t try to rebalance it.

I soon started coding a prototype in my spare time; it is actually my first non-trivial program written in a while, and to make it even more interesting, I chose to write it in golang.

Slow progress

After a few weeks of coding, in which I rewrote it at least twice (and even now I’m still nowhere near happy with how it looks), I realized it’s quite a bit harder and more complex than I initially thought. Other things happened, I kind of lost interest, I stopped working on it, and it all got stuck.

A few months later, at the re:Invent conference, I attended some talks where I met other folks interested in this problem and saw other approaches to attacking it, using multiple AutoScaling groups. That was also when I first got in touch with someone from spotinst.com, who was trying to promote their solution and was sharing business cards.

After re:Invent I became a bit more active for a while. I also tried to find some collaborators, but failed, so I kept working on it in my spare time every now and then, and I got closer to getting it to work. Then I recently had a long vacation, and immediately after I returned I attended the Berlin AWS Summit, where I met the SpotInst folks once again. It seems they now have a full-fledged solution for the problem, based on pretty much a reimplementation of AutoScaling, using machine learning and with a beautiful UI, and they are really successful with it. Funnily enough, they even contacted me to sell that solution to my company, and we are seriously evaluating it :-)

Breakthrough

After the Berlin AWS Summit, with my batteries recharged, I resumed my work, and after a few coding nights I managed to make my prototype work. It took much longer than expected, but at least I got there, yay! :-)

What I have so far

  • A CloudFormation template that creates an SNS topic, a Lambda function written in golang (with a small JS wrapper that downloads and runs it) subscribed to the topic, and a few IAM settings to make it all work
  • A golang binary, for now closed source, but I’m going to open it up once I get it into good enough shape that I’m not ashamed of it, and after I get all the approvals from my corporate overlords, who according to my employment contract need to approve the publishing of such non-trivial code


How does it work

The Lambda function is executed by a custom CloudFormation resource when the stack is created from the template, and it subscribes itself to both the stack’s topic and a topic that I run, which fires it every 30 minutes using a scheduled event.

When the scheduled event triggers the Lambda function, it concurrently inspects the AutoScaling groups in all AWS regions, ignoring all those that are not tagged with the EC2 tags it expects.

The AutoScaling groups marked with the expected tag are processed concurrently, gradually replacing their on-demand instances with compatible spot instances, one at a time. Each run will either launch a single spot instance or attach a previously launched spot instance to the AutoScaling group, after detaching an on-demand one it is meant to replace. The spot instance is not attached while its uptime is less than the AutoScaling group’s grace period.

The spot instance bid price matches the price of the on-demand instance it is meant to replace. If your spot request is outbid, AutoScaling will handle it like a regular instance failure and will immediately replace it with an on-demand instance. That instance will later be replaced by the cheapest available compatible spot instance, likely of a different type and with a different spot price.

In practice the group should converge to the most stable instance pricing.
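Done by hand, the bidding would look roughly like this sketch, where the bid price stands in for the on-demand price of the instance being replaced and the AMI ID is a placeholder:

aws ec2 request-spot-instances --spot-price "0.067" \
  --launch-specification '{"ImageId": "ami-12345678", "InstanceType": "m3.medium"}'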

How to use it

All you need to do is set an EC2 tag on the AutoScaling group where you want to test it. Any other AutoScaling groups will be ignored.

The tag should have the following attributes:

Key: “spot-enabled”

Value: “true”

Update: If you want to see this in detail and also see it in action, please check out my next blog post.

Feedback is more than welcome

If you find any bugs or you would like to suggest any improvements, please comment below.

Warning

This is experimental and only summarily tested, likely full of bugs, so you should not run it in production, but it should be safe enough for evaluation purposes.

Anyway, use it at your own risk, and don’t hold me responsible for any misuse, bugs or damage this may cause you.

Switching to Dell Latitude e7240 from Lenovo x220, trying out Ubuntu 15.04

I’ve been using a Lenovo x220 as my work laptop for the last few years and I was pretty satisfied with it; it’s a nice piece of hardware.

At some point 2 years ago I had a quite bad bike accident (out of which I was lucky to escape with just an elbow fracture and a few bruises; it could have been much worse…), and the laptop had to absorb a lot of the shock, since I basically fell on my back while it was in my backpack. One of the corners was a bit bent and it got some cracks, but it was alive and kicking, until last week, when I finally decided I would like an upgrade.

I’ve had a few Ubuntu issues lately (Unity’s lock screen would often fail to allow me to enter a password, random error messages about program crashes, repeated password prompts, random Unity panel disappearances), which I blamed on my aged installation, but the last straw was that I ran out of disk space on my btrfs partition and the system got unusually slow. I soon saw kernel oops messages in dmesg, and soon after MCE errors as well, which to my knowledge is a sign that the hardware is dying.

So I just went to our awesome IT support guys (yeah, I’m not doing that myself at HERE, we have a dedicated team for it), and after a few minutes of chatter in which I explained what happened, they quickly gave me the latest from our offering in the similar range, a Dell Latitude e7240. I almost took a Lenovo X1 Carbon, but I ended up choosing the Dell because it has support for a real docking station, unlike the Carbon; that is a must if you have a dual-screen setup like I do.

In general, the hardware looks much more polished, the screen is really much better, and I can feel it’s slightly faster than the X220. The only things I miss, and quite badly, are the classic Lenovo keyboard and the TrackPoint, especially since the Dell trackpad sucks so badly.

I didn’t feel like reinstalling the OS from scratch, especially since I have a quite exotic setup (btrfs on LVM, on top of a full-disk LUKS volume) which took me a while to figure out manually a while back, and restoring from a backup would have been slightly slower, so I quickly copied my entire disk onto it using dd and netcat, which only took about 45 minutes to complete.
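For the curious, the copy boiled down to something like this; the device names, host name and port are examples, and netcat flags vary between flavors, so double-check before pointing dd at a disk:

# On the new laptop: listen on a port and write the stream to the raw disk
nc -l -p 9000 | dd of=/dev/sda bs=16M

# On the old laptop: stream the whole disk across the network
dd if=/dev/sda bs=16M | nc new-laptop 9000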

I immediately connected it to my screens using the docking station and tried to configure it so I could resume my actual work, only to notice that both external monitors were showing the same content and there was no way to separate them.

I did some research and ended up on some forums claiming this is a known driver bug of my graphics chip on kernels older than 3.17 (and the darn Ubuntu 14.10 comes with 3.16). After using the X220 for 3 years with no major driver issues, I really had better expectations from Intel drivers, especially for a year-old laptop running the latest available version of Ubuntu. The proposed solution was to update the kernel to at least 3.17, which I did immediately, only to notice that the new kernel failed to even boot, getting stuck while asking me for the passphrase to decrypt my LUKS volume.

Since I pretty much had no choice, I reverted to the previous kernel and decided to try updating Ubuntu to the next development release, to be launched next month as 15.04, which already comes with version 3.19 of the Linux kernel and should have the display problem solved.

I then had to free up some space to make room for the upgrade, and let it do its thing for a few hours.

Once it was ready, I connected the screens, and it all worked like a charm.

I then thought I’d give Gnome3 another chance, a few weeks after I last tried it, hoping it had improved, but I was quickly disappointed by its brainless behavior on a triple-screen setup, where having a fixed primary monitor set to the right-most laptop screen really makes no sense (I personally think the primary should follow the mouse, just like in Unity).

I might give KDE another try at some point and share my impressions about it, but I will stick with Unity for now, especially since I really love the way it reuses the top bars as menu bars on non-maximized windows, and in general it feels more polished than in 14.10.

As a bonus, I was pleasantly surprised that Evolution now has a smart push-notification-like update mechanism when used with Exchange, which makes it much more resource-efficient than before at checking for new emails.

I’ve been using it for a few days now and things seem decent, actually surprisingly stable for a development version, but I still see the Unity issues with the missing password field in the lock screen, and the panel still disappears from time to time, so those issues are still there and not fixed yet.

I’ll do a bit more research and hopefully I’ll get to the bottom of them soon and report the findings in another post.

Nokia N900 community software update fixes desktop annoyances

I’ve loved my N900 ever since I bought it; it’s a great device for a nerd like me. Still, as nothing is perfect in this world, it has some things that I don’t like that much about it.

Today I will address two of them: the fact that items on the desktop can be placed anywhere, without any alignment, and the fact that the desktop is forced to landscape mode, even though many applications also work in portrait mode.

I am glad to report that today I finally got both of these annoyances fixed on my beloved device, and here’s how I did it.

This morning I applied the latest Community SSU update, which, I soon found out, introduces support for portrait mode on the desktop. This is very nice stuff, and very easy to use. After applying the update, just rotate the device to portrait mode (while the keyboard is hidden) and you will see all of the content switch to portrait mode. This is not so nice at first, because everything is messed up, but you only need to move the items around, and after you switch back and forth between portrait and landscape modes, they will remember where you put them in both orientations. Problem solved!

Besides this issue, as I said, I never liked the fact that moving items was not constrained by anything on the N900 desktop. This looked especially bad after moving all my items to more or less acceptable positions in portrait mode so that they wouldn’t overlap; after that process, the desktop looked like hell, with all those icons unaligned. I soon got this problem fixed, after applying a suggestion I got from one of the people in the #maemo-ssu IRC channel. The solution was to edit /usr/share/hildon-desktop/transitions.ini and set the following options:

snap_grid_size = 20
snap_to_grid_while_move = 20

There’s currently no UI for these settings as far as I know, but I would really appreciate it if they were included in the cssufeatures application, if someone cares enough to do it.

Feel free to use any values you see fit, but in my case it worked just fine with 20. After rebooting the device, moving the items on the desktop aligns them to a grid, so my desktops look much better now, as you can see in the screenshots below.

The vertical screenshot could only be taken while the desktop was in edit mode, because otherwise the screen would switch to landscape when the keyboard is visible, and I needed the keyboard in order to take the screenshot. I know it can be done from the command line, but I was just too lazy.

I hope this is useful to someone. Feel free to post comments to this post containing additional fixes to annoyances you might encounter on this device.

Thank you for reading this and thanks to all the CSSU developers who made this possible.

Cristi

Update: It seems there was yet another minor CSSU release a few hours after the one I was talking about.

Update2: I now discovered that the cssufeatures application is incompatible with the manual changes I did to transitions.ini.


Preparing an Ubuntu image for serial console, on a hard-disk connected over USB

It’s been almost a year since my last post, hopefully I will be able to post more often from now on.

This time it’s a HOWTO on installing Ubuntu on a SATA disk drive connected over USB through a USB-to-SATA adapter, and then customizing Ubuntu so that all the boot messages and the console are directed to a serial port.

What am I trying to do?

You might ask yourselves why you would want to do that… Well, I don’t know about you, but I needed this in order to prepare my coreboot development environment on a motherboard that I will only access over the serial port or SSH. Now a bit of history… I’ve been in Berlin for the last three months as part of a business trip, sent by the notorious Finnish mobile phone company that I am working for. While I was at LinuxTag back in May, I finally met the coreboot developers I’ve been chatting with on IRC for the last 3 years, and I bought myself a coreboot-supported motherboard (Asrock E350M1) from one of the coreboot developers living in Berlin, Peter ‘CareBear\’ Stuge. I bought it because I’ve been planning for quite a while to build myself a home computer or set-top box for my TV back home, and this board seems to be perfect for the job. As a bonus, it is of course running coreboot and is quite hacker-friendly.

The coreboot support for this board is still work in progress, and although there are a few rough edges, the motherboard is running pretty well and booting up very fast (under 1 second to the Grub menu). Still, there are a few problems here and there, and as the coreboot developer I like to say I am (although my contributions to coreboot have been minor so far), I would like to help get this board better supported.

The preferred debugging mechanism of coreboot is the serial console, because it’s relatively easy to initialize and pretty common. Unfortunately this board doesn’t provide a serial port on the back panel, but it has a header with the required pins somewhere on the PCB.

Yesterday Peter and I spent a lot of time working on this board, trying to build a serial header for it and getting it up to speed for coreboot development. We bought some components, and then Peter built a nice serial-to-header adapter that also works as a null-modem serial cable, since I didn’t have a proper null-modem cable.

Then we tried to get an OS running on the board from an SSD drive, but unfortunately the image we had was not properly set up, so we decided to build a new OS installation.

Hardware Setup

As mentioned above, I have the Asrock motherboard, a serial-to-USB adapter and the custom serial header adapter made by Peter. Besides these, I also have a laptop and a portable laptop SATA hard-drive with a USB-to-SATA adapter.

Software Setup

I chose to do it with Ubuntu because it’s easy to set up, quick to install, and pretty nice for development. The hard-disk was connected over USB and I already had it partitioned, so I only reused the first partition already created there.

I reformatted the first partition to EXT4.

sudo mkfs.ext4 -L rootfs /dev/sdb1

Ubuntu then mounted the first partition to /media/rootfs after I double-clicked on it.
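The same thing can be done from the terminal, if you prefer:

sudo mkdir -p /media/rootfs
sudo mount /dev/sdb1 /media/rootfs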

Install the base Ubuntu packages in there. You can replace the architecture with i386 for a 32-bit OS, replace natty with another Ubuntu release, and choose a mirror closer to you.

sudo debootstrap --arch amd64 natty /media/rootfs http://de.archive.ubuntu.com/ubuntu/

After this is done, we can bind-mount some filesystems from the host, preparing for our chroot into the new Ubuntu install.

sudo mount -o bind /dev /media/rootfs/dev

sudo mount -o bind /proc /media/rootfs/proc

sudo mount -o bind /sys /media/rootfs/sys

And finally, chroot

sudo chroot /media/rootfs /bin/bash

Create some config files in the new system

cat << EOF > /etc/fstab
# device mount type options freq passno
LABEL=rootfs / ext4 defaults,errors=remount-ro 0 1
LABEL=swap none swap sw 0 0
EOF

echo coreboot > /etc/hostname

Set up networking for DHCP

echo -e "auto eth0\niface eth0 inet dhcp" > /etc/network/interfaces

Add "restricted universe multiverse" to the deb line you should have in /etc/apt/sources.list.
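The result should look something like this (assuming the same mirror and release as above); afterwards, refresh the package index:

cat << EOF > /etc/apt/sources.list
deb http://de.archive.ubuntu.com/ubuntu/ natty main restricted universe multiverse
EOF
apt-get update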

Install some vital packages

apt-get install linux-image grub-pc

Serial port configuration for Grub

Open /etc/default/grub with an editor.

Comment out

#GRUB_HIDDEN_TIMEOUT=0

Set

GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0,115200"

Add these two lines

GRUB_TERMINAL=serial
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"

Then you can update the grub configuration.

update-grub

Install grub on the hard-disk

grub-install /dev/sdb

Configure Linux console on the serial port

cat << EOF >  /etc/init/ttyS0.conf
# ttyS0 - getty
#
# This service maintains a getty on ttyS0 from the point the system is
# started until it is shut down again.

start on stopped rc RUNLEVEL=[2345]
stop on runlevel [!2345]

respawn
exec /sbin/getty -L 115200 ttyS0 vt102
EOF

Set a root password

passwd

Exit the chroot, unmount all the directories mounted there, connect the hard-disk and the serial cable to the motherboard and enjoy the new OS over the serial console.
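In concrete terms, the clean-up goes like this:

exit
sudo umount /media/rootfs/sys /media/rootfs/proc /media/rootfs/dev
sudo umount /media/rootfs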

~Cristi

Master of Puppets

Hi,

It’s been a long time since my previous post, and many good things have happened to me since. A few months ago I changed my job, moved to a new home, bought a bike, adopted a lovely little cat, and today I finally graduated from my MSc studies.

Lately I’ve been working on the thesis project, and I’m very glad it’s over. The project implemented a netinstall and configuration management system for a gLite cluster, all written in Puppet and available on my github page.

Now that I’ve finished this I can finally spend more time watching Star Trek or hacking on coreboot :-) I just got an old RTL8029AS NIC from an ex-colleague (thanks, Serban Cordis!) that I’ll try to get working as a remote debug console in coreboot and SerialICE, just like Rudolf Marek did a while ago.

Cristi

25th birthday

Hi,
Today I had my 25th birthday (and Christmas), and we all had a great time together.
I’ve been away for quite a while now; it’s been almost a week with no Internet connection, but it seems I was able to survive…
My wife and I went to Berlin for a Rammstein concert, and after driving more than 3000 km in 4 full days, we finally got home. The concert was great, even better than we expected, but the whole trip was very long and tiresome, so we had to rest quite a lot after we arrived. Thanks Paula, this was my best birthday present ever!

Merry Christmas to everyone!
Cristi