It took longer than I’d hoped, but I finally got map tiling moved over to EC2. To celebrate, I thought I’d write a follow-up to my EC2 first thoughts.
The Problem of Persistence
I knew that EC2 instances did not maintain state after being shut down, but I didn’t think it would be a big deal. Shows what I know.
Dealing with a machine that forgets itself every time it spins down is very difficult. Want to add a new apt package? Install a gem? All is good until you flip the power switch and then it’s gone. The obvious and recommended solution is to persist your data in S3. In my case, I chose to do this by re-bundling the image every time I made a change. I had to do this probably 4 times, and while not difficult, it sure was tedious.
Trying to plan ahead, I even added an svn update to our tiler script startup, so as to get the latest code from subversion each time the instance spun up. A few days later we decided to move to git, meaning I’ll have to add git-core to our image and then re-bundle it to S3. Again, more tedium.
The other option is to keep a base image, and re-install all your packages and add-ons on each startup. This is the philosophy of ec2onrails. Each time your instance spins up, you install any number of apt packages and gems via remote ssh script.
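To make the install-on-startup idea concrete, here is a minimal sketch of building the one command a remote ssh script might run on a fresh instance. It is not ec2onrails's actual code; the package lists and the `bootstrap_command` name are my own placeholders.

```ruby
require "shellwords"

# Hypothetical package lists; the real ones depend on what your image needs.
APT_PACKAGES = %w[git-core imagemagick]
GEMS = %w[net-ssh]

# Build one shell command that installs everything a fresh instance needs.
# Names are shell-escaped so an odd package name cannot break the command.
def bootstrap_command(apt_packages, gems)
  steps = []
  unless apt_packages.empty?
    steps << "apt-get update"
    steps << "apt-get install -y #{apt_packages.shelljoin}"
  end
  steps << "gem install #{gems.shelljoin}" unless gems.empty?
  # Chain with && so a failed step stops the rest from running.
  steps.join(" && ")
end

puts bootstrap_command(APT_PACKAGES, GEMS)
```

Every package on those lists adds to the time between spin-up and useful work, which is the cost to weigh against re-bundling.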
Ultimately, I decided against this option for our map tiling. Considering that I expected each instance to be up for only an hour, and that users were waiting for their maps, I wanted to minimize startup time. So, it’s bundle, add something I forgot, and re-bundle.
Keeping track of it all
If you’re planning on lots of starts and stops, you’re going to need a way to handle it automatically. The grempe-amazon-ec2 gem makes the actual commands fairly painless, but you’ll need to store and track the instances you’re running.
One option is to use the EC2 gem directly. While it’s fairly easy to use, dealing with the resulting XML each time gets to be a pain. So, I went ahead and created an ec2_instances table and tracked our instances in the database. We store the private key, instance id, startup time, current state, and a little extra info as well. Using all this (and updating it periodically) allows us to track and manage our instances, making sure that we’re shutting down each one we spin up. Our nightmare scenario was a runaway startup loop that leaves us with thousands of idle instances, but it looks like I programmed it pretty well. If I’m wrong, I’ll have to sell a kidney when our EC2 bill arrives.
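As a rough illustration of the bookkeeping, here is an in-memory sketch of the kind of record an ec2_instances table might hold. The hard cap on running instances is my own assumption about one way to guard against a runaway startup loop, not a description of the production code, and the class and method names are hypothetical.

```ruby
# One row of a hypothetical ec2_instances table, held in memory here.
InstanceRecord = Struct.new(:instance_id, :private_key, :started_at, :state)

class InstanceRegistry
  class LimitReached < StandardError; end

  def initialize(max_running)
    @max_running = max_running
    @records = {}
  end

  # Record a new instance, refusing if the cap would be exceeded.
  def register(instance_id, private_key, started_at)
    if running.size >= @max_running
      raise LimitReached, "already tracking #{running.size} instances"
    end
    @records[instance_id] =
      InstanceRecord.new(instance_id, private_key, started_at, "pending")
  end

  # Called periodically with whatever state EC2 last reported.
  def update_state(instance_id, state)
    @records.fetch(instance_id).state = state
  end

  # Anything not yet terminated still costs money, so count it as running.
  def running
    @records.values.reject { |r| r.state == "terminated" }
  end
end

registry = InstanceRegistry.new(2)
registry.register("i-example1", "PRIVATE-KEY-1", Time.now) # made-up id and key
registry.update_state("i-example1", "running")
puts registry.running.size
```

Checking the cap before every launch means a buggy loop fails loudly at the limit instead of quietly piling up idle machines.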
Who’s in Charge?
So once you get your instances spinning up, it’s time to decide who’s in charge of starting the work. In our case we had a tiler daemon we wanted to run. The obvious choice is to put a script in /etc/init.d and set it to run automatically on startup.
However, since I like to make things hard, I decided to use a script that would ssh in and start the tiler from there. It gave me a chance to play with the Ruby net-ssh library and learn a little about public key authentication.
Theoretically, since we create a private key for each instance we spin up, we should be able to ssh in using this key. If I use OpenSSH from the command line, everything works great. Unfortunately, there’s a bug in net-ssh that requires you to have the public key in addition to the private key, making it useless for my purposes. So, I gave up and went back to using password authentication.
Anyways, we had to decide who would be in charge of determining when to spin down an instance. Originally, I intended for the tiler daemon to handle this itself, but I got a little spooked. If the daemon crashed out, the instance would never shut down, and we’d be dropping $2.40 a day for an idle machine.
To combat this, I added a cron job on our main web server that manages the EC2 instances. Every 5 minutes it checks if we have a tiler instance running, and shuts it down if it’s idle and our hour is almost up. It turned out to be a good choice, although the tiler daemon has yet to error out, so either way would have worked well.
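The check the cron job makes can be sketched in a few lines. This assumes the instance is billed by the hour from its start time; the 10-minute window and the method names are hypothetical choices for illustration, not the values we actually run with.

```ruby
BILLING_PERIOD = 3600      # seconds; assumes billing by the instance-hour
SHUTDOWN_WINDOW = 10 * 60  # hypothetical: act within the last 10 minutes

# Seconds left before the instance rolls into another billed hour.
def seconds_left_in_hour(started_at, now)
  elapsed = (now - started_at).to_i
  BILLING_PERIOD - (elapsed % BILLING_PERIOD)
end

# Shut down only if the tiler is idle and the paid hour is nearly used up.
def shut_down?(started_at, now, idle)
  idle && seconds_left_in_hour(started_at, now) <= SHUTDOWN_WINDOW
end

started = Time.now - (55 * 60) # pretend the instance launched 55 minutes ago
puts shut_down?(started, Time.now, true)
```

With a check every 5 minutes, a 10-minute window leaves room for at least one run to land inside it before the next hour starts. An idle instance early in its hour is left alone, since that time is already paid for and a new job may arrive.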
Estimate, Multiply by 2
My overall impression of EC2 is very positive. It’s much more complicated than S3, and even FPS, but I suppose that’s to be expected. Considering the incredible flexibility, a little complexity is unavoidable.
When determining how much time it’s going to take you to do something new with EC2, estimate and multiply by 2. There are lots of little steps and new things to learn, and tying all the pieces together ended up taking much more time than I expected. Still, I’ve come out with a good understanding of EC2, and now I’m itching to find a project that would allow me to leverage the true power of on-demand cloud computing.