I’ve been using the Amazon Elastic Compute Cloud (EC2) to run our web servers at NetCrafters for almost a year now. It’s been an amazing experience, and I’ve kept a detailed account of many of the lessons learned.
The most recent challenge came when our servers started locking up for no apparent reason. The sites would still be responding, so we knew the LAMP stack was still limping along, but the servers were completely unresponsive via SSH. The only way to regain control was to issue a reboot through the Amazon API command line tools. Even this would sometimes take two or three attempts before the server would cycle.
Once I was able to log in again and look through the logs, there was absolutely nothing to go on... until finally a server locked up completely. Apache was not responding and we could not open a secure shell. Once the server was rebooted and came back up, there was an error in /var/log/dmesg:
kernel BUG at arch/i386/mm/pgtable-xen.c:306!
Unfortunately, it seems that very few people ever get to see this message, because they are never able to recover the server once it happens. I was lucky and had this tiny shred of evidence to go on. After hours of searching, I finally found that the kernel our instances were using (2.6.16) had been declared unstable for the High-CPU instance types. These instances were originally started as m1.small, and I had indeed converted them to c1.medium instances.
The Solution: Upgrade from the 2.6.16 kernel to 2.6.18 (to be exact: vmlinuz-2.6.18-xenU-ec2-v1.0). Unfortunately, this too is a hard-to-find procedure, thus the long-awaited point of this article:
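If you suspect the same problem on your own instance, searching the kernel log for that BUG line is a quick first check. This is only a minimal sketch: find_kernel_bug is a helper name I made up for illustration, and it assumes the log is at /var/log/dmesg as it was on my instances.

```shell
#!/bin/sh
# Search a log file for Xen page-table kernel BUG lines.
# find_kernel_bug is a hypothetical helper name; the default path
# /var/log/dmesg is where the error showed up in this article.
find_kernel_bug() {
    logfile="${1:-/var/log/dmesg}"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 2
    fi
    # -n prints line numbers so the surrounding context is easy to find.
    # grep exits non-zero when nothing matches.
    grep -n 'kernel BUG at .*pgtable-xen\.c' "$logfile"
}
```

Calling find_kernel_bug with no arguments checks /var/log/dmesg; pass another path to check a saved copy of the log.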
How to Upgrade an EC2 Instance to 2.6.18 from 2.6.16
First, I recommend reading about Amazon EC2 User Selectable Kernels. I will outline the steps I took here, but if you want to know what you’re doing (versus just typing exactly what I tell you to!), please read at least the first paragraph or so then come back.
You’ll need the latest API and AMI tools installed on the instance. If you don’t know how to do that, search the AWS developer forums; there are plenty of tutorials already there. If you’re running Debian, there is a great post on the forums here for installing the AMI tools as a Debian package (this worked flawlessly for me).
Start by launching an instance from your existing AMI (I’ll be using an arbitrary AMI, ami-99999999, as an example; be sure to replace it with your registered AMI), using the new kernel (specified by the --kernel aki-9b00e5f2 option) and the appropriate RAM disk (specified by the --ramdisk ari-67b95e0e option).
ec2-run-instances ami-99999999 -k gsg-keypair --kernel aki-9b00e5f2 --ramdisk ari-67b95e0e -t c1.medium
This will start a new instance using the vmlinuz-2.6.18-xenU-ec2-v1.0.i386 kernel (aki-9b00e5f2) and the initrd-2.6.18-xenU-ec2-v1.0.i386 RAM disk (ari-67b95e0e). These are both for a 32-bit instance. If you’re running 64-bit instances, you’ll want to use aki-9800e5f1 for the kernel and ari-64b95e0d for the RAM disk. You can see the list of available AMIs, AKIs, and ARIs by issuing the following command:
ec2-describe-images -o amazon
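That listing is long, and it is easy to mix up the 32-bit and 64-bit IDs. The sketch below may help keep them straight. Both function names are hypothetical helpers of mine, the IDs are the ones quoted in this article, and the filter simply keeps any line that mentions an aki- or ari- ID, so depending on your version of the tools it may also keep AMI lines that list their kernel.

```shell
#!/bin/sh
# Print the kernel (AKI) and RAM disk (ARI) IDs from this article for a
# given architecture. ids_for_arch is a hypothetical helper name.
ids_for_arch() {
    case "$1" in
        i386|i686) echo "aki-9b00e5f2 ari-67b95e0e" ;;
        x86_64)    echo "aki-9800e5f1 ari-64b95e0d" ;;
        *) echo "unknown architecture: $1" >&2; return 1 ;;
    esac
}

# Keep only the lines of a listing (read from stdin) that mention an
# aki- or ari- ID. lines_with_aki_or_ari is also a hypothetical name.
lines_with_aki_or_ari() {
    grep -E '(aki|ari)-[0-9a-f]+'
}
```

For example, ec2-describe-images -o amazon | lines_with_aki_or_ari narrows the list, and ids_for_arch x86_64 prints the 64-bit pair.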
The -t c1.medium option specifies the instance type (number of processors, RAM, etc.). Other instance types are available as well.
Once the instance is up and running, log in via SSH so we can install the new kernel modules.
First verify you’re actually running the new kernel:
ec2# uname -a
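If you would rather script that check than read the uname output by eye, here is a rough sketch. is_kernel_version is a hypothetical helper, and the prefix match on "2.6.18" is my assumption about how the release string begins.

```shell
#!/bin/sh
# Succeed if a kernel release string starts with the wanted version.
# is_kernel_version is a hypothetical helper; pass it the output of
# `uname -r` and the version prefix you expect.
is_kernel_version() {
    release="$1"
    wanted="$2"
    case "$release" in
        "$wanted"*) return 0 ;;
        *)          return 1 ;;
    esac
}

if is_kernel_version "$(uname -r)" "2.6.18"; then
    echo "running the 2.6.18 kernel"
else
    echo "still on $(uname -r); check the --kernel and --ramdisk options" >&2
fi
```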
Then, install the new kernel modules:
tar xzf ec2-modules-2.6.18-xenU-ec2-v1.0-i686.tgz -C /
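Before moving on, it may be worth confirming the modules landed where the running kernel will look for them. This sketch assumes the tarball unpacks into /lib/modules/ under a directory named after the kernel release, which is the usual layout but not something I have re-verified; modules_installed is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if a modules directory exists for the given kernel release.
# modules_installed is a hypothetical helper. Arguments:
#   $1  kernel release (defaults to the running kernel)
#   $2  filesystem root to check (defaults to /)
modules_installed() {
    release="${1:-$(uname -r)}"
    root="${2:-/}"
    # ${root%/} drops a trailing slash so "/" and "/mnt/" both work
    [ -d "${root%/}/lib/modules/$release" ]
}
```

Running modules_installed && echo "modules found" right after the tar command checks the running kernel.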
Next, we need to install udev. You may not need to do this, but we are running Debian servers and udev is required to make the new image bootable. Otherwise, none of the devices will mount and the new image will just hang at boot. This is really easy though:
apt-get install udev
# prevent udev from keeping its old ip-address
The next step is to rebundle and register your new image. The new image will use the 2.6.18 kernel and RAM disk by default. If you’re a bit rusty on this process, I recommend the EC2 Getting Started Guide. Once you’ve done this, launching the newly registered AMI will start your instance with the new 2.6.18 kernel and RAM disk.
NOTE: Each time you rebundle, you’ll need to re-issue the following command, or udev will keep its IP address and the new image will not get a new IP via DHCP at first boot:
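As a memory aid, the rebundle process looks roughly like the three commands below. To be safe, this sketch only prints the commands rather than running them. The bucket name, the /mnt paths, the pk.pem and cert.pem file names, and the environment variable names are all placeholders of mine, and print_rebundle_steps is a hypothetical helper, so check each command against the Getting Started Guide before using it.

```shell
#!/bin/sh
# Print (not run) the usual bundle, upload, and register commands.
# print_rebundle_steps is a hypothetical helper; every path, file name,
# and variable name below is a placeholder to replace with your own.
print_rebundle_steps() {
    bucket="$1"
    cat <<EOF
ec2-bundle-vol -d /mnt -k /mnt/pk.pem -c /mnt/cert.pem -u \$AWS_ACCOUNT_ID -r i386
ec2-upload-bundle -b $bucket -m /mnt/image.manifest.xml -a \$AWS_ACCESS_KEY_ID -s \$AWS_SECRET_ACCESS_KEY
ec2-register $bucket/image.manifest.xml
EOF
}
```

For example, print_rebundle_steps my-bucket prints the three commands with my-bucket filled in.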
# prevent udev from keeping its old ip-address
I wrote this a week or so after actually doing it, so I’m going mainly from rough notes and memory. If you hit a specific error while working through this, let me know; it’ll probably knock something loose and I’ll have an answer for you. It took me nearly 10 hours to piece together the steps for making this work. Once you have the steps in place, though, it only takes about 10 minutes. I sure hope to save someone else all that trouble.