Read About Stuff We Do. Pagelime Blog.

Content Management for Designers and Agencies.

Amazon EC2 outage: how it affected us, and what we’ve done since

As most of you know by now, we were affected by the Amazon EC2 outage, which resulted in approximately a day of intermittent downtime for Pagelime. We’ve communicated openly about it with anyone who reached out, and we sent out a mass email with our personal cell numbers and personal email addresses. We wanted to stay open and available whenever you needed to reach us. I’m going to use this post to explain the details behind the outage.

Here’s what happened, how it affected us, and what we’ve done since to mitigate this issue:

  • The Northern Virginia (US-East) region of Amazon’s Elastic Compute Cloud (EC2) infrastructure had a major set of issues with its storage service, the Elastic Block Store (EBS). EBS is meant to be a highly redundant form of storage with very low failure rates, where a single disk failure should not affect the availability of the actual data. It turns out this isn’t exactly the case: the block store seems to have become unavailable across a number of availability zones in that region.
  • A number of web companies were affected, including Foursquare, Reddit, Quora, and HootSuite, to name a few. Like us, a number of web apps assumed the issue would be resolved promptly.
  • Amazon took about a day to repair the issue, at which point service was restored, and things began to operate normally. This puts our current uptime at 99.4% for the year. We need this to be better for both our users and our peace of mind.
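
For a sense of how an uptime figure like this is computed, here is a small sketch. The post doesn’t give the exact downtime hours or the measurement window, so the 16 hours and 111 days below are assumed values chosen purely for illustration, and the function names are hypothetical.

```python
def uptime_percent(downtime_hours: float, period_hours: float) -> float:
    """Return availability over a period as a percentage."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (1.0 - downtime_hours / period_hours)


def downtime_budget_hours(target_percent: float, period_hours: float) -> float:
    """Return how many hours of downtime a target availability allows."""
    return period_hours * (1.0 - target_percent / 100.0)


# Assumed inputs (not from the post): 16 hours of downtime, 111 days elapsed.
period = 111 * 24
print(round(uptime_percent(16, period), 1))             # 99.4
# Downtime allowed over a full year by a 99.9% target.
print(round(downtime_budget_hours(99.9, 365 * 24), 2))  # 8.76
```

Under these assumptions, a 99.9% target leaves less than nine hours of downtime for a whole year, which is why a single day-long outage weighs so heavily on the number.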

Here’s why we use Amazon AWS and not a custom-brewed hosting solution:

  • Amazon AWS is fast. It gives us really good response times, and the storage performs very well for the price, which allows us to keep costs down for our users while providing the best service.
  • Amazon AWS is highly available. It allows us to host servers in Virginia, Northern California, Ireland, and Asia Pacific at the same time, so that a Pagelime server can always be available close to where you are. The same goes for their Simple Storage Service (S3).
  • Amazon AWS is highly scalable. We can provision new computing and storage resources very quickly. It puts scaling into our hands.

An ideal setup with AWS should not have failed, even in the exceptional scenario we had over the past two days. Here’s why ours failed:

  • All of our data, both our databases and our data files in S3, is stored in multiple availability zones (around the world). This worked like a charm: our database immediately failed over to an available instance, and our data was unaffected. This is good.
  • The same goes for our servers, which are hosted in multiple availability zones. Only the ones in the US-East region were affected. This is good.
  • However, for speed of use, Pagelime caches all of the data files, such as content, images, and documents, on Elastic Block Store volumes… the very volumes that failed completely. This cache allows us to quickly publish content without a lot of round-tripping between the database and the servers. Losing it crippled us for a day, and we’re fixing it.
  • Pagelime also runs the publish engine from ONE location. The reason is that we want publishing to always originate from one IP address, so that firewalls and hosts can whitelist us. This publish engine happened to be in the affected zone. We’re fixing this as well.

Soon after the outage happened, we initiated plan B and began to migrate our cache and publish engine to a different availability zone. This was great as an emergency response, but we want to be resilient to these failures in the future. Here’s what we’re doing to prevent this from happening again:

  • We are purging the publish cache. From now on, the data will be published directly from the data store. This may result in longer load times when you press the publish button, or when you publish an image gallery, but it should reduce the potential for future failures. We unfortunately have to cut this performance optimization for the sake of reliability.
  • We are adding code to the Pagelime application that will fail over, in the software itself, to a different storage model should one appear to be failing.
  • We are creating a backup publish engine in a different part of the world. For those of you who have whitelisted our IP address in a firewall, we will send out the new IP as well, so it can be added to your web host’s firewall.
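
The software-level fail-over described above could look something like the following. This is a minimal sketch, not Pagelime’s actual code: the `StorageUnavailable` exception, the `read_with_failover` function, and both backend functions are hypothetical names made up for illustration.

```python
from typing import Callable, Optional, Sequence


class StorageUnavailable(Exception):
    """Raised when a storage backend cannot serve a request."""


# A reader is any callable that takes a key and returns the stored bytes.
Reader = Callable[[str], bytes]


def read_with_failover(key: str, readers: Sequence[Reader]) -> bytes:
    """Try each storage backend in order and return the first successful read."""
    last_error: Optional[Exception] = None
    for reader in readers:
        try:
            return reader(key)
        except StorageUnavailable as exc:
            last_error = exc  # remember the failure and try the next backend
    raise StorageUnavailable(f"no backend could serve {key!r}") from last_error


# Hypothetical backends: a failing local volume and a healthy data store.
def read_from_local_volume(key: str) -> bytes:
    raise StorageUnavailable("volume not responding")


def read_from_data_store(key: str) -> bytes:
    return b"content for " + key.encode("utf-8")


backends = [read_from_local_volume, read_from_data_store]
print(read_with_failover("page.html", backends))  # b'content for page.html'
```

The order of the list is the design choice here: the fastest backend goes first, and slower but more durable ones follow, so a failing volume costs some speed rather than the whole publish.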

We’ve learned a lot from this. We were really proud of our cloud infrastructure, and of the speed and reliability we were getting for the price. After this incident we’re a bit sobered, and we realize that we need to put even more effort into it.

We’re grateful for the outpouring of support we’ve received from you via email. Thanks for standing by us – we’ll make sure to pay it back in kind.
