Read About Stuff We Do. Pagelime Blog.

Amazon EC2 outage: how it affected us, and what we’ve done since

As most of you know by now, we were affected by the Amazon EC2 outage, which resulted in approximately a day of intermittent downtime for Pagelime. We’ve communicated openly about it with anyone who reached out, and we sent out a mass email with our personal cell numbers and email addresses. We wanted to stay open and available whenever you needed to reach us. I’m going to use this post to explain the details behind the outage.

Here’s what happened, how it affected us, and what we’ve done since to mitigate this issue:

  • The Northern Virginia (US-East) region of Amazon’s Elastic Compute Cloud (EC2) infrastructure had a major set of issues with its storage: the Elastic Block Store (EBS). EBS is meant to be a highly redundant form of storage with very low failure rates, where no single disk failure should affect the availability of the actual data. It turns out this isn’t exactly the case: the block store seems to have become unavailable across a large part of that region.
  • A number of web companies were affected, including Foursquare, Reddit, Quora, and HootSuite, to name a few. Many web apps, ours included, assumed the issue would be resolved promptly.
  • Amazon took about a day to repair the issue, at which point service was restored and things began to operate normally. This puts our current uptime at 99.4% for the year. We need this to be better, both for our users and for our own peace of mind.
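
For a sense of scale, the arithmetic behind an uptime percentage can be sketched as below. This is only an illustration: this post does not state the exact hours of downtime or the measurement window behind our figure, so the function names and the sample inputs are assumptions, not our monitoring data.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def uptime_percent(downtime_hours: float, period_hours: float) -> float:
    """Share of the period during which the service was up, as a percentage."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must lie between 0 and period_hours")
    return 100.0 * (1 - downtime_hours / period_hours)


def downtime_budget_hours(target_percent: float, period_hours: float) -> float:
    """Hours of downtime a period can absorb while still meeting the target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must lie between 0 and 100")
    return period_hours * (1 - target_percent / 100.0)


if __name__ == "__main__":
    # Hypothetical input: one full day of downtime measured over a whole year.
    print(round(uptime_percent(24, HOURS_PER_YEAR), 2))
    # Hypothetical target: how many hours a 99.9% goal would allow per year.
    print(round(downtime_budget_hours(99.9, HOURS_PER_YEAR), 2))
```

Measured over a full year, one whole day of downtime works out to roughly 99.73% uptime. The same outage measured over a shorter window, such as year to date, gives a lower figure.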

Here’s why we use Amazon AWS and not a custom-built hosting solution:

  • Amazon AWS is fast. It gives us really good response times, and the storage performs very well for the price, which allows us to keep costs down for our users while providing the best service.
  • Amazon AWS is highly available. It allows us to host servers in Virginia, Northern California, Ireland, and Asia Pacific at the same time, so that a Pagelime server can always be available close to where you are. The same goes for their Simple Storage Service (S3).
  • Amazon AWS is highly scalable. We can provision new computing and storage resources very quickly. It puts scaling into our hands.

An ideal setup with AWS should not have failed even in the extraordinary scenario we had over the past two days. Here’s why ours failed:

  • All of our data is stored in multiple availability zones (around the world), both in our databases and in our data files on S3. This worked like a charm: our database immediately failed over to an available instance, and our data was unaffected. This is good.
  • The same goes for servers hosted in multiple availability zones. Only the ones in the US-East zone were affected. This is good.
  • However, for speed, Pagelime caches all of the data files, such as content, images, and documents, on Elastic Block Store volumes… the very volumes that failed completely. This cache allows us to publish content quickly without a lot of round-tripping between the database and the servers. Losing it crippled us for a day, and we’re fixing it.
  • Pagelime also runs the publish engine from ONE single location. The reason is that we want publishing to always originate from one IP address, so that firewalls and hosts can whitelist us. This publish engine happened to be in the affected zone. We’re fixing this as well.

Soon after the outage began, we initiated plan B and started migrating our cache and publish engine to a different availability zone. This was great as an emergency response, but we want to be resilient to these failures in the future. Here’s what we’re doing to prevent this from happening again:

  • We are purging the publish cache. From now on, data will be published directly from the data store. This may result in longer load times when you press the publish button, or when you publish an image gallery, but it should reduce the potential for future failures. We unfortunately have to cut this performance optimization for the sake of reliability.
  • We are adding code to the Pagelime application that will fail over, in the software itself, to a different storage model should one appear to be failing.
  • We are creating a backup publish engine in a different part of the world. For those of you who have whitelisted our IP address in a firewall, we will send out the new IP as well, so that it can be added to your web host’s firewall.
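
To give a rough idea of what failing over in software can look like, here is a minimal sketch. It is not the actual Pagelime code: the names `read_with_failover`, `AllBackendsFailed`, and the two stand-in stores are hypothetical, and it assumes each storage model can be wrapped in a function that returns the data for a key or raises an exception.

```python
from typing import Callable, List, Sequence, Tuple

# A reader takes a key and returns the stored bytes, or raises on failure.
Reader = Callable[[str], bytes]


class AllBackendsFailed(Exception):
    """Raised when every configured storage backend fails for a key."""


def read_with_failover(
    key: str, backends: Sequence[Tuple[str, Reader]]
) -> Tuple[str, bytes]:
    """Try each backend in order; return (backend_name, data) from the first that works."""
    errors: List[str] = []
    for name, read in backends:
        try:
            return name, read(key)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc!r}")
    raise AllBackendsFailed(f"no backend could serve {key!r}: " + "; ".join(errors))


if __name__ == "__main__":
    def broken_block_store(key: str) -> bytes:
        # Hypothetical stand-in for a volume that has become unavailable.
        raise IOError("volume unavailable")

    def object_store(key: str) -> bytes:
        # Hypothetical stand-in for a healthy replicated store.
        return b"<h1>Hello</h1>"

    backends = [("block-store", broken_block_store), ("object-store", object_store)]
    print(read_with_failover("pages/index.html", backends))
```

The order of the list sets the preference: the fastest store goes first, and slower but more redundant stores only get used when the earlier ones raise an error.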

We’ve learned a lot from this. We were really proud of our cloud infrastructure, and of the speed and reliability we were getting for the price. After this incident we’re a bit sobered, and we realize that we need to put even more effort into it.

We’re grateful for the outpouring of support we’ve received from you via email. Thanks for standing by us – we’ll make sure to pay it back in kind.
