21/4/2011

An excellent list of things to do when it all goes wrong. This morning Amazon EC2/RDS in the US-EAST region went to crap. Crashpadder is down (bummer), and so are Quora, Reddit and Foursquare. I’d add a few to the list:

  • Your users don’t know what any of it means. Get your comms team out with a simple, clear message about what’s going on and how it affects the things that matter most to your users (in our case, their pending bookings and response rates).

  • Get in your helicopter. From up in the air, it’s rarely as bad as it seems. It’ll be over in a few hours (touch wood), and you’ll learn the lessons needed.

  • Record what you do. You’ll be hacking a few things, trying to get everything back up and working. When you come to the blissful nirvana of the coffee-fuelled system restore, you need to know exactly what you’ve changed, and think about what else is affected.
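
On that last point, here’s a rough sketch of the kind of thing I mean: a tiny append-only log you can call every time you change something by hand. It’s only an illustration, not what we actually run. The file name `incident_log.jsonl` and the `record` and `read_log` helpers are made up for the example.

```python
import json
import time
from pathlib import Path

# Hypothetical default location for the log; pick whatever suits you.
LOG_PATH = Path("incident_log.jsonl")


def record(action, affects="", path=LOG_PATH):
    """Append one timestamped entry describing a manual change."""
    entry = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "action": action,    # what you changed
        "affects": affects,  # what else might be touched by it
    }
    # Append mode, so earlier entries are never overwritten.
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def read_log(path=LOG_PATH):
    """Return every recorded entry, oldest first."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Example use during an outage (the actions here are invented):
#   record("Pointed app at standby database", affects="pending bookings")
#   record("Paused outgoing email queue")
#   for e in read_log():
#       print(e["time"], e["action"], e["affects"])
```

One line per change is enough. The point is that when things are back, you can read the file top to bottom and undo or double-check each step.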

Anyone else got any tips to share?

Posted by danhilltech