21 4 / 2011
CTOs - When things go wrong, a checklist.
An excellent list of things to do when it all goes wrong. This morning Amazon EC2/RDS in the US-EAST region went to crap. Crashpadder is down (bummer), and so are Quora, Reddit and Foursquare. I’d add a few to the list:
- Your users don’t know what any of it means. Get your comms team out with a simple, clear message about what’s going on and how it affects the things that matter most to your users (in our case, their pending bookings and response rates).
- Get in your helicopter. From up in the air, it’s rarely as bad as it seems. It’ll be over in a few hours (touch wood), and you’ll learn the lessons needed.
- Record what you do. You’ll be hacking a few things, trying to get everything back up and working. When you reach the blissful nirvana of the coffee-fuelled system restore, you need to know exactly what you’ve changed, and think about what else is affected.
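If it helps, the "record what you do" habit can be made almost free with a tiny append-only log. This is only a rough sketch in Python, assuming a local file is good enough in the heat of the moment. The function names, the field names and the `incident.log` path are all made up for illustration, not something we actually run.

```python
import json
from datetime import datetime, timezone


def record_change(log_path, action, affects=(), now=None):
    """Append one timestamped entry for a manual change made during the incident.

    `action` is free text; `affects` lists anything else that might be touched.
    Both names are hypothetical, chosen for this sketch.
    """
    entry = {
        "time": (now or datetime.now(timezone.utc)).isoformat(),
        "action": action,
        "affects": list(affects),
    }
    # Append-only: earlier entries are never rewritten, so the order of events is kept.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def read_changes(log_path):
    """Return every recorded entry, oldest first, for the post-restore review."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Example entries are invented; substitute whatever you actually changed.
    record_change("incident.log", "pointed app at a database replica", ["pending bookings"])
    for change in read_changes("incident.log"):
        print(change["time"], change["action"], change["affects"])
```

One JSON object per line keeps the file easy to append to from several terminals and easy to read back once things calm down. A shared doc or a chat channel does the same job if that is what the team already has open.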
Anyone else got any tips to share?