I Had Downtime Today. Here's What I'm Doing About It.

I screwed up in a major way yesterday evening. This post is part of my attempt to fix it.

This morning I woke up to an email from a paying customer saying that they tried to print cards but couldn’t. Specifically, they said that they were able to use the Print Preview feature, but that using the actual print button, quote, “caused the server to hang.” That can’t actually happen, but it was a sufficiently detailed bug report to immediately clue me in on what probably happened: the Delayed::Job workers must be down. A quick check of the server (ps -A | grep ruby) showed that this was indeed the case.

I quickly restarted the Delayed::Job workers then logged into the Rails console to check how many jobs had piled up. Six thousand.  Oof.  Most of them were low priority tasks (e.g. pinging the Mixpanel server with stats updates, which I do asynchronously to avoid having a failure there affect my users), but sixty users were affected — their print jobs were delayed.  Print jobs normally take under five seconds to execute and are checked with a bit of AJAX magic which polls the server until the job is ready, which means that most of these users probably got an animated GIF spinner to look at until they got tired and closed the web page.  The worst affected jobs took over twelve hours.
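
(For the curious, the server side of that polling is very simple. The sketch below is illustrative rather than the actual Bingo Card Creator code: the PrintJob model, its methods, and the controller name are hypothetical.)

    # Sketch of the status action the AJAX spinner polls every few seconds.
    # PrintJob is a hypothetical model wrapping the Delayed::Job-generated output.
    class PrintJobsController < ApplicationController
      def status
        job = current_user.print_jobs.find(params[:id])
        if job.completed?
          render :json => { :status => "ready", :url => job.download_url }
        else
          # If the workers are down, this branch repeats forever: exactly the
          # spinner-until-you-give-up experience described above.
          render :json => { :status => "pending" }
        end
      end
    end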

Happily, the downtime hit on a Saturday, which is the lightest day of the week for me.  If this had happened a week ago right before Valentine’s Day over 5,000 users would have been affected.

Apologizing To Affected Users

I used the Rails console to create a list of users affected by this, and have sent individual apology emails to the 2 paying customers affected (including attachments for the cards they had tried to print).  I will be contacting the trial users in a more scalable fashion.  Since I don’t have permission to email free trial users (the anti-spam guarantee I give is fairly strict), I dropped the development I had planned for this morning and built a simple messaging system into the site (~20 lines of code — I love you, Rails).  It gives me one-way “drop a message directly to your dashboard” functionality.
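
In case the idea is useful to anyone else, a minimal version of the feature looks something like this. The model, association, and column names are my own shorthand for illustration, not necessarily what is in the codebase:

    # DashboardMessage has user_id, body, and read_at columns; User has_many
    # :dashboard_messages.  (Rails 2.x named_scope syntax.)
    class DashboardMessage < ActiveRecord::Base
      belongs_to :user
      named_scope :unread, :conditions => { :read_at => nil }
    end

    # The dashboard simply renders any unread messages for the current user.
    class DashboardController < ApplicationController
      def show
        @messages = current_user.dashboard_messages.unread
      end
    end

    # Apologizing to the affected users then becomes a one-liner in script/console:
    #   affected_users.each do |user|
    #     user.dashboard_messages.create!(:body => "Sorry about your print job last night...")
    #   end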

For example: [screenshot of the apology message as it appears on a user’s dashboard]

I prefer using this feature to the standard industry responses to outages:

  • “Outage?  What outage?”
  • “Please see our status page, which we’ve conveniently located in electronic Siberia.”
  • “ATTENTION ALL USERS!  0.7% of you were affected by very serious sounding things yesterday!  Please be worried unnecessarily even if you weren’t affected, and swamp our support line, which we will give no effective tools to tell you whether you’ve been affected or not!”

It allows me to apologize directly to affected users, makes minimal demands on their attention while still almost certainly reaching them, and does not cause any issue for the other 25,000 users.  Plus I can re-use this feature later in the event of needing to contact specific users without needing to email them (one obvious candidate would be plopping something straight on the screens of anonymous guests if I found something they individually needed to know, for example, if one of my automated processes caught that a recent print job of theirs did not come out right).

Preventing It From Happening Again

I’m something of a fan of Toyota’s Five Whys methodology for investigating issues like this.  (It has recently been popular with the lean startup crew.  My coworkers at the day job enjoyed some mostly justifiable smirks when I told them that.)

  1. Why couldn’t my users print?   Because the Delayed::Job workers were terminated when I upgraded the production server to Ubuntu Karmic Koala last night.
  2. Why didn’t the post-deploy checklist catch that users couldn’t print?  The post-deploy checklist has “manually verify you can print cards” on it. I didn’t follow it with sufficient attention to detail because it was late (midnight) and I was tired (because I worked a six-day crunch week at the day job… 30 days to go).  Here, I used the Print Preview feature to verify that I could print cards (“Hey, it tests the same code path, right?”), not realizing that while it tests the same code path, the two have different failure scenarios if e.g. the Delayed::Job workers are down.  Fix: Quit day job and, regardless of how tired you are, follow the freaking checklist.
  3. Why weren’t you woken up by the Ride of the Valkyries playing on your cell phone when the site failed?  Don’t we have a system in place to do that? It turns out that the automated diagnostic (an external service pings a URL, the URL runs various tests and throws an HTTP error if any fail, the service mails my cell phone if there is an HTTP error twice in a row) tests nginx, mongrel, the D/B, and core program logic, but doesn’t test the Delayed::Job processes or sanity-check the job counts.  Fixed (see the first sketch following this list).
  4. Why didn’t the ‘god’ process monitor detect the workers were down? God sees every sparrow, but god only knows about the processes you tell it to manage, and my god_config.rb file has the Delayed::Job bits commented out with the notation “#This is buggy.”  I don’t remember why it was buggy and my notes in SVN are similarly unhelpful.  New task: unbuggy it (the second sketch following this list shows roughly what that stanza should look like).
  5. Why don’t you have commit notes, comments, or a development journal telling you what you were thinking when you found it was “buggy”? Failure to keep adequate records for “minor” changes and failure to follow up on a bug that was prioritized “Eh, get to that whenever” and then never gotten to.  Fix:  Look into beefing up developer documentation practices.
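
To make the fix in #3 concrete, here is roughly what the beefed-up diagnostic now checks for Delayed::Job. The controller name and thresholds below are placeholders; the queries are plain ActiveRecord against the delayed_jobs table:

    # The external service pings this action; an HTTP error twice in a row
    # results in an email (and the Ride of the Valkyries) on my cell phone.
    class MonitoringController < ApplicationController
      def health
        errors = []

        # Existing checks of the DB and core program logic elided.

        # New: is the job queue actually draining?
        oldest_pending = Delayed::Job.find(:first,
          :conditions => ["failed_at IS NULL"],
          :order      => "created_at ASC")
        if oldest_pending && oldest_pending.created_at < 10.minutes.ago
          errors << "Delayed::Job queue is stale"
        end

        # New: sanity check on the job count.
        errors << "Delayed::Job pileup" if Delayed::Job.count > 500

        if errors.empty?
          render :text => "OK"
        else
          render :text => errors.join("; "), :status => 500
        end
      end
    end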

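And here is roughly what the to-be-unbuggied god stanza from #4 ought to look like once it goes back into god_config.rb. The paths, worker name, and intervals are placeholders, and the start/stop commands assume the workers are daemonized via script/delayed_job:

    # god watches the worker's pid file and restarts the process if it dies.
    app_root = "/path/to/app/current"   # placeholder path

    God.watch do |w|
      w.name     = "dj-worker"
      w.interval = 30.seconds
      w.start    = "cd #{app_root} && RAILS_ENV=production script/delayed_job start"
      w.stop     = "cd #{app_root} && RAILS_ENV=production script/delayed_job stop"
      w.pid_file = File.join(app_root, "tmp", "pids", "delayed_job.pid")

      w.behavior(:clean_pid_file)

      # Start the worker if its process is not running.
      w.start_if do |start|
        start.condition(:process_running) do |c|
          c.interval = 30.seconds
          c.running  = false
        end
      end
    end
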
In the course of investigating this I discovered the update to Koala also killed Memcached on the server.  (Thankfully, Memcachedb — where I persist long-term user data that for whatever reason isn’t in the database, such as A/B testing participation data — is on another server.)  Unbeknownst to me, my use of memcached fails totally silently: if Rails can’t find the data in the cache it just regenerates it.  That would have had very unpleasant consequences for users (every page rebuilding its cached content from scratch) if it had continued until Monday, and none of my automated tests would have picked up on it, because they all ignore timing.  I’ve added an explicit check to see if memcached is up and running.  I’ll also look into doing something about monitoring response times.
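
The memcached check is the same write-then-read-back trick, bolted onto the monitoring action sketched above. The key name is arbitrary:

    # Rails' memcached store swallows connection failures silently, so write a
    # canary value and make sure it can be read back.
    def memcached_alive?
      canary = "monitoring-#{Time.now.to_i}"
      Rails.cache.write("monitoring_canary", canary)
      Rails.cache.read("monitoring_canary") == canary
    rescue StandardError
      false
    end

    # ... and in the health action:
    #   errors << "memcached appears to be down" unless memcached_alive?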

What I Learned From Japanese Engineering

I’m indebted to my day job for teaching me both a) how to do this and b) the absolute necessity of doing it, in spite of my longtime cavalier attitude toward software testing. It was quite a culture shock for me the first time I logged into the test server at work to deploy something and got a rap on the knuckles for not:

  • Having a written explanation of exactly what commands I was going to enter.
  • Having a written checklist describing what tests to perform to ensure the deploy worked, and what the expected results would be.
  • Writing in the wiki that I was doing the deploy for a particular version done to close out a particular bug, so that there would be a trail to follow if the version I was about to deploy failed years from now.

That’s what we do for the test server.

All of the writing, test suites, automated test processes, and monitoring take some time to set up, and much of it generates additional overhead on all your tasks.  However, in the last three years, I’ve come to recognize that it is a net time-savings over writing apology letters and doing emergency incident response, neither of which is ever fun or quick.

Alright, development journal entry over.  Back to new development.

9 Responses to “I Had Downtime Today. Here's What I'm Doing About It.”

  1. Eivind Uggedal February 21, 2010 at 2:20 am #

    If you don’t mind me asking, what external monitoring service do you use to notify your phone by email in case of a failure when fetching your custom entrypoint which touches your entire system?

  2. Thibaut Barrère February 21, 2010 at 2:31 am #

    Hello Patrick,

    thanks for the detailed (and interesting) write-up.

    One question popped into my mind: what led you to upgrade to Ubuntu Karmic Koala? Does the platform bring specific new benefits?

    thanks,

    – Thibaut

  3. Patrick February 21, 2010 at 3:39 am #

    Eivind: I use mon.itor.us to ping a particular URL and send emails. The email is assigned a custom ringtone on my cell phone. The URL hits an action in Rails which runs custom code and sends HTTP 500 if anything goes wrong (triggering the email) or HTTP 200 otherwise.

    Thibaut: I was preparing for switching from MemcacheDB to Redis, and Koala has Redis in its repositories while Hardy (what I was running) does not. I prefer to use system stuff from the repos because they’re much less likely to break things than I am.

  4. Jeff Lewis February 21, 2010 at 7:28 am #

    Thanks for the post. It is a great reminder of the importance of following the test plan every time. This is precisely what I am trying to accomplish with http://testplanmanagement.com

    [shameless plug]
    It is still in beta, but there is a single user plan that is free and would help you and others keep track and follow the steps in specific test plans. Feel free to give it a try.
    [/shameless plug]

  5. Giles Bowkett February 21, 2010 at 6:02 pm #

    Seriously, that’s what you have to do for the test server? I want to work in Japan.

  6. Paul Stamatiou February 21, 2010 at 10:44 pm #

    I have thought about creating a “drop a message directly to your dashboard” feature for Skribit for quite some time – or at least beefing up flash[] stuff. I think this post has convinced me to put that on my to do list. Thanks!

    Been following your blog/HN stuff for what seems like forever.

    Best,
    Paul

  7. Torben February 22, 2010 at 2:38 am #

    You don’t often talk about bug *tracking* but often of the *squashing* itself. Are you using any tracking tool to keep tabs on bugs/features/to-dos/inquiries? I’ve seen that FogBugz comes with a free account for 1-2 people; just right for a mISV. I’d like to hear your thoughts about this, or about similar tools?

  8. John February 22, 2010 at 7:21 am #

    So, having not heard of the 5 Whys, I had to quickly look it up and of course went to the ubiquitous site Wikipedia. For someone as skilled at root cause analysis as you, I could see the 5 Whys being sufficient, but I had to pull the critique, because it just seems like a very loose framework that may create a culture of root cause analysis without completing the steps. So, from the wiki:

    While the 5 Whys is a powerful tool for engineers or technically savvy individuals to help get to the true causes of problems, it has been criticized by Teruyuki Minoura, former managing director of global purchasing for Toyota, as being too basic a tool to analyze root causes to the depth that is needed to ensure that the causes are fixed. Reasons for this criticism include:

    • Tendency for investigators to stop at symptoms rather than going on to lower-level root causes.
    • Inability to go beyond the investigator’s current knowledge – they can’t find causes that they don’t already know.
    • Lack of support to help the investigator ask the right “why” questions.
    • Results aren’t repeatable – different people using 5 Whys come up with different causes for the same problem.
    • The tendency to isolate a single root cause, whereas each question could elicit many different root causes.

  9. Patrick February 23, 2010 at 5:31 am #

    >>
    Are you using any tracking tool to keep tabs on bugs/features/to-dos/inquiries?
    >>

    I use the old Mk. 1 pen and paper to keep track of anything that needs to stay in my brain longer than a day and shorter than a week. For example, “What I am going to implement this Saturday.” Email inquiries I answer immediately the next time I check email and star in Gmail if they require follow-up. Everything else gets either blogged, remembered, or forgotten.

Trackbacks/Pingbacks

  1. Elsewhere, on February 21st - Once a nomad, always a nomad - February 21, 2010

    [...] Shared I Had Downtime Today. Here’s What I’m Doing About It.: MicroISV on a Shoestring. [...]

  2. Listen to I Had Downtime Today. Here’s What I’m Doing About It. - MicroISV on a Shoestring - Hear a Blog - February 21, 2010

    [...] http://www.kalzumeus.com/2010/02/21/i-had-downtime-today-heres-what-im-doing-about-it/ [...]

  3. My daily readings 02/22/2010 « Strange Kite - February 22, 2010

    [...] I Had Downtime Today. Here’s What I’m Doing About It.: MicroISV on a Shoestring [...]

  4. High Scalability のホット・リンク集 : Cassandra@Twitter インタビューもあるよ! [ #cloud #cloudcomputing #nosql ] « Agile Cat — Azure & Hadoop — Talking Book - February 25, 2010

    [...] I Had Downtime Today. Here’s What I’m Doing About It by Patrick McKenzie. Awesome deep dive into what went wrong with Bingo Card Creator. Sh*t happens. How do you design a process to help prevent it from happening and how do you deal with problems with integrity when they do? [...]

  5. Running A Software Business On 5 Hours A Week: MicroISV on a Shoestring - March 21, 2010

    [...] your testing and QA procedures to avoid it.  When they fail — and they will fail — fix the process which permitted the failure to happen, in addition to just [...]

  6. Building Highly Reliable Websites For Small Companies: MicroISV on a Shoestring - April 20, 2010

    [...] Minimizing operator error is critically important, because you are the least reliable component of your system.  Because you rely on software to do most of the actual work, when you touch the system you’re almost by definition performing something novel that isn’t automated.  Novel operations are moving parts and vastly more likely to fail than known-good operations that your system crunches millions of times per day.  Additionally, even if what you want to do is absolutely flawlessly planned out, you’ll often not execute flawlessly on the plan.  This was one of the root causes of my worst downtime ever. [...]
