I Saw An Extremely Subtle Bug Today And I Just Have To Tell Someone

This post will not help you sell more software. If you’re not fascinated by the inner workings of complex systems, go do something more important. If you are, grab some popcorn, because this is the best bug I’ve seen in years.

Have you ever been logged into a site and get suddenly asked to log in again?  This is one of those minor nuisances of using the Internet, right?  If you’re not technically inclined, you think “Hmm, ghosts in the Googles.  Oh well, here is my username and password again.”  If you are technically inclined, you might think “Hmm, funny, my cookie picked a bad time to expire, or maybe my session was getting stored in Memcached on a machine which just went down.  Oh well, here is my username and password again.”

It turns out that Bingo Card Creator has been doing this pervasively to a fraction of my users for the last few months.  I never noticed it, and no one ever complained.

Here’s the scenario: Bingo Card Creator is a Rails application, originally coded against Rails 2.1.X and then gradually updated with Rails security releases.  Like many Rails applications, it stores sessions in a cookie (using CookieStore), and uses the session to hold only very limited data.  Specifically, it holds the (critical) user ID for logged in users and the (nice to have) pre-login session ID.  I use the pre-login session ID to tie some analytics stuff together on the back end — basically, it lets me associate newly created accounts with search terms and whatnot that bring them to my site.  The exact mechanism for doing that isn’t important to this bug — you just need to understand that the session resetting is a minor nuisance if it only happens once in a blue moon, and a huge nuisance if it happens pervasively.

Subtle Indications My Site Was Borked

BCC maintains a whole lot of internal analytics, because I’m something of stats junkie.  Because BCC is in maintenance mode this year, I don’t actually view the stats on a regular basis — as long as the server stays up and users don’t have any complaints, I let the sleeping dog lie.  (I’ve been busy with other projects.)  Anyhow, one example of such a stat is “Of recently created trial accounts, how many were referred from the Halloween bingo cards mini-site?”  For most of the year, that should be a negligible number.

Except right about on Halloween, when the mini-site sees on the order of 30,000 visits or more.  This usually sells several thousand dollars worth of software.  That is fairly hard to miss, because if several thousand dollars don’t show up in my bank account, I’d know right away.  (Sidenote: I did lose about $1,000 due to an ill-timed server crash while I was on a cross-continental plane ride right during the middle of the pre-Halloween rush. Oof.)  So naturally, several thousand dollars implies a hundred or more sales (at $30 each) which implies thousands of trials, right?

Well, my internal analytics code was telling me that the Halloween site had referred ~100 trials of which 6 converted.   Which means that I should have expected a $200 bump in my bank balance.  Which was not what happened.

I mentally filed this away under “Hmm, that’s odd” but didn’t investigate immediately because I had not lost any money (or so I thought) and was busy that week.  Then recently, after doing an unrelated code push (I integrated Stripe, it is awesome, full details later), I did my usual post-deploy smoke test and, after creating a new account, I got suddenly logged out of the application.

“Hmm, that’s odd.”  And I tried it again, twice, couldn’t produce the error, and mentally wrote it off to gremlins.

In Which I Become Doubtful Of The Existence Of Gremlins

Four hours ago, my brain suddenly decided to put these facts together. The discrepancy for the sales statistics strongly suggests that, prior to accounts getting created, the session was getting cleared.  This meant that, when the account actually got created, the referrer was not associated with the account in the DB, which threw off subsequent stats gathered by my internal analytics.  Sessions getting randomly cleared would also cause the user to become instantly signed out.

I tried to reproduce the problem in development mode and was pervasively unable to do so.  Then I started trying to reproduce it on the live site and was able to, sporadically, but only in Incognito Mode in Chrome, and only if I clicked fairly fast.  (Don’t ask how many dozens of tries it took to figure out that fast clicking was the culprit.)

Having verified that it actually existed, I added instrumentation to tell me what my session ID was, and noticed — like expected — that it changed when I was suddenly logged out.  Sure enough, the session was getting wiped.  But why?

Racking my brains to figure out “What could reset a session in Rails other than explicitly trying to do it?”, I started listing up and discarding some candidates:

  • The cookie expired in the browser — nope, expiry was set correctly
  • The cookie got eaten by the hop from Nginx to Mongrel — nope, after investigation, cookies always matched on both sides (like expected)
  • The cookie got too big and failed to serialize properly — nope, after looking through the Rails codebase, that looked like it would throw an exception
  • The cookie got reset when Rails detected malicious behavior coming from the browser — bingo!

CSRF Protection: When It Breaks, It Breaks Very Quietly

Cross-site request forgery (CSRF) is tricking the browser with a malicious (or compromised) site B to access something on site A.  Since requests for site A will carry A’s cookie whether requested by A or not, an image tag or embedded Javascript on B can do anything on A that a logged-in user can do, like accessing /accounts/wire_all_your_money_to_switzerland with the appropriate POST parameters to make it happen.  This is, to put it mildly, a bad thing.  Rails has lovely magic which defends against CSRF for you: all you have to do is include two lines of code
#In application_controller.rb
protect_from_forgery

#In your templates' HEAD somewhere

Rails will then basically generate cryptographically secure random number, totally transparently to you. This is called the CSRF token.

One copy goes in your Rails session, where only your server and the client can see it.  (n.b. Rails sessions are a bit opaque since they are Base64 encoded, but they can be trivially decoded by anyone who can read the cookie, including the end-user.  They can’t be forged because of another security feature, but don’t put anything you don’t want the user to know in the session.)

Another copy goes in the document’s HEAD (for access via Javascript) and in Rails-generated forms as a hidden value.  When Rails makes a PUT or POST request to the server (via helper-generated form or helper-generated Javascript), Rails will submit the copy included in the HTML code with the request, compare it to the one in the session, and bounce requests where they don’t match. Bad actions on other sites shouldn’t be able to read either a) a page on your site (the same origin policy prevents this) b) the contents of your cookie from your site, so this is secure.

The specifics of how it “bounces requests” are very important.

Point Releases Sometimes Contain Doozies

My personal understanding of Rails up until an hour ago was that a CSRF violation would raise an exception.  This would practically never get seen by a legitimate user operation, so few people are aware of that, but I had seen it a time or two when security auditing BCC.  (Some of my geeky friends had, back in the day, exploited BCC with a CSRF and helpfully told me how to fix it.  Naturally, after fixing it I verified that the site worked as expected with the fix.)

So if the CSRF protection was somehow eating sessions, I would expect to see that exception getting logged and emailed to me by Airbrake (formerly Hoptoad — it emails you when an exception happens in production, highly recommended).   That wasn’t happening.

Then I decided to dig into the Rails source.  Whereupon I learned that Rails 2.3.11 changed the behavior of CSRF protection: instead of throwing exceptions, it would silently just clear the session and re-run the request.  For most sensitive operations (e.g. those which require a signed in user), this would force a signout and then any potentially damaging operation would be averted.

Here’s the relevant code in Rails 2.3.11:

def verify_authenticity_token
  verified_request? || handle_unverified_request
end

def handle_unverified_request
  reset_session
end

Versus the relevant code in Rails 2.3.10 (sidenote: you can see all of this easily in Github because Rails is diligent about tagging releases, a practice you should certainly follow in your own development):

def verify_authenticity_token
  verified_request? || raise(ActionController::InvalidAuthenticityToken)
end

And, sure enough, checking Subversion showed that I upgraded the version of Rails I was using in January of this year in response to this security advisory. I read that, made the required modifications to my application, tested, and had no problems.

So What Went Wrong Then?

After I was sure that sessions were being reset (but only in production), I added a bit of instrumentation to the live site to record the session ID for people coming from my IP address and to log when it changed. This let me find the culprit: a bit of Javascript that A/Bingo, my A/B testing library, uses to verify that people are human. It assumes that robots generally won’t run Javascript engines capable of doing POST requests, so it does an ajax-y POST to my server to assert humanity of the end-user, thus keeping almost all bots out of my stats.

That code has been live over a year. Why did it suddenly start causing session resets? Oh, another change in the 2.3.11 upgrade:

The old code:

  # Returns true or false if a request is verified.
  # Comment truncated by Patrick
  def verified_request?
      !protect_against_forgery?     ||
        request.method == :get      ||
        request.xhr?                ||
        !verifiable_request_format? ||
        form_authenticity_token == form_authenticity_param
  end

Notice that request.xhr? will cause this request to be verified if it evaluates to true, regardless of the other things in the OR statements. request.xhr? tests whether a request is ajax-y in nature. A/Bingo’s humanity-verifying POST is, so it didn’t trigger the CSRF check.

The new code, however:

  # Returns true or false if a request is verified.
  # Comment truncated by Patrick
  def verified_request?
    !protect_against_forgery?                            ||
      request.get?                                       ||
      form_authenticity_token == form_authenticity_param ||
      form_authenticity_token == request.headers['X-CSRF-Token']
  end

Yep, as announced in the patch notes, we lost the exemption for XHR requests. So the A/Bingo mark_human request will, because it makes no particular effort to include a CSRF token (which I will be changing very quickly, as A/Bingo is my project), with certainty cause the CSRF check to fail in 2.3.11. This will result in not a noisy exception (the previous behavior) but instead a silent reset followed by re-running the action. A/Bingo, which doesn’t care a whit whether you’re logged in, will then mark your freshly new session as human. If the previous contents of your session mattered, for example to keep you signed in, they are now gone. A/Bingo will not reaudit your humanity, though, because your session now marks you as human, so this will only ever happen to your browser once.

Race Conditions: Not Just In Java Thread Programming

So why did this never show up in development and why did it show up only sporadically in production? Well, consider how a browser interprets a page presented to it: it first downloads the HTML, then downloads the assets, blocking when it discovers e.g. CSS or Javascript which alters the document. This means that Javascript very low on a page may never execute if someone above it blocks them until the user navigates away. (This is a pretty gross simplification of how multiple pieces of very complicated and often incompatible software do something very difficult. If you want details, read stuff by the people behind YSlow. They’re geniuses and taught me all that I successfully learned about this process.) Someone like, say, external analytics utilities loaded over the public Internet. My page includes a few such scripts, like Google Analytics and CrazyEgg. They are off in development to avoid polluting my stats.

This plus the lack of network latency means that, on development, a browser which sees a page that includes the humanity testing Javascript will almost certainly execute it. That will cause the session to be burned, once, on the first page load. Since my invariable workflow for manual testing is “Start with a new browser at the homepage or landing page, do stuff, make sure it works”, the order of execution is:

  1. I load the front page or a landing page. The session is initialized to some value S1.
  2. (A few milliseconds later.) The A/Bingo Javascript checks for my humanity, resetting the session to some new value S2.
  3. I hit the registration or login button, and the site works as I expect it to.
  4. Since the site knows I am human now, that never gets checked again, and the session never gets reset again.

In production though, the workflow could very well be:

  1. The user arrives at the front page or landing page. The session is initialized to some value S1, including (say) their referrer information.
  2. A bunch of Javascript starts loading ahead of the A/Bingo check.
  3. The user, within one or two seconds (depending on network latency to those external scripts), either logs in or creates an account.
  4. The browser never successfully executes the A/Bingo check.
  5. The user arrives at their dashboard. When it is rendered, the server (robotically) decides it isn’t quite sure if they are human yet, and includes that Javascript again. (This behavior is designed because I was aware of the timing issue, I just didn’t realize how it would shake out with the 2.3.11 upgrade.
  6. This time, the user ponders their dashboard enough for the A/Bingo Javascript to post successfully. This resets their session to some new value S2.
  7. The user clicks anything on the page, and (because S2 doesn’t include their logged in user ID) gets taken to a login screen.
  8. The user is now durably marked as human, so the A/Bingo check never fires again, preventing a second unceremonious logout.

This neatly explains the logged out users. How to explain the missing referrer information? Well, if the user is NOT very fast on the click on the first page, they’ll have their referrer cleared out of the session before they successfully signup. They’ll get marked as a human prior to creating their account, though, so they’ll never even notice the unceremonious logout. This is the behavior of the overwhelming bulk of new users, which is why the stats were getting comprehensively borked but almost no users thought to complain.

This difference in behavior based on the hidden interaction of two concurrent processes is called a race condition. Race conditions are why sane programmers don’t program with threads or, if they do, they use shared-nothing architecture and pass all communication between the threads through a message queue written by someone who knows what they are doing (if you have to ask, it isn’t you — seriously, multithreaded programming is hard). I haven’t seen a race condition in years, because the genre of web applications I write and their architectures makes me mostly immune to them. Well, I just got busted back to CS102. Sadly, the core lesson of CS102 hasn’t changed: reasoning through why race conditions happen is very hard.

Saved By Unsophisticated Users, Sort Of

Users returning after the session naturally expired (2 weeks) would go through the dance again, potentially getting asked to log in twice. However, it took most of them enough time to have the human check prior to finding where the Sign In button was, so the percentage of users who actually visibly saw the bug was fairly small. (I’m guessing, from a quick heuristic run on my log files, that it was below 1% of accounts. That’s the optimistic way to say it. The pessimistic way is to say that this bug negatively affected the better part of a thousand people, and probably cost me sales from some of them.)

Whose Fault Is This?

If my users are inconvenienced, it is my fault, always. I should have read the patch notes for 2.3.11 more diligently, to discover the very consequential line “In addition to altering the templates, an application’s javascript must be changed to send the token with Ajax requests.”, and I should have been more aware that there was a one-line Javascript method pulled in by a library (which I wrote, so that is no excuse) which was not automatically upgraded with the Rails helper methods.

I’m not sure if more diligent testing would have caught this. Race conditions are hard to diagnose, and while I might have caught it by including production levels of external Javascript in my development environment, the symptoms would only have been visible a fraction of the time anyhow, and in ways which didn’t visibly break the application most of the time. (Who checks their stats for the development version to make sure they’re sensible after implementing that function correctly the first time?)

What I really should have done about this is addressing it earlier, when I first got the inkling that there was some weird edge case which would cause a logged in user to become logged out. I futzed around with my configuration once or twice and saw the problem go away (because it was non-deterministic), but rather than futzing I should have figured out a complicated but reducible series of steps that would always cause the issue. That would have sent me down the right road for fixing it.

So How Do You Address This

Immediate term, a one-line patch turns off CSRF protection for the A/Bingo mark_human action, preventing it from accidentally resetting the session.

skip_filter :verify_authenticity_token, :only => :mark_human

I also added a note about this to the A/Bingo documentation. I’ll patch A/Bingo after I have enough brain cells left to do that in a way which won’t break anyone’s applications. After I patch A/Bingo, that work-around won’t be necessary.

Why’d You Write This Post?

Because, after hours spelunking in Firebug, my codebase, and the innards of obsolete version of Rails to understand what was happening, I had to tell somebody. Some people have water coolers. I have the Internet. Hopefully, someone in this wide world will find this discussion useful.

If you’re wondering what the day-to-day life of an engineer is like or why it’s so dang hard some of the time, this might be a good example (of the pathological case — the typical case is writing boring code which solves boring problems, like laying out a 5×5 grid on a bingo card and randomizing the word order). Bingo Card Creator is not terribly complicated software when compared to most applications, but it sits on top of other pieces of code (Rails, the web server the browser, the TCP/IP stack, the underlying OS, the hardware on both ends, etc) which collectively are orders of magnitude more complicated than any physical artifact ever created by the human race.

Most of the time that complexity is abstracted away from both the user and the developer, both as blissfully ignorant of the layers below as an ant walking on an aircraft carrier is ignorant of the depth of the ocean. But when a problem bubbles up and writing it off to gremlins isn’t getting the job done, you have to start looking at the lower levels of abstraction. That is rather harder than dealing with just the higher levels of abstraction. (Joel Spolsky has an article about this subject.)

About Patrick

Patrick is the founder of Kalzumeus Software. Want to read more stuff by him? You should probably try this blog's Greatest Hits, which has a few dozen of his best articles categorized and ready to read. Or you could mosey on over to Hacker News and look for patio11 -- he spends an unhealthy amount of time there.

43 Responses to “I Saw An Extremely Subtle Bug Today And I Just Have To Tell Someone”

  1. Ron November 17, 2011 at 11:52 am #

    > Hopefully, someone in this wide world will find this discussion useful.

    I did, and I don’t even use Rails :-) Thanks for taking the time to write it up!

  2. Jon November 17, 2011 at 11:59 am #

    A familiar story, but well written. +1 for finding it at all.

  3. Mike Ferrier November 17, 2011 at 12:12 pm #

    Thanks for sharing this.

    Why is it so satisfying to read someone else’s story about how they drilled down through the layers of logic to fix a very subtle bug? It may trigger memories of having done it yourself in the past, and you get a little echo in your sense memory of the intense satisfaction you felt at the time.

    Whatever the reason, it’s a great feeling.

  4. Paul November 17, 2011 at 12:12 pm #

    Fascinating! Thanks. I don’t use Rails, but can appreciate the work that went into finding this. I’ve also been on both sides (creator and victim) of similar cascades of changes in the past, so I know just how much work these sorts of things can take.

  5. Grace November 17, 2011 at 12:20 pm #

    You left out the link for “relevant code in Rails 2.3.10″

    Very fascinating reading about how someone else hunted down a mysterious bug :)

  6. Andrew November 17, 2011 at 12:33 pm #

    Hi,

    I love bug war stories. I haven’t read through the whole thing yet, but I had a similar problem a while ago. Similar at least in that it only affected a small subset of my logged-in users. For these lucky users, my pages would load without any references to my site’s CSS files so they got to see the “1994″ unstyled version of the site. T’was a nice one to track down.

  7. Marco November 17, 2011 at 12:53 pm #

    Thanks for the great post.

    I saw an extremely subtle typo, and I just have to tell you:

    Borked -> Broked

  8. Drew November 17, 2011 at 1:00 pm #

    Fun read, thanks Patrick. I’m looking forward to your post on Stripe – I just started integrating it into our own payment processing code and I love it.

  9. LoydB November 17, 2011 at 1:18 pm #

    Great read, and good hunting!

  10. yiggity November 17, 2011 at 1:43 pm #

    Very good writeup. Thanks for taking the time.

    The interwebz are my water-cooler too. :)

  11. Teresa November 17, 2011 at 1:57 pm #

    Thanks for writing this up. I always find bug stories fun to read, mainly because this time I’m not a active participant in the story lol. Bravo!

  12. Anton Zolotov November 17, 2011 at 2:45 pm #

    This was very useful, because it helped me to figure out a bug with a Facebook app and CSRF protection. Thanks Patrick!

  13. Michael Shapiro November 17, 2011 at 3:02 pm #

    I just recently stumbled upon the same issue with CSRFs in Rails. We’ve got an API-based service with a client-side app making POST requests. It took forever to track down, but we fixed it by calling ‘form_authenticity_token’ before we respond to the “launch” action that sends back the client app.

  14. Ryan November 17, 2011 at 4:16 pm #

    I’m not a Rails person either and I found this useful. Been there so many times when hours after I saw a passing bug, initially I write it off, then come to realize it was the key to not one but about four different seemingly unrelated problems.

  15. Johannes November 17, 2011 at 4:30 pm #

    “Some people have water coolers. I have the Internet.”
    You rock.
    Gread article!

  16. Max November 17, 2011 at 4:34 pm #

    Frankly, I didn’t understand half of this post because I don’t do much back-end coding, only silly front-end dev and design. :)

    Even not understanding half of what you were talking about, I completely appreciated it and enjoyed the read – thanks for writing it! (sometimes those long ones just end with you thinking… “I wrote HOW much???”)

    I admire your persistence, and your successful foray in to niche web apps is motivating my own. Thanks again :)

  17. Edward Anderson November 17, 2011 at 4:40 pm #

    Yep, happened to me too, though in a different scenario. Glad you figured it out! It took me using a debugger stepping through Rails callback code to figure out what was happening.

  18. Joseph McCormick November 17, 2011 at 5:07 pm #

    Funny you should mention this, as I’ve been tackling the same problem this week to deal with testing quirks. What prompted it? Well, I am building a bit of a service webwork internally, and our applications have been trying to talk to each other. CSRF tokens have puzzled the hell out of me a lot recently.

  19. Calyth November 17, 2011 at 9:44 pm #

    Fun read. This piece didn’t speak to me at a Rails / Ruby level – it spoke to me at the persistence at development / testing level.

    It’s easy to write something off as gremlins in the works, but we all know that gremlins are one of the toughest bugs out there. Glad to see the effort to piece everything together :)

  20. G Snyder November 18, 2011 at 1:21 am #

    Great post – thanks for writing this up.

  21. Piskvor November 18, 2011 at 1:28 am #

    Fascinating. Always good to see a mysterious error exposed, explained and fixed. Just a footnote: web robots are learning to POST through XHR, too (in blatant violation of the relevant RFCs, IMNSHO), so the is_human heuristic probably won’t hold for too long.

  22. Weng Fu November 18, 2011 at 5:10 am #

    Could please make the example with the Visual Basic?

  23. Jeff November 18, 2011 at 7:59 am #

    Wow, this *completely* explains the behavior I got when I attempted to upgrade a project from 3.0.3 to 3.0.4. Thanks a million for the explanation!

  24. Caley Woods November 18, 2011 at 8:44 am #

    It looks like Rails 3 is fairly similar except if verified_request? is false it does a logger.warn and then executes handle_unverified_request which executes reset_session.

    So if you ever go to Rails 3.0.x it currently looks like there would be a warning in the logs stating “WARNING: Can’t verify CSRF token authenticity”.

    Great write up!

  25. Hap Holiday November 18, 2011 at 9:00 am #

    I’m not a dev and I found this writeup to be interesting and concretely understandable to me. Nicely done.

    I did have to look up “race condition.”

  26. Phil November 18, 2011 at 9:04 am #

    Excellent article. Writing multi-threaded, multi-process code in my day job, it’s easy to appreciate the subtlety and soul-destroying properties of tracking down such a bug ;)

  27. MK November 18, 2011 at 10:28 am #

    you’re a hero; juggling rails guts like that. I learned something. Thanks.

  28. raph November 18, 2011 at 11:14 am #

    very interesting to read.
    I really like your final considerations on the complexity of the layers we build on everyday.

    good hunting! :)

  29. finite November 18, 2011 at 11:34 am #

    A nice read, cheers for posting it.
    I love finding bugs like this, it’s a very enjoyable feeling when the penny drops and you suddenly understand it all.

  30. Adam Pieniazek November 18, 2011 at 2:19 pm #

    I am technically inclined and have thought “ghosts in the Googles”. It’s chaos theory in full effect.

  31. lolaturfaec November 19, 2011 at 6:12 am #

    lol ur stupid

  32. Howard November 19, 2011 at 7:07 pm #

    Great story, Patrick! I once had to deal with deadblock in a multithreading Ruby script. I was actually just using EventMachine, and spinning workers to do tasks. It turned out that there was a race condition in the postgres C extension.

    That’s day I swore never to use Thread in Ruby again. Now whenever I need concurrency, I just spin out processes. Redis is my hammer.

  33. Vadim P. November 19, 2011 at 9:10 pm #

    Great read, thanks!

  34. Pete November 19, 2011 at 9:11 pm #

    Hi Patrick – minor note, I think that everywhere you used the word “pervasively”, you probably meant “persistently”.

    A bug might be considered “pervasive” if it showed up in every part of your application (or was visible in many different parts of your source code) – while “persistent” would be for something that keeps on happening (even if, as in this case, it keeps happening only 1% of the time).

    A bug that’s pervasive would probably also be persistent, but not necessarily vice-versa. :)

  35. Tony Mechelynck November 20, 2011 at 1:29 pm #

    Yes, I read it like a whodunit.

    I think Mike Ferrier has the right nothion: “it reminds you of what you did in the past”.

    Once upon a time (30 or 40 years ago), I used to be working in COBOL and assembly language on a 128 KiB mainframe, and someday the machine started halting erratically with invalid operation codes in the resident monitor (the kernel, for you young’uns). After delving deep enough in the code (yes, we had the mainframe’s full OS source in those times, and searching around, even in assembly-language source — C didn’t exist yet — isn’t so hard when you have written down the IP) I found out that two instructions had been written in the wrong sequence, causing an invalid operation code halt in privileged code (so that a “privileged operation code” /exception/ wouldn’t be triggered) if an external interrupt (such as the end of a tape block read, or maybe an interval-timer interrupt) happened while handling an internal interrupt (such as authorizing a certain specific privileged opcode which had been encountered in a usermode program). So I swapped these two instructions in the relevant system macro, recompiled and reinstalled the OS over a weekend, and submitted a software report (by snail mail, of course: no Internet back then). I still remember the exhilaration when the software notice came back, telling me my “temporary solution” had been found equivalent to the official fix.

  36. Sleen November 22, 2011 at 11:46 pm #

    Excellent read. I am inspired now to go write some code. Thank you for that. I will gladly share a water cooler with you.

  37. robert sullivan November 23, 2011 at 7:54 pm #

    I was perversely hoping to read this to confirm my thoughts that, while I dislike working in Spring, I could say “aha, much as Spring is an annoyance, it has a very good security layer, much more mature than Rails (does that even have security?), so I’m really better off, no matter how fun and easy those rails developers are working” – but I guess I can’t.

    “instead of throwing exceptions, it would silently just clear the session and re-run the request.”

    I will say it is annoying when something that used to throw an exception, in this case swallows the exception, and marches on. I guess that’s considered “handling” the exception, but not even a warning in the log? Surely some of the blame lies in making a change like this.

    An environment as close to production as possible would have helped in snaring this bug, but obviously there are limits and one could never truly replicate a prod environment, and in this case with a fairly subtle error due to timing issues. There’s a reason why Doug Lea’s excellent util.concurrency was brought into Java 1.5 – concurrency is hard to get right.

    This does remind me of “The Cuckoo’s Egg”, about an astronomer that notices a 2 cent difference between the standard Unix (BSD in this case) accounting program, and a custom accounting program, written to provide a little extra functionality that the standard tool did not provide. Long story short, it put the author (Clifford Stoll) on the trail of same hackers who gained access into the Lawrence Livermore Lab (Hagbard?). Fascinating story, there are clips on YouTube, including Cliff’s story, and a movie about the Germans (one of the guys apparently committed suicide but the circumstances were strange. Movie is in German but might have subtitles?) Anyway – all because this crazy astronomer wanted to figure out why there was a 2 cent difference in his accounting programs!

  38. Tobias December 8, 2011 at 2:59 pm #

    Yes, 2.3.11 was a doozy of an update. It broke my application too in precisely that way. I dread to think of the collective human hours lost because someone decided that silently deleting sessions would be a good security fix.

  39. emma January 4, 2012 at 3:07 pm #

    You’re a lifesaver – thanks for writing this up. Saved me a long night.

Trackbacks/Pingbacks

  1. Reviews and News on Tech Products, Software and Downloads | PCWorld - November 17, 2011

    [...] the original here: I Saw An Extremely Subtle Bug Today And I Just Have To Tell Someone Tags: advice, business, creator, css, halloween, internet, javascript, marketing, production, [...]

  2. What are Some Great Posts on Debugging Tough Problems? | Debugging - November 17, 2011

    [...] there was a great post on debugging CSRF problems in Rails. I thought it was interesting and had run into something similar but not nearly [...]

  3. I Saw An Extremely Subtle Bug Today… - Monday By Noon - Monday By Noon - November 18, 2011

    [...] I Saw An Extremely Subtle Bug Today And I Just Have To Tell Someone | Kalzumeus Software. [...]

  4. Be a Paranoid Pessimistic Programmer | GrokCode - November 21, 2011

    [...] less prone to bugs.Libraries are updated, frameworks evolve, requirements change. Programmers are constantly fighting against entropy. Defensive programming guards against the effects of software entropy by reducing the possibility [...]

Loading...
Free video + email advice on making & selling software:
(1~2 emails a week.)