I have been working on a project recently where there has been a lot of tricky debugging to do. The code I’m working on has to live in the same process as third-party code that is not under my control, and which calls into my code unpredictably, often from multiple threads or signal handlers. This is an application environment that’s particularly prone to rare crashes and deadlocks that somehow elude my own thorough test procedures but crop up nonetheless on the machines of potential customers during product evaluations. I used to find these kind of bugs intimidating and nerve-wracking, but (at least on this project and for the moment) I seem to be a match for them.
The process of debugging is (or ought to be) one of calmly, methodically and steadily gathering data that narrows the location of the problem until it can no longer hide and is forced to reveal itself. In my current project the first step is to acquire the log and (in the case of a crash) the backtrace from the customer and analyze it for the sequence of events leading up to the crash or deadlock. This identifies important features of the environment (which programs were running? are they multi-threaded? do they handle signals?) and usually pinpoints a suspect area of code which can be inspected for race conditions and deadlocks. Inspecting the code usually leads to a hypothesis or two about the cause; but even if it doesn’t I develop a test case that thoroughly exercises the problem area. With some experimentation and a lot of thought I’ve so far always found it possible to reproduce the problem under controlled conditions, whereupon it can be analyzed in the debugger.
Writing this is of course a form of hubris: no doubt I will return to work on Monday to find in my inbox a customer report describing a bug that I lack the skill to track down and fix. This hasn’t yet happened in my career, though there have been a few occasions where I have been in doubt as to the outcome, one of which I described in my post ‘The last lousy bug’. Programmers like to call these accounts “war stories”—joking, of course, but joking seriously. It is stressful to know that something is wrong in your code, but not to know what it is or how to find it, and to have it hanging over your head day after day as you fail to make progress towards solving it.