Answering questions on the newsgroups, I've noticed that several developers seem to find debugging very difficult - not the mechanics of it, so much as knowing the right place to start. This is not to say that they are lazy or stupid - just that debugging is an art unto itself (arguably more so than writing code in the first place - it certainly involves more intuition in my view), and that a few pointers could be useful.
Making use of the techniques discussed on this page won't make you an ace bug-finder in itself - a mixture of patience, experience, intuition and good practice is needed - but my hope is that it can get you started along the right path. Note that although the page title is "Debugging", a lot of the time you may well not need to step through your code in a debugger in order to fix your code. If I'm trying to find a problem in my own code, without external dependencies such as other whole systems being involved, I usually regard it as a failure on my part if I need to use the debugger. It indicates that my code isn't clear enough and my unit tests aren't robust enough.
Reproduce the problem
The worst kind of problem is the kind you can't reliably reproduce. This is often down to issues such as race conditions, external systems, or environmental issues which create different patterns of behaviour on deployed systems and developer boxes. Often you won't know to start with whether or not a problem will be reproducible - and conversely, sometimes a problem which has been hard to reliably reproduce during diagnosis becomes easy to reproduce when it is well understood. In the latter case, when you have found the problem but not yet implemented the solution, if you can work out how to provoke it reliably, you should verify your diagnosis by causing the problem repeatedly.
The more specific you can be in the manner of reproduction, the better. This is often down to testers as well as developers - a good tester will describe steps to reproduce a problem in enough detail to let the developer reproduce it immediately. If the tester can't reproduce it but the developer later finds a way of demonstrating the problem every time, it's worth putting those details into whatever bug tracking system you use - this will help to verify the fix when it has been applied.
The Subversion project team have a good name for the steps required to reproduce a problem - they call it a recipe. (I wouldn't be surprised to find that they weren't the first project to use the term, but I haven't encountered it elsewhere.) I will adopt the term for the rest of this article. A good recipe should be:
- Simple - the steps involved should all be required to reproduce the problem; the more steps that are required, the more avenues of investigation are opened. If a problem occurs in a multi-stage system (such as a log file being generated by one process, consolidated by another, and then consumed by a third) it is helpful to identify which process is at fault as early as possible and restrict most of the recipe to that system. It may be useful to give an "overall" picture of when the problem would occur in the real world, but that is more useful to a support team answering customer concerns than a developer trying to fix a problem.
- Specific in its steps - if user input is required, sample values are useful. If the values are too long to be included and the problem involves the length rather than the actual data, it's worth trying to find out exactly what length is involved - a recipe stating that everything is fine with 60 characters in a field but things go wrong with 61 is very suggestive in terms of where to look. Where user input isn't involved, a sample of whatever the system consumes is very useful if it's available.
- Specific in its environment - this is particularly important for web applications, where different browsers may exhibit different behaviours. State the operating system, browser type and browser version. It's usually not necessary to know what plug-ins etc are installed, but if the problem involves a plug-in (e.g. a Flash video isn't rendering properly) then versions of the relevant plug-ins are useful too.
- Specific in the problem description - just saying that "the text looks wrong" or "the system breaks" isn't useful. Crash dumps, logs, screenshots with areas of concern highlighted, or whatever accurately describes the difference between how you expect it to behave and how it actually behaves - that's what will be really helpful.
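When a recipe hinges on input length, a few lines of code can pin down the exact boundary to quote in the bug report. The following is only a sketch: `save_field` is a hypothetical stand-in for whatever part of the system misbehaves, and the search assumes that every length up to some limit works and every longer one fails.

```python
def save_field(value: str) -> None:
    """Hypothetical stand-in for the system under test: rejects long input."""
    if len(value) > 60:
        raise ValueError("field too long")


def fails(length: int) -> bool:
    """Return True if input of the given length provokes the problem."""
    try:
        save_field("x" * length)
    except ValueError:
        return True
    return False


def smallest_failing_length(known_good: int, known_bad: int) -> int:
    """Binary search for the shortest failing input.

    Assumes known_good works, known_bad fails, and there is a single
    boundary between working and failing lengths.
    """
    while known_bad - known_good > 1:
        middle = (known_good + known_bad) // 2
        if fails(middle):
            known_bad = middle
        else:
            known_good = middle
    return known_bad


if __name__ == "__main__":
    print(smallest_failing_length(known_good=0, known_bad=10_000))
```

With the stand-in above the search reports 61, which is exactly the "fine with 60, broken with 61" kind of detail that makes a recipe suggestive.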
Convert the problem into an automated test
This step isn't always possible, and sometimes it's possible but not practical - but in other situations, it's well worth doing. The automated test may take the form of a unit test, an integration test, or anything else your development context (company, open source project, whatever) uses for its testing. Where possible, a unit test is the ideal - unit tests are meant to be run frequently, so you'll find out if the problem comes back really quickly. The speed of unit test runs also helps when diagnosing the code at fault - if a test runs to completion in a tenth of a second, it doesn't matter if you end up using the debugger and putting the breakpoint just after you intended to - it'll take you longer to move the breakpoint than it will to get the debugger back to the right line.
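As an illustration of turning a recipe into a unit test, here is a minimal sketch using Python's `unittest` module. The function `parse_quantity` and the recipe ("type ` 12 ` with surrounding spaces into the quantity field") are invented for the example, and the function is shown in its fixed state so that the tests pass.

```python
import unittest


def parse_quantity(text: str) -> int:
    """Hypothetical function under test, shown after the fix.

    The imagined buggy version omitted the strip() call, so input with
    surrounding spaces was rejected.
    """
    cleaned = text.strip()
    if not cleaned.isdigit():
        raise ValueError(f"not a quantity: {text!r}")
    return int(cleaned)


class ParseQuantityRegressionTest(unittest.TestCase):
    def test_surrounding_spaces_are_ignored(self):
        # The recipe from the bug report, reduced to a single call.
        self.assertEqual(parse_quantity(" 12 "), 12)

    def test_plain_digits_still_work(self):
        # A test that already passed during diagnosis, kept as documentation.
        self.assertEqual(parse_quantity("12"), 12)

    def test_non_numeric_input_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_quantity("twelve")
```

Saved in a file, these can be run with `python -m unittest`. Because they take a fraction of a second, they can stay in the suite and will flag the problem quickly if it ever returns.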
You may well find yourself writing several unit tests which pass in the course of trying to discover the cause of the problem. I would usually encourage you to leave those unit tests after you've written them, assuming they should pass. Occasionally you may end up writing tests which make very little sense outside the current problem, but even so the "default position" should be to keep a test once it's present. Just because it's not the current problem doesn't mean it won't identify a problem later on - and it also acts as further documentation of how the code should behave. The working unit tests are often a good way to find which layer within a particular system is causing the problem. In a well-layered system, you should often be able to take the input of one layer and work out what its interaction with the layer below should be. If that passes, use that interaction directly as the input to the tests for the layer below, working your way down the system until something misbehaves.
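The idea of working down through the layers can be sketched as follows. Both layers here are hypothetical: `build_request` stands for an upper layer and `find_rows` for the layer beneath it. The expected interaction between them is captured once, checked as the output of the upper layer, and then reused as the input for the lower layer's test.

```python
def build_request(user_input: str) -> dict:
    """Upper layer (hypothetical): turns user input into a storage request."""
    return {"field": "name", "prefix": user_input.strip().lower()}


def find_rows(request: dict, rows: list) -> list:
    """Lower layer (hypothetical): applies a request to stored rows."""
    field, prefix = request["field"], request["prefix"]
    return [row for row in rows if row[field].lower().startswith(prefix)]


# The interaction between the two layers, shared by both tests.
EXPECTED_REQUEST = {"field": "name", "prefix": "sm"}


def test_upper_layer_builds_expected_request():
    assert build_request("  Sm ") == EXPECTED_REQUEST


def test_lower_layer_handles_that_request():
    rows = [{"name": "Smith"}, {"name": "Jones"}, {"name": "smythe"}]
    assert find_rows(EXPECTED_REQUEST, rows) == [
        {"name": "Smith"},
        {"name": "smythe"},
    ]
```

If the first test passes and the second fails, the fault lies in the lower layer or below it, and the same trick can be repeated one level further down.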
Refactor your unit tests as aggressively as you'd refactor your production code. In particular, if you have a large unit test which is actually testing a relatively large amount of code, try to break it down into several tests. If you're lucky, you'll find that a large failing unit test which is based on the original recipe can be broken down so that you end up with several passing tests and a single, small failing test.
Even if you spot the problem early and can fix it in a single line, write a unit test to verify it wherever practical. It's just as easy to make mistakes fixing code as it is writing code to start with, and the test-first principles still apply. (I'm assuming a leaning towards test-driven development here, because I happen to find it useful myself. If you don't do TDD, I wouldn't expect you to necessarily agree with this step.)
Don't assume things work the way they're meant to
A certain amount of paranoia is required when bug-fixing. Clearly something doesn't work as it's meant to, otherwise you wouldn't be facing a problem. Be open-minded about where the problem may be - while still bearing in mind what you know of the systems involved. It's unlikely (but possible) that you'll find that the cause of the problem is a commonly-used system class - if it looks like System.String is misbehaving, you should test your assumptions carefully against the documentation before claiming to have found a definite bug, but there's always a possibility of the problem being external.
Be wary of debuggers. Sometimes they lie. I personally trust the results of logging much more than the display in a debugger window, just because I've seen so many problems (mostly on newsgroups) caused by inaccurate or misleading debugger output. I know this makes me very old-fashioned, but you should at least be aware of the possibility that the debugger is lying. I believe this problem was much worse with Visual Studio 2002/2003 than it is with 2005, but even so a certain amount of care is appropriate. In particular, bear in mind that evaluating properties involves executing code - if retrieving a property changes some underlying values, life can get very tricky. (Such behaviour is far from ideal in the first place, of course, but that's a different matter.)
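To make the point about properties concrete, here is a small Python sketch with a hypothetical `Sequence` class whose property getter changes state. Anything that reads the property, including a debugger evaluating a watch expression, changes what the program sees next.

```python
class Sequence:
    """Hypothetical class whose property getter has a side effect."""

    def __init__(self) -> None:
        self._next = 0

    @property
    def next_id(self) -> int:
        # Reading the property advances the counter, so every read
        # alters the object, whoever performs it.
        value = self._next
        self._next += 1
        return value


sequence = Sequence()
first_read = sequence.next_id   # e.g. a watch expression being evaluated
second_read = sequence.next_id  # what the program itself then receives
print(first_read, second_read)  # prints "0 1"
```

Simply inspecting such an object in a debugger can therefore make the program behave differently from an undisturbed run.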
The main point of this section is to discourage you from ever saying, "The problem can't be there." If you believe it's not, prove it. If it looks like it might be, however unlikely that may seem, investigate further.
Be clear in your mind about correct behaviour
If the purpose of a piece of code (which may or may not be the cause of the problem) is unclear to you, consider taking some time to investigate it a bit. Put some tests around it to see (and document) how it behaves. If you're not sure whether the behaviour is correct or not, ask someone. Either write a test case or possibly a recipe to reproduce the questionable behaviour, and find out what it's meant to do and why. Once it's clear to you, document it somehow - whether with a test, XML documentation, an external document, or just a link from the code to a document which already explains the behaviour.
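Investigative tests of this kind can be very short. The sketch below uses Python built-ins rather than anything from a real project, simply because their behaviour is well documented yet still surprises people. Each test pins down what the code actually does, and the test name records the finding.

```python
def test_round_sends_ties_to_the_nearest_even_number():
    assert round(0.5) == 0
    assert round(1.5) == 2
    assert round(2.5) == 2


def test_split_with_a_separator_keeps_empty_fields():
    assert "a,b,".split(",") == ["a", "b", ""]
    assert "".split(",") == [""]


def test_split_without_a_separator_drops_empty_fields():
    assert "  a  b ".split() == ["a", "b"]
    assert "".split() == []
```

Once tests like these pass, they double as documentation: the next person to wonder how the code behaves can read the answer instead of rediscovering it.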
If the correct behaviour is currently not clearly defined, consider whether or not it should be. It might not need to be - in some situations it's perfectly acceptable to leave things somewhat vague - this gives more wiggle room for a later change. However, that wiggle room should itself be documented somewhere - make it clear what outcomes are acceptable, what outcomes aren't, and why the correct behaviour isn't more tightly specified.
Fix one problem at a time
If a piece of code is particularly bad, you may well spot other problems while you're fixing the original one. In this case, it's important to choose which problem you're going to tackle first, and tackle just that problem. In an ideal world, you should fix one problem, check your code into version control, then fix the next one, and so forth. This makes it easy to tell at a later date exactly which change in code contributed to which change in behaviour, which is very important when a fix to one problem turns out to break something else.
This is a very easy piece of advice to ignore, and I've done so many times in the past - and paid the price. This situation is often coupled with the previous one, where you really aren't clear about exactly what a piece of code is meant to do. It's a dangerous situation, and one which should encourage you to be extra careful in your changes, instead of blithely hacking away. Once when I was in the middle of just such a mammoth debugging session, a good friend compared my approach to taking great big swords and slashing through a jungle of problems, with his favoured approach being a more surgical one, delicately changing the code just enough to sort out the problem he was looking at, without trying to fix the world in the process. He was absolutely right, which is no doubt a reason for one of his projects -
Get your code to help you
Logs can be incredibly useful. They can give you information from the field that you couldn't possibly get with a debugger. However, all too often I see code in newsgroups which removes a lot of vital information. If an exception is thrown in a piece of code which really should be rock solid, don't just log the message of the exception - log the stack trace. If there's a piece of code which "reasonably regularly" throws exceptions (for instance, if one system in a multi-system deployment is offline) then at least try to provide some configuration mechanism for enabling stack traces to be logged when you really need them to be.
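As a sketch of the difference, Python's standard `logging` module can attach the stack trace to a log entry through the `exc_info` argument. The names `load_order`, `handle_request` and the `include_stack_trace` switch are hypothetical; the switch stands in for whatever configuration mechanism a real system would use.

```python
import logging

logger = logging.getLogger("orders")


def load_order(order_id: str) -> dict:
    """Hypothetical operation that fails on a malformed id."""
    return {"id": int(order_id)}


def handle_request(order_id: str, include_stack_trace: bool = True):
    try:
        return load_order(order_id)
    except ValueError as error:
        # exc_info=True attaches the current exception, so the handler
        # writes the full traceback after the message. Passing False
        # logs the message alone.
        logger.error(
            "Failed to load order %r: %s",
            order_id,
            error,
            exc_info=include_stack_trace,
        )
        return None
```

With the trace included, the log shows not only that the id was malformed but also which call path led there, which is usually the part a message alone leaves out.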
Your code should almost never silently swallow exceptions without logging them. Sometimes exceptions are the simplest way of validating user input or configuration information, even though that leaves a bad taste in the mouth - and in those cases a default value is often used and there's no need to log anything. That kind of situation is what I think of as an "expected exception". I know that to many people the whole concept of an exception which you know about in advance is anathema, but in a practical world these things happen. Aside from these situations, exceptions should very rarely be swallowed except at the top level of a stack - a piece of code which has to keep going even if something has gone wrong lower down. At that point, logging is truly your friend, even if there are good reasons why it would produce too much information if you had it permanently enabled. There are plenty of ways of making logging configurable. Use one, and make sure your support engineers know how to turn the logging on appropriately and how to extract the logs produced.
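The two cases can be sketched side by side. In this hypothetical Python example, `parse_timeout` treats a bad configuration value as an "expected exception" and quietly falls back to a default, while `process_all` is a top-level loop that swallows failures so it can keep going, but logs each one with its stack trace.

```python
import logging

logger = logging.getLogger("worker")

DEFAULT_TIMEOUT_SECONDS = 30


def parse_timeout(raw: str) -> int:
    """An "expected exception": bad configuration falls back to a default."""
    try:
        return int(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_SECONDS


def process_all(jobs, process_one) -> int:
    """Top-level loop that must keep going; returns the number of failures."""
    failures = 0
    for job in jobs:
        try:
            process_one(job)
        except Exception:
            # Swallowed so that the loop survives, but never silently:
            # logger.exception records the message and the stack trace.
            logger.exception("Job %r failed", job)
            failures += 1
    return failures
```

How much of this output is kept in normal running is then a matter of logging configuration, not of changing the code.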
Similarly, make your code defensive against bad input. The earlier bad input is caught, the easier it is to find. If you only get an exception once a value has been processed by several methods, it can be very hard to track down the source of the problem. I used to be more aggressive about defending against values being inappropriately null etc on entry to methods. Recently I've become laxer about this, both because of the additional overhead it produces in terms of testing and coding, and because code without such checks is clearer about what it's really trying to achieve, rather than what it's trying to prevent. In an ideal world, contracts would be significantly easier to express. I believe contract support will become more mainstream eventually, and at that point my code is likely to use it far more widely. Until that time, I'm likely to add validation when I think there could be a problem. Once it's there, I generally leave the validation in the code, because if someone passes an invalid value in at some point in time, it's quite possible that the same kind of problem will occur again - at which point we may be able to avoid a lot of investigative work by catching the bad input early.
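A minimal sketch of catching bad input at the point of entry, using hypothetical names (`require_text`, `register_customer`), might look like this in Python. The point is that the error names the offending argument at the boundary, rather than surfacing several calls later as a confusing failure on a missing value.

```python
def require_text(value, name: str) -> str:
    """Reject missing or blank text as soon as it enters the method."""
    if value is None:
        raise ValueError(f"{name} must not be None")
    if not value.strip():
        raise ValueError(f"{name} must not be blank")
    return value


def register_customer(name, email) -> dict:
    """Hypothetical entry point that validates before doing any work."""
    name = require_text(name, "name")
    email = require_text(email, "email")
    return {"name": name.strip(), "email": email.strip().lower()}
```

Whether every method deserves this treatment is the judgment call described above; the sketch only shows how little code is needed where the check is worth having.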
Learn your debugger's capabilities
Modern debuggers have many features which I suspect are underused. Some of them, such as conditional breakpoints and user-defined representations of data structures, can make the difference between a debugging session which lasts a whole day and one which lasts 10 minutes. You don't need to know everything about it off by heart, but an idea of what it's capable of and a good knowledge of the shortcut keys involved for the basics (e.g. step over, step into, set breakpoint, resume execution) are pretty much vital. It helps if everyone on the team uses the same shortcuts - unless you have a good reason to change the defaults, don't. (One such reason may be to make two different IDEs use the same set of shortcuts. For example, I usually make Eclipse use the same shortcuts as Visual Studio.)
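The two features named above exist in some form in most environments. As an illustration outside Visual Studio and Eclipse, here is a sketch in Python with a hypothetical `Order` class: defining `__repr__` gives debuggers and log statements a compact summary of the object instead of a dump of every field.

```python
class Order:
    """Hypothetical data structure with a debugger-friendly summary."""

    def __init__(self, order_id: int, lines: list) -> None:
        self.order_id = order_id
        self.lines = lines  # list of (description, price) pairs

    def __repr__(self) -> str:
        # Shown wherever repr() is used, such as a debugger prompt.
        return f"<Order {self.order_id}: {len(self.lines)} lines>"


def total_price(order: Order) -> int:
    return sum(price for _, price in order.lines)


order = Order(1042, [("tea", 3), ("cake", 4)])
print(repr(order))  # prints "<Order 1042: 2 lines>"
```

In Python's built-in debugger, pdb, a conditional breakpoint can then be set with a command along the lines of `break total_price, order.order_id == 1042`, so that execution stops only for the one order of interest and not on every call.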
This is one area I'm probably weak in, due to my attempts to usually avoid working in the debugger in the first place. I still believe that it's worth knowing the debugger reasonably well, and I still believe that you should usually only start stepping through code when you really need to, having tried to understand what it will do through inspection and by adding investigative unit tests. The debugger may not quite be a last resort, but I believe it shouldn't be far off. It depends on the situation, of course - the more control you have over the system, the less necessity there should be for using a debugger. If you're interoperating with other systems, you may well need to be running debuggers on all of them in order to make sure that the appropriate code triggers fire at just the right time.
Rest often, and involve other people
Debugging is hard. I have the utmost respect for those whose day to day work involves fixing other people's code more than writing their own. It can easily be demoralising, and working with a poorly documented quagmire of code can bring on physical tiredness quite easily. Take this into account: if you're stuck, take a break, have a walk, have a coffee, a chocolate bar - whatever helps you.
Don't feel ashamed if you need to get someone else involved - if behaviour seems impossible, or if you just don't know where to look, ask. Watch what the other person is doing, both to learn from them and to tell them if it looks like they're going down a blind alley you've already seen. (If they are, it may still be productive to go down it again - you may have missed something - but you should at least let them know that you think it may not be useful.)
Test, test, test
Once you believe you've fixed the problem, do everything you can (within reason) to test it. Obviously your first port of call should be unit tests - the unit tests you've added to isolate the problem should now pass, but so should all the rest of the tests. In addition, if there's a "system level" recipe for the problem, do your best to go through the recipe to verify that you've really fixed the problem. Quite often, a high-level problem is caused by several low-level ones - it's easy to find the first low-level one, fix it, and assume that means the high-level problem will go away. Sometimes quite the reverse happens - the high-level behaviour becomes worse because a more serious low-level problem comes into play.
If the problem was originally raised by a tester, it can often be useful to demonstrate the fixed behaviour to them on your development box to check that you really have done what was anticipated. This is particularly true of aesthetic issues in the UI. Collaboration at this point can be quick and cheap - if you mark a problem as fixed in your bug-tracking system, wait for the tester to retest and fail the fix, reassign it, then pick it up again, you may well find you're no longer in nearly as good a position to fix the problem, quite aside from the time wasted in administrative overhead. It's important to have good relationships with your test engineers, and the more confidence they have that you're not going to waste their time, the more productive you're likely to be with them. If you find you do have to go back and forth a few times, be apologetic about it, even if it would have been hard to avoid the problem. (At the same time, do press testers for more details if their recipes aren't specific enough. Developing a productive code/test cycle is a two-way process.)
Consider the bigger picture
Problems often come in groups. For example, if you find that someone has failed to perform appropriate escaping/encoding on user input in one situation, the same problem may well crop up elsewhere. It's much cheaper to do a quick audit of potential errors in one go than it is to leave each error to be raised as a separate problem, possibly involving different developers going through the same investigations you've just performed.
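A quick audit can sometimes be automated. The sketch below assumes a handful of hypothetical rendering functions, one of which has the same missing-escaping bug as the one originally reported, and feeds the same hostile input to all of them. Only `html.escape` is a real library call; everything else is invented for the example.

```python
import html


def render_comment(text: str) -> str:
    return f"<p>{html.escape(text)}</p>"


def render_profile_name(name: str) -> str:
    return f"<h1>{html.escape(name)}</h1>"


def render_search_heading(query: str) -> str:
    # The same mistake as the original bug: user input is not escaped.
    return f"<h2>Results for {query}</h2>"


RENDERERS = [render_comment, render_profile_name, render_search_heading]
HOSTILE_INPUT = '<script>alert("x")</script>'


def unescaped_renderers() -> list:
    """Return the names of renderers that let raw markup through."""
    return [
        renderer.__name__
        for renderer in RENDERERS
        if "<script>" in renderer(HOSTILE_INPUT)
    ]


print(unescaped_renderers())  # prints "['render_search_heading']"
```

One pass like this finds the sibling bug before it is reported separately and investigated from scratch by someone else.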
In a somewhat similar vein, consider other effects your fix may have. Will it provoke a problem elsewhere which may not be fixable in the short term? Will it affect patching or upgrades? Is it a large fix for this point in your product lifecycle? It's almost always worth getting to the bottom of what a problem is, but sometimes the safe thing to do is leave the code broken - possibly with a "band-aid" type of fix which addresses the symptoms but not the cause - but leave the full fix for the next release. This does not mean you should just leave the bug report as it is, however. If you've got a suggested fix, consider producing a patch file and attaching it to the bug report (making it clear which versions of which files need patching, of course). At the very least, give a thorough account of your investigations in the bug report, so that time won't be wasted later. A bug status of "Fix next release" (or whatever your equivalent is) should mean just that - it shouldn't mean, "We're unlikely to ever actually fix this, but we'll defer it one release at a time."