In Learning From Your Bugs, I wrote about how I have been keeping track of the most interesting bugs I have come across. I recently reviewed all 194 entries (going back 13 years) to see what lessons I have learned from them. Here are the most important lessons, split into the categories of coding, testing and debugging:
These are all issues that have caused difficult bugs for me in the past:
1. Event order. When handling events, it is fruitful to ask the following questions: Can the events arrive in a different order? What if we never receive this event? What if this event happens twice in a row? Even if it would normally never happen, bugs in other parts of the system (or interacting systems) could cause it to happen.
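A defensive sketch of these questions in code (the connection state machine below is invented for illustration): instead of assuming events arrive exactly once and in order, the handler checks its current state before acting.

```python
class Connection:
    """Hypothetical state machine that tolerates out-of-order and
    duplicate events instead of assuming a fixed sequence."""

    def __init__(self):
        self.state = "idle"
        self.warnings = []

    def on_connected(self):
        if self.state == "connected":          # same event twice in a row
            self.warnings.append("duplicate connect ignored")
            return
        self.state = "connected"

    def on_disconnected(self):
        if self.state != "connected":          # event arrived too early
            self.warnings.append("disconnect while " + self.state)
            return
        self.state = "idle"
```

Recording a warning instead of crashing (or silently misbehaving) means the unexpected ordering is visible in the logs when it eventually happens.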
2. Too early. This is a special case of “Event order” above, but it has caused some tricky bugs, so it gets its own category. For example, if signaling messages are received too early, before configuration and start-up procedures have finished, a lot of strange behavior can follow. Another example: a connection was marked as down even before it was put into the idle list. When debugging that problem, we always assumed it was set to down while it was in the idle list (but then why wasn’t it taken out of the list?). It was a failure of imagination on our part not to consider that things sometimes happen too early.
3. Silent failures. Some of the hardest bugs to track down have (in part) been caused by code that silently fails and continues instead of throwing an error. For example, system calls (like bind) that return error codes that aren’t checked. Another example: parsing code that just returned instead of throwing an error when it encountered a faulty element. Execution continued for a while in a faulty state, making the debugging much harder. It is better to return an error as soon as a failure case is detected.
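As a sketch of the difference (the key=value field format here is made up for illustration), the parser raises as soon as it sees a faulty element rather than returning quietly:

```python
def parse_fields(fields):
    """Parse a list of hypothetical 'key=value' strings, failing fast."""
    result = {}
    for field in fields:
        if "=" not in field:
            # Raising here keeps the failure next to its cause. A bare
            # `return` would let the caller continue in a faulty state,
            # moving the eventual symptom far away from the real bug.
            raise ValueError("malformed field: %r" % field)
        key, _, value = field.partition("=")
        result[key] = value
    return result
```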
4. If. If-statements with several conditions, if (a or b), especially when chained, if (x) else if (y), have caused many bugs for me. Even though if-statements are conceptually simple, they are easy to get wrong when there are multiple conditions to keep track of. These days I try to rewrite the code to be simpler to avoid having to deal with complicated if-statements.
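One such rewrite is replacing a chain of combined conditions with early returns, so each case is decided in exactly one place (the discount rules below are invented for illustration):

```python
def discount(customer_type, order_total):
    # Instead of something like:
    #   if customer_type == "vip" or order_total > 1000: ...
    #   elif customer_type == "new": ...
    # each rule gets its own early return, checked in priority order.
    if customer_type == "vip":
        return 0.20
    if order_total > 1000:
        return 0.10
    if customer_type == "new":
        return 0.05
    return 0.0
```

Each condition can now be read, tested and changed on its own, without reasoning about how it combines with the others.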
5. Else. Several bugs have been caused by not properly considering what should happen if a condition is false. In almost every case, there should be an else-part for each if-statement. Furthermore, if you set a variable in one branch of an if-statement, you should probably set it in the other as well. Related to this is the case when a flag is set. It is easy to only add the condition for setting the flag, but forgetting to add the condition for when the flag should be reset again. Leaving a flag set forever will likely lead to bugs down the road.
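A minimal sketch of the flag problem (the alarm flag is hypothetical): the else-branch that resets the flag is the part that is easy to forget.

```python
def update_alarm(error_count, state):
    if error_count > 0:
        state["alarm"] = True
    else:
        # Easy to forget: without this else-branch, the flag stays set
        # forever after the first error, even when everything recovers.
        state["alarm"] = False
    return state
```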
6. Changing assumptions. Many of the bugs that were the hardest to prevent in the first place were caused by changing assumptions. For example, in the beginning there could only be one customer event per day. Then a lot of code is written under this assumption. At some later point, the design is changed to allow multiple customer events per day. When this happens, it can be hard to change all cases that are affected by the new design. It is easy to find all the explicit dependencies on the change, but the hard part is to find all the cases that implicitly depend on the old design. For example, there may be code that fetches all customer events for a given day. An implicit assumption may be that the result set is never greater than the number of customers. I don’t have a good strategy on how to prevent these problems, so suggestions are welcome.
7. Logging. Visibility into what the program does is crucial, especially when the logic is complicated. Make sure to add enough (but not too much) logging, so you can tell why the program does what it does. When everything works fine, it doesn’t matter, but as soon as (the inevitable) problem happens, you will be happy that you added proper logging.
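In Python this can be as simple as logging the decision together with the reason for it (the order-routing logic below is a made-up example):

```python
import logging

logger = logging.getLogger("orders")

def route_order(order):
    # Log *why* the decision was made, not just that it happened.
    if order["priority"] == "high":
        logger.info("order %s -> fast queue (priority=high)", order["id"])
        return "fast"
    logger.info("order %s -> normal queue (priority=%s)",
                order["id"], order["priority"])
    return "normal"
```

With a log line per branch, the log later tells you not only which queue an order went to, but which condition sent it there.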
As a developer, I am not done with a feature until I have tested it. At a minimum this means that every new or changed line of code has been executed at least once. Furthermore, unit testing or functional testing is good, but not enough. The new feature must also be tested and explored in a production-like environment. Only then can I say that I am done with a feature. Here are some important lessons my bugs taught me about testing:
8. Zero and null. Make sure to always test with zero and null (when applicable). For a string it means both a string of length zero, and a string that is null. Another example: test the disconnection of a TCP connection before any data (zero bytes) was sent on it. Not testing with these combinations is the number one reason for bugs slipping through that I should have caught when testing.
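As a sketch, the zero and null cases for a string argument look like this (normalize_name is a hypothetical function under test):

```python
def normalize_name(name):
    """Hypothetical function under test: trims and title-cases a name."""
    if name is None:
        return ""
    return name.strip().title()

# The zero and null cases deserve explicit tests of their own:
assert normalize_name(None) == ""             # null
assert normalize_name("") == ""               # zero-length
assert normalize_name("  ada lovelace ") == "Ada Lovelace"
```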
9. Add and remove. Often a new feature involves being able to add new configurations to the system, for example a new profile for phone number translation. It is very natural to test that adding a new profile works. However, I have found that it is easy to forget to test the removal of the profile as well.
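A test sketch (the profile store is hypothetical) where the removal half is the part that tends to be skipped:

```python
class ProfileStore:
    """Hypothetical configuration store for profiles."""

    def __init__(self):
        self._profiles = {}

    def add(self, name, rules):
        self._profiles[name] = rules

    def remove(self, name):
        del self._profiles[name]

    def lookup(self, name):
        return self._profiles.get(name)

store = ProfileStore()
store.add("intl", {"prefix": "+46"})
assert store.lookup("intl") == {"prefix": "+46"}   # the obvious test
store.remove("intl")                               # the forgotten one
assert store.lookup("intl") is None
```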
10. Error handling. The code that handles errors is often hard to test. It’s best to have automatic tests that check the error handling code, but sometimes that is not possible. One trick I sometimes use then is to modify the code temporarily to cause the error handling code to run. The easiest way to do this is to reverse an if-statement, for example flipping it from if error_count > 0 to if error_count == 0. Another example is misspelling a database column name to cause the desired error handling code to run.
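The flipped-condition trick looks like this in practice (the rollback logic is a hypothetical stand-in); the flip is strictly temporary and must be reverted before committing:

```python
def save(record, error_count):
    # if error_count > 0:        # original condition
    if error_count == 0:         # temporarily flipped to force this path
        # Error handling code that is otherwise hard to reach in a test:
        return "rolled_back"
    return "saved"
```

Now a perfectly ordinary run (zero errors) exercises the rollback path, and you can watch the error handling work before flipping the condition back.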
11. Random input. One way of testing that can often reveal bugs is to use random input. For example, the ASN.1 decoding of the H.323 protocol operates on binary data. By sending in random bytes to be decoded, we found several bugs in the decoder. Another example is to generate scripts with test calls, where the call duration, answer delay, first party to hang up and so on were all randomly generated. These test scripts exposed numerous bugs, particularly where there was interference between events happening close together.
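A minimal random-input sketch (the length-prefixed decoder is invented for illustration): feed the decoder random bytes and require that it either decodes or raises cleanly, never crashes.

```python
import random

def decode(data):
    """Hypothetical decoder: one length byte followed by the payload."""
    if len(data) < 1:
        raise ValueError("empty input")
    length = data[0]
    if len(data) - 1 < length:
        raise ValueError("truncated payload")
    return bytes(data[1:1 + length])

random.seed(1)                      # fixed seed: failures are reproducible
for _ in range(1000):
    blob = bytes(random.randrange(256) for _ in range(random.randrange(20)))
    try:
        decode(blob)                # must never crash or corrupt state
    except ValueError:
        pass                        # a clean error is an acceptable outcome
```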
12. Check what shouldn’t happen. Often testing involves checking that a desired action happened. But it is easy to overlook the opposite case – to check that an action that shouldn’t happen actually didn’t happen.
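A sketch of the negative assertion (the notifier is hypothetical): the last assert below is the one that is easy to overlook.

```python
sent = []

def notify(user, email_enabled):
    """Hypothetical notifier that must stay silent when email is off."""
    if email_enabled:
        sent.append(user)

notify("bob", email_enabled=True)
assert sent == ["bob"]              # the action happened

notify("alice", email_enabled=False)
assert "alice" not in sent          # ...and what shouldn't happen, didn't
```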
13. Own tools. Usually I have created my own small tools to make testing easier. For example, when I worked with the SIP protocol for VoIP, I wrote a small script that could reply with exactly the headers and values I wanted. That tool made testing a lot of corner cases easy. Another example is a command line tool that can make API calls. By starting small, and gradually adding features as needed, I have ended up with very useful tools. The advantage of writing my own tools is that I get exactly what I want.
It is never possible to find all bugs in testing though. In one case, I made a change to the handling of correlation numbers that consisted of two parts: the routing address prefix (always the same), and the dynamically allocated number from 000 to 999. The problem was that when finding the correlation, the first digit of the dynamically allocated number was mistakenly removed before looking in the table. So instead of looking for e.g. 637, you were looking for 37, which wasn’t in the table. This meant that the first 100 calls worked, and the 900 that followed all failed. So unless I tested more than 100 times before restarting (which I didn’t), I would not find this problem when testing.
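A reconstruction of that bug in Python terms (the details are simplified and the table keying is my assumption): removing the first digit is harmless exactly when that digit is a leading zero, which is why the first 100 numbers worked.

```python
table = {}

def allocate(n):
    """Store a call under its dynamic number and return the padded digits."""
    table[n] = "call-%d" % n
    return "%03d" % n                 # e.g. 42 -> "042", 637 -> "637"

def find_buggy(digits):
    # The bug: the first digit is removed before the lookup. For the
    # first 100 numbers the removed digit is a leading zero ("042" ->
    # "42" -> 42), so the lookup still succeeds; from 100 onward it
    # fails ("637" -> "37"), which fewer than 100 test calls never hit.
    return table.get(int(digits[1:]))

def find_fixed(digits):
    return table.get(int(digits))
```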
14. Discuss. The debugging technique that has helped me the most in the past is to discuss the problem with a colleague. Often it is enough to simply describe the problem to a co-worker for me to realize what the problem is. Furthermore, even if they are not very familiar with the code in question, they can often come up with good ideas of what could be wrong anyway. Discussing with a co-worker has been especially effective with my most difficult bugs.
15. Pay close attention. Often when debugging a problem took a long time, it was because I made false assumptions. For example, I thought the problem happened in a certain method when in fact it never even got to that method in the first place. Or the exception that was thrown wasn’t the one I assumed it was. Or I thought the latest version of the software was running, but it was an older version. Therefore, be sure to verify the details instead of assuming. It’s easy to see what you expect to see, instead of what is actually there.
16. Most recent change. When things that used to work stop working, it is often caused by the last thing that was changed. In one case, the most recent thing changed was just the logging, but an error in the logging caused a bigger problem. To make regressions like this easier to find, it helps to commit different changes in different commits, and to use clear descriptions of the changes.
17. Believe the user. Sometimes when a user reports a problem, my instinctive reaction is: “That’s impossible. They must have done something wrong.” But I have learnt not to react that way. More times than I would like, it turns out that what they report is what actually happens. So these days, I take what they report at face value. Of course I still double check that everything has been set correctly etc. But I have seen so many cases where weird things happened because of unusual configuration or unanticipated usage, that my default assumption is that they are correct and the program is wrong.
18. Test the fix. When a fix for a bug is ready, it must be tested. First run the code without the fix, and observe the bug. Then apply the fix and repeat the test case. Now the buggy behavior should be gone. Following these steps makes sure it actually is a bug, and that the fix actually fixes the problem. Simple but necessary.
Over the 13 years that I have been keeping track of the trickiest bugs I have encountered, a lot of things have changed. I have worked on a small embedded system, on a large telecom system and on a web-based system. I have worked in C++, Ruby, Java and Python. Several classes of bugs from my C++ days have simply disappeared, like stack overflows, memory corruption, string problems and some forms of memory leaks.
Other problems, like loop errors and corner cases, I see far fewer of because I have been unit-testing more logic. But that doesn’t mean there aren’t bugs – there still are. The lessons in this post help me to limit the damage at the three stages of coding, testing and debugging. Let me know in the comments what other tricks and techniques you have found useful when preventing or finding bugs.