A Bug, a Trace, a Test, a Twist

Here is the story of a bug that I caused, found, and fixed recently. It is not particularly hard or tricky, and it didn’t take long to find and fix. Nevertheless, it did teach me some good lessons.

The Bug

At work, one of the systems we make is used by mobile phone operators to deliver text messages (SMSs). It is a fairly complex system, which usually handles hundreds or thousands of messages per second. Recently, we improved the part of the system that handles concatenated SMSs (messages that is delivered in several parts). Before this new feature, all concatenated SMSs were stored in the database, then collected after a short delay and delivered. This was done to make sure all parts were delivered in the correct order, since parts can arrive out of order, and furthermore, the parts can arrive at different nodes.

The new feature was to collect the parts in memory instead of in the database, in order to increase the capacity. All parts of a given message are collected in a specific node, determined by a hashing algorithm on the destination phone number. The new feature worked as expected, but a while ago, we started to get some trouble reports concerning failures to deliver concatenated messages. Included with the error reports was a stack trace showing an ArrayIndexOutOfBoundsException. The problem happened in the part of the code I wrote, where the hashing of the phone number gives which node should handle all the parts of the given message. It is quite straightforward – just take the hash code modulo the number of buckets to find the node:

int getBucket(Digits destMsisdn) {
 return destMsisdn.hashCode() % numberOfBuckets;
}

The exception was thrown because the index was -3. This was puzzling, since the feature was working correctly in almost all cases.

The Trace

The exception is including the information that the index was -3. This is better than a lot of other Java exceptions, but not as good as it could be. Had the index been too high (i.e. 4 for an array of length 4), you would not know from the exception how big the array was. A good exception or error message should  include as much dynamic information as possible. For an ArrayIndexOutOfBoundsException that means both including the offending index, and including the valid array range. Not including all dynamic information is unfortunately a very common failing for many Java exceptions.

The stack trace shows exactly where things go wrong, but it doesn't show why. Why was the index negative? You can't tell just by looking at the stack trace. You also need to know what data the algorithm was operating on. Fortunately, in our system we have the option of using Trace on Error. This is a form of session-based logging (called tracing at work, to distinguish it from traditional logging).

The trace included with the error report showed the complete message received, as well as all key steps taken until the error occurred. Among other things, it included the destination MSISDN (phone number).

The Test

There are many steps when handling the concatenated messages the new way. Receive incoming parts, find which node should collect all the parts, collect the parts, deliver all parts when the last part arrives, time out if not all parts are received etc. If the logic as at all complicated, I like to break it down into separate testable methods, and unit test the parts in isolation. That way, when I am putting it all together, at least I know that each part does what it is supposed to do.

In this case, I had unit tests that tested 3 different phone numbers in three cases: only 1 bucket, 2 buckets, and 1000 buckets:

public void testGetBucket() {
 ConcatInMemoryHandler handler;
 Digits destMsisdn1 = new Digits("46702000021");
 Digits destMsisdn2 = new Digits("6");
 Digits destMsisdn3 = new Digits("4678900112");

 // Only 1 bucket - all buckets should be 0
 handler = new ConcatInMemoryHandler(1, 1,
 UUID.randomUUID(), UUID.randomUUID(), user);
 assertEquals(0, handler.getBucket(destMsisdn1));
 assertEquals(0, handler.getBucket(destMsisdn2));
 assertEquals(0, handler.getBucket(destMsisdn3));

 // 2 buckets
 handler = new ConcatInMemoryHandler(2, 1,
 UUID.randomUUID(), UUID.randomUUID(), user);
 assertEquals(1, handler.getBucket(destMsisdn1));
 assertEquals(0, handler.getBucket(destMsisdn2));
 assertEquals(0, handler.getBucket(destMsisdn3));

 // 1000 buckets
 handler = new ConcatInMemoryHandler(1000, 1,
 UUID.randomUUID(), UUID.randomUUID(), user);
 assertEquals(465, handler.getBucket(destMsisdn1));
 assertEquals(736, handler.getBucket(destMsisdn2));
 assertEquals(378, handler.getBucket(destMsisdn3));
}

The first thing I did was to plug in the phone number from the trace into the unit test. Lo and behold, I got the same exception. Now it was easy to find the cause of the problem: the hashCode-method can (of course) return negative values as well as positive values. Even though I had tested with several different phone numbers, I hadn't picked one big enough to cause the hashCode-method to return a negative value. The phone number from the trace was 543484900001 (slightly modified, to avoid showing a real subscriber's number), and this was big enough to generate a negative hash code.

The fix seemed obvious - just take the absolute value (changing the negative value to the corresponding positive value). However, there was a twist.

The Twist

A long time ago, I read a great article on bug fixing: Three Questions About Each Bug You Find by Tom Van Vleck. When you find and fix a bug, you should ask yourself 3 questions:

  1. Is this mistake somewhere else also?
  2. What next bug is hidden behind this one?
  3. What should I do to prevent bugs like this?

I always try to ask these questions whenever I fix a bug, and in this case a variation of question 2 made me see a problem with my first attempt at a solution:

int getBucket(Digits destMsisdn) {
 return Math.abs(destMsisdn.hashCode()) % numberOfBuckets;
}

We had recently discussed Integer.MIN_VALUE and Integer.MAX_VALUE at work. One of the properties of the binary encoding is that there is one more negative int value than there are positive int values. Therefore I was wondering what would happen if you take the absolute value of Integer.MIN_VALUE. A quick check showed that it returns the same value. So there is one (very unusual case) where the absolute value of a Java int is negative. One solution would have been to explicitly check for this case, and let getBucket() return zero instead in this case. However, I decided on this solution instead (and this is a case that warrants a comment):

int getBucket(Digits destMsisdn) {
 // hashCode() can give negative values. Math.abs() gives
 // negative value for Integer.MIN_VALUE, so important
 // to do Math.abs() *after* taking the remainder.
 return Math.abs(destMsisdn.hashCode() % numberOfBuckets);
}

I also added the phone number as the fourth phone number in the unit test.

Lessons Learned

So, what did I learn from this one bug. Several things:

The hashCode method can return negative values. Most of the time this doesn't matter. But since I was using the hash code to pick an array index, I was only thinking of the value as unsigned. Even though I know that integers can be both positive and negative, in this context I was blind to it.

A trace makes trouble shooting so much easier. There is a big difference between only having a stack trace, and having all the session data available when trying to figure out what happened.

Unit tests help, but don't catch everything. Well-factored code and unit tests make testing a hypothesis and testing a fix quick and easy. But bugs still get through. Therefore, you also need to have a system that is easy to trouble shoot.

Always ask the 3 questions. There may be more to the bug than you think, so make sure to use the 3 questions to see more cases.

Not bad for one bug!

One response to “A Bug, a Trace, a Test, a Twist

  1. My colleague Per-Åke had a good comment. He checked how the hash code is turned into an array index in java.util.HashMap. It turns out that since the length of the backing array is always a power of two, the index is taken by masking out the higher bits with the bitwise and-operator. So no need to take the absolute value there.

    He also noticed the extra hash()-method they use to avoid collisions. Check out the code and the comments in the HashMap source code – it’s worth the read.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s