Thursday, December 10, 2009

Debugging Data Corruption #1

I recently helped someone debug a data corruption problem on an embedded system that was not RTEMS based and that I had not written any code for. Just to be perfectly clear, I knew nothing about the source code of this application or its design when I arrived. They had a global variable whose value was becoming invalid. For the purposes of this discussion, the corrupted variable will be referred to as CorruptedVariable. They had a way to reproduce the corruption and could tell from the behaviour of the program that at least CorruptedVariable was being clobbered.

When I arrived, I listened for a few minutes and watched them reproduce the problem. I then asked whether they knew how to produce a numerically sorted symbol table for their application. They did not, so I explained that any GNU toolset using binutils includes a utility named nm, which produces a symbol table from an executable. I instructed them to use a command similar to the following:

avr-nm -g -n application >application.num

The -g option requests that the symbol table include only global symbols and the -n option requests that the symbol table be sorted by address. We then looked at application.num in a text editor and I had them search for CorruptedVariable.

I then looked at the variables immediately before CorruptedVariable in memory (i.e. at the lower addresses immediately preceding it). The variable immediately preceding CorruptedVariable was a 32-bit integer (RandomVariable), so it was unlikely that accesses to it were the culprit. But the variable before it had buffer in the name. That was a big hint and I asked to see the corresponding source code. The C source code where CorruptedVariable was declared was similar to the following:

float TooShortBuffer[8];
int RandomVariable;
int CorruptedVariable;

So TooShortBuffer was indeed an array and not a structure. For the purposes of this bug, that was awesome. It meant that using an index of 8 or greater (the valid indices are 0 through 7) would overwrite the variables immediately after it in memory. And we had confirmed which variables would be overwritten.
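
To make the arithmetic concrete, here is a minimal sketch that computes where an out-of-range index would land. It is not the customer's code. Global variables are not guaranteed to be placed in declaration order, so the sketch wraps the three declarations in a hypothetical struct (Layout) to get a deterministic layout, and the helper names element_offset and member_at are made up for illustration. It only does offset arithmetic and never performs the bad write.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct mirroring the three globals so the layout is fixed. */
struct Layout {
    float TooShortBuffer[8];
    int   RandomVariable;
    int   CorruptedVariable;
};

/* Assumption: no padding between the array and the next member. */
static_assert(offsetof(struct Layout, RandomVariable) == 8 * sizeof(float),
              "expected RandomVariable to follow the array directly");

enum Member { IN_BUFFER, IN_RANDOM_VARIABLE, IN_CORRUPTED_VARIABLE, OUTSIDE };

/* Byte offset where TooShortBuffer[index] would start. Arithmetic only. */
size_t element_offset(size_t index)
{
    return offsetof(struct Layout, TooShortBuffer) + index * sizeof(float);
}

/* Which member contains the given byte offset. */
enum Member member_at(size_t offset)
{
    if (offset < offsetof(struct Layout, RandomVariable)) {
        return IN_BUFFER;
    }
    if (offset < offsetof(struct Layout, CorruptedVariable)) {
        return IN_RANDOM_VARIABLE;
    }
    if (offset < sizeof(struct Layout)) {
        return IN_CORRUPTED_VARIABLE;
    }
    return OUTSIDE;
}
```

Index 7 is the last element that stays inside the array, and member_at(element_offset(8)) reports RandomVariable. Which index reaches CorruptedVariable depends on the target: where float and int are both 4 bytes it is index 9, while with a 2-byte int a single 4-byte write at index 8 would span both variables.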

I asked what TooShortBuffer was used for and one of the team mentioned that they had recently added something and might have forgotten to extend all the arrays in the program. As a quick test, I asked them to add 100 elements to that array. Not surprisingly the test scenario now worked.

I recommended to the team that the array not be sized with a hard-coded number but with a macro defined to the maximum number of elements. In addition, I recommended that any code using an index to access this array validate that the index is within range. Self-checking programs are so much easier to debug.
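
A minimal sketch of that recommendation might look like the following. The names SAMPLE_BUFFER_MAX, SampleBuffer, sample_buffer_set and sample_buffer_get are hypothetical and are not taken from the customer's code. The size lives in one macro, and every access goes through a helper that rejects an out-of-range index instead of writing past the end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical macro: the single place where the array size is defined. */
#define SAMPLE_BUFFER_MAX 8

static_assert(SAMPLE_BUFFER_MAX > 0, "buffer needs at least one element");

static float SampleBuffer[SAMPLE_BUFFER_MAX];

/* Store value at index. Returns false and writes nothing if out of range. */
bool sample_buffer_set(size_t index, float value)
{
    if (index >= SAMPLE_BUFFER_MAX) {
        return false;
    }
    SampleBuffer[index] = value;
    return true;
}

/* Read the element at index into *value. Returns false if out of range. */
bool sample_buffer_get(size_t index, float *value)
{
    if (index >= SAMPLE_BUFFER_MAX || value == NULL) {
        return false;
    }
    *value = SampleBuffer[index];
    return true;
}
```

A caller can then log or trap on a false return, so a bad index is reported where it happens rather than showing up later as a clobbered neighbour. Growing the buffer means changing only the macro.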

I spent less than an hour at the customer site and left them with a smile.

In my next post, I will show how gdb can be used to help locate the source of stray writes.


  1. Good story.

    It sounds like I'd love to have a job like yours :)

  2. This particular customer is on a project that is perpetually late and perpetually broken. Worse, a significant portion of the non-embedded code they have is 15 years old and ugly. :(

    I prefer the days when I get to work on RTEMS. :)