When I arrived, I listened a few minutes and watched them reproduce the problem. I then asked they knew how to produce a numerically sorted symbol table for their application. They did not and I explained that with any GNU toolset using binutils, there was included a utility named nm which is used to produce a symbol table from an executable. I instructed them to use a command similar to the following:
avr-nm -g -n application >application.num
The -g option requests that the symbol table include only global symbols and the -n option requests that the symbol table be sorted by address. We then looked at application.num in a text editor and I had them search for CorruptedVariable.
I then looked at the variables immediately before (e.g. lower addresses immediately preceding) CorruptedVariable in memory. The variable immediate preceding CorruptedVariable was a 32-bit integer (RandomVariable so it was unlikely that accesses to it were the culprit. But the variable before it had buffer in the name. That was a big hint and I asked to see the corresponding source code. The C source code where the CorruptedVariable was declared was similar to the following:
float TooShortBuffer[8];
int RandomVariable;
int CorruptedVariable;
So TooShortBuffer was indeed an array and not a structure. For the purposes of this bug, that was awesome. It meant that using an index that was greater than 8 would overwrite the variables immediately after it in memory. And we had confirmed what the variables which would be overwritten were.
I asked what TooShortBuffer was used for and one of the team mentioned that they had recently added something and might have forgotten to extend all the arrays in the program. As a quick test, I asked them to add 100 elements to that array. Not surprisingly the test scenario now worked.
I recommended to the team that the array now be sized with a hard number but with a macro defined to the maximum number of elements. In addition, I recommended that when using an index into this array to access it, validate that the index is within range. Self-checking programs are so much easier to debug.
I spent less than an hour on the customer site and left then with a smile.
In my next post, I will show how gdb can be used to help locate the source of stray writes.
Good story.
ReplyDeleteIt sounds like I'd love to have a job like yours :)
This particular customer is on a project that is perpetually late and perpetually broken. Worse, a significant portion of the non-embedded code they have is
ReplyDelete15 year old and ugly. :(
I prefer the days when I get to work on RTEMS. :)