Tuesday, March 8, 2011

Coverage Testing Finds Unexpected Bug

Recent posts have been about debugging experiences. Jennifer and I have been working on a simple implementation of the RTEMS priority based scheduler to illustrate how to write an alternative scheduler. The primary RTEMS Scheduler is deterministic (e.g. constant, predictable) in performance because it uses a FIFO per priority and a two level bit map to assist in look ups. The simple scheduler uses a single list for all tasks and searches down the list to perform inserts. Its worst case is thus O(n) where n is the number of ready tasks. But this post is above debugging, not about the new scheduler. We will discuss that another time.

Jennifer and I want the simple scheduler to have 100% test coverage when it is merged. This has required us to run coverage analysis on RTEMS and check the results. We haven't yet merged this code so it isn't showing up in published runs. She noticed that thee method _Scheduler_priority_Enqueue_first was reported as never executed. We both found this incredulous since this is called as part of inheriting a priority. We knew this was well tested. She was perplexed and asked me what I thought. It turned out to be a recently introduced typo where _Scheduler_priority_Enqueue was configured as the handler for both the enqueue and enqueue entries points in the Scheduler Priority table. This also explained a couple of test failures Jennifer had noticed but hadn't investigated yet. This is the pertinent part of patch:

- _Scheduler_priority_Enqueue, /* enqueue_first entry point */ \
+ _Scheduler_priority_Enqueue_first, /* enqueue_first entry point */ \

Again, this problem was highlighted by an unexpected drop in the coverage results. RTEMS has very good test coverage and an unexpected change can indicate a problem. The fact that this method was never called was a huge hint to the cause. The answer popped almost instantly in my head. Debugging the test failures would almost certainly have taken much longer.

Debugging an Invalid Memory Access

This time we have a guest at RTEMS Ramblings -- Chris Johns. Chris is a long time RTEMS contributor and member of the Steering Committee. He was the second person outside the core team to submit code to RTEMS. He has fielded some pretty impressive RTEMS applications. This is his discussion of he and I trying to track down an invalid memory access on the SPARC/SIS BSP. From here on.. "I" == Chris. :D

I have a SIS BSP failure with RTEMS head (+rbtree patch) and the latest tools on a MacOS host running the "Hello" sample application in the erc32 simulator. Joel is not seeing this so I have to dig into what has been a stable platform for me to developer the RTL code on. This is an account of the debugging session as I work though the problem.

I run the hello test and all I get is:

$ sparc-rtems4.11-run sparc-rtems4.11/c/sis/testsuites/samples/hello/hello.exe

Memory exception at ffffffe0 (illegal address)


Not much help from the tool. The same happens if I run the same application in gdb.

Setting a break point on boot_card and entering 'r' to run the application shows me the target is loading, booting then reaching the boot_card function. Nice because this means the tools are not broken and RTEMS and the BSP are sort of sane. I build the MacOS tools from source using the SpecBuilder tool and track Ralf's changes closely. This means there is always a chance something went wrong in the build. I have never seen this but there can always be a first time. I can step the code up to the API_Mutex_initialization call then something goes wrong but I am not sure where. Maybe I should look from the bottom up.

I decide to track down who is printing the error message "Memory exception ...". Running strings and grep on the executable shows it is not in RTEMS. I see it is generated by the simulator. I also see the simulator generates a trap 9 exception. This exception is not handled in the target so the simulator terminates. Fair enough but the stopped simulator destroys the state information of where the fault is which means I cannot see what happens.

A 'sparc-rtems4.11-objdump -D --source hello.exe | less' shows me trap 9 is address 0x2000090 so I set a break point with b *0x2000090 and run again. I break at the trap 9 entry point and stop the simulator exiting:

(gdb) c
Memory exception at ffffffe0 (illegal address)

Breakpoint 4, trap_table () at c/src/lib/libbsp/sparc/erc32/../../sparc/shared/start.S:68
68 BAD_TRAP; ! 09 data access exception


I can now inspect the state of the processor and try and find the source of the problem. First stop is a back trace:

(gdb) bt
#0 trap_table () at c/src /lib/libbsp/sparc/erc32/../../
sparc/shared/start.S:68

Nothing helpful here. We know we are at this location and what we want is what happened before this.

I do not know SPARC processors very well and a dump of the registers gives me little information. I do know the stack works down and most processors save the return address on the stack. GDB is nice by providing me with the stack pointer as $sp. I dump the stack:

(gdb) x /32xw $sp
0x23ffcd0: 0x00000034 0x02007378 0x0200737c 0x00000008
0x23ffce0: 0x00002000 0x02014928 0x00000001 0x00000000
0x23ffcf0: 0x02012a30 0x02014924 0x02012a30 0x00006054
0x23ffd00: 0xffffffff 0x00000004 0x023ffd38 0x0200792c
0x23ffd10: 0x00000000 0x00000000 0x00000000 0x00000000
0x23ffd20: 0x00000000 0x00000000 0x00000000 0x00000000
0x23ffd30: 0x00000000 0x00000000 0x02012a30 0x00006089
0x23ffd40: 0x0201a9ac 0x00000008 0x020148b0 0x02014928


I have a separate window open with the sparc-rtems4.11-objdump output in less so I can search around with ease. With a few simple 'less' commands I can find the code at a specific address. The first address is '0x02007378' which must be the last one pushed. In less a '1G' takes me to the start of the dump then entering '/2007378' and enter brings up some code to do with the heap:

02007374 Heap_Protection_block_check_default:
static void _Heap_Protection_block_check_default(
Heap_Control *heap,
Heap_Block *block
)
{
2007374: 9d e3 bf a0 save %sp, -96, %sp
if (
2007378: c2 06 60 04 ld [ %i1 + 4 ], %g1

Dumping '$l1' gives:

(gdb) p /x $i1
$14 = 0xffffffdc


Dumping '$l1' gives:

(gdb) p /x $i1
$14 = 0xffffffdc

This is very close to the address in question. Time to run again this time with a break point on this address and a couple of displays to help me see what is happening:

(gdb) b *0x2007378
(gdb) display /i $pc
(gdb) display /x $i1
(gdb) r

The break point gets hit a number of times and the arguments all look ok so just continue. On the 6th hit of the breakpoint we get something that does not look ok:

(gdb) c
Breakpoint 8, _Heap_Protection_block_check_default (heap=0x2012a30, block=0xffffffdc) at c/src/../../cpukit/score/src/heap.c:149
149 if (
2: /x $i1 = 0xffffffdc
1: x/i $pc
=> 0x2007378 &#60_heap_protection_block_check>default+4&#62: ld [ %i1 + 4 ], %g1

A back trace this time is much better:

(gdb) bt
#0 _Heap_Protection_block_check_default (heap=0x2012a30, block=0xffffffdc) at c/src/../../cpukit/score/src/heap.c:149
#1 0x0200c82c in _Heap_Protection_block_check (heap=0x2012a30, alloc_begin_ptr=) at ../../cpukit/../../../sis/lib/include/rtems/score/heap.h:625
#2 _Heap_Free (heap=0x2012a30, alloc_begin_ptr=) at c/src/../../cpukit/score/src/heapfree.c:119
#3 0x02007d08 in _Objects_Extend_information (information=0x2012b18) at c/src/../../cpukit/score/src/objectextendinformation.c:224
#4 0x02006cc8 in _API_Mutex_Initialization (maximum_mutexes=1) at c/src/../../cpukit/score/src/apimutex.c:23
#5 0x0200672c in rtems_initialize_data_structures () at c/src/../../cpukit/sapi/src/exinit.c:125
#6 0x0200137c in boot_card (cmdline=) at c/src/lib/libbsp/sparc/erc32/../../shared/bootcard.c:163
#7 0x02001158 in zerobss () at c/src/lib/libbsp/sparc/erc32/../../sparc/shared/start.S:334
#8 0x02001158 in zerobss () at c/src/lib/libbsp/sparc/erc32/../../sparc/shared/start.S:334
Backtrace stopped: previous frame identical to this frame (corrupt stack?)


Walking up the stack with the 'up' command until we end up in the _Object_Extend_information call:

(gdb) up
#3 0x02007d08 in _Objects_Extend_information (information=0x2012b18) at c/src/../../cpukit/score/src/
objectextendinformation.c:224
224 _Workspace_Free( old_tables );
(gdb) p old_tables
$16 = (void *) 0x0

It would seem _Object_Extend_information is calling the workspace with a NULL which should be ok or it use to be ok. I chat with Joel and he informs me the code in the interface to the workspace heap has changed and this has exposed some bugs. A check of the code in the heap free call shows the heap protection check is being called before the block pointer has been validated. This also explains why Joel does not see the problem. I built RTEMS with the debug configure option. There are other cases so I will need to perform a careful check of all the heap code to make sure we are correct.

It looks like I am not the only one who has the problem. Peter Dufault has just posted to the RTEMS user list:

http://www.rtems.org/ml/rtems-users/2011/february/msg00142.html

The PR Peter has kindly raised is:

https://www.rtems.org/bugzilla/show_bug.cgi?id=1746

I was talking with Joel about the changes when I noticed the heap extend now allows discontinuous memory regions. I did not know this was allowed and I had been assuming the memory had to be continuous because of the code in _Heap_Is_block_in_heap. A check of rtems_region_extend shows it uses heap extend and its documentation states memory must be continuous. I have raised a PR to handle this:

https://www.rtems.org/bugzilla/show_bug.cgi?id=1747

Back to Joel.. Chris' discussion should provide some insight into how an free software project and community work. I made a modification to move some scattered NULL checks before calls to _Workspace_Free into that routine. Sebastian Huber noted that since NULL pointers shouldn't be processed by _Heap_Free, so the check was technically redundant so he removed it. This exposed a latent bug in _Heap_Free when passed an invalid address and debug checks were enabled. At the same time the Chris and I were tracking this down, a user tripped the same bug and filed a PR. In reviewing the code, Chris and I found other cases which could cause the same fault and a disconnect between extending the heap and checking whether a block was in the heap. This was a side-effect of recent enhancements and had never been caught by a user. So we see a community coding together, reviewing each other's code, and working together to resolve an issue.

It is important to note that this bug was only present in the RTEMS Development Head. It cannot occur in released versions.