Thursday, December 10, 2009

PC386 Memory Size Bug

Another day, another bug. At least this one was interesting. The initial report was that Ada programs compiled for the pc386 BSP would run on the Qemu simulator but would not run on the real embedded PC we were using. Since GNAT/RTEMS toolsets are built from source and users do not get the luxury of using the RPMs, my first suspicion was an Ada-specific issue. But after a few days of thinking on this, it occurred to me that the code was locking up long before any Ada code ran. I stripped the test case down to C only and added a dummy function for gnat_main(). This used exactly the same RTEMS application configuration and initialization as the broken Ada program. This program failed in exactly the same way. Feeling the rush of some success, I did a binary search to determine which configuration setting was making it fail. After a few attempts, I discovered that when the application was configured to zero out the RTEMS Work Area, the board locked up. Eureka!

With that piece of insight, I began to focus on the code that zeroed memory. I looked at the starting address and length. The BSP initialization claimed there was 1007MB of RAM available on the board, which looked fine given the board's configuration. I decided to try deliberately lowering the amount of RAM RTEMS knew about. Various low values worked, so I tried 1004MB. That worked too. I suddenly realized that some of the memory must be reserved for another purpose. Since I now had a workaround, I went away to think.

I looked up the manual for the motherboard's chipset and, lo and behold, I had forgotten that the video controller takes its memory from main RAM. We needed to stay away from that area. I posted to the RTEMS mailing list and it was suggested that we should trust the RAM size provided by the multiboot information when it is available, NOT the amount determined by our dynamic sizing probe. Looking at the 4.9 release branch, it turned out that the code there did give priority to the multiboot information.
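To make the "multiboot first, probe second" rule concrete, here is a minimal sketch in C. It is not the actual pc386 BSP code. The struct covers only the first three fields of the Multiboot (version 1) information structure, and the names mb_info_head, MB_FLAG_MEMORY and choose_ram_size() are made up for illustration. It assumes the usual Multiboot convention that mem_upper counts kilobytes above the 1MB mark and is only valid when bit 0 of flags is set.

```c
#include <stddef.h>
#include <stdint.h>

/* Leading fields of the Multiboot v1 information structure. */
struct mb_info_head {
    uint32_t flags;     /* bit 0 set => mem_lower/mem_upper are valid */
    uint32_t mem_lower; /* KB of memory below 1MB */
    uint32_t mem_upper; /* KB of memory above 1MB */
};

#define MB_FLAG_MEMORY 0x1u

/*
 * Hypothetical helper: pick the RAM size (in bytes) the BSP should use.
 * Prefer what the boot loader reported; fall back to the size found by
 * the dynamic probe only when no valid multiboot memory info exists.
 */
uint64_t choose_ram_size(const struct mb_info_head *mbi, uint64_t probed_bytes)
{
    if (mbi != NULL && (mbi->flags & MB_FLAG_MEMORY) != 0 &&
        mbi->mem_upper != 0) {
        /* mem_upper starts at the 1MB mark, so add that 1MB (1024KB) back. */
        return ((uint64_t)mbi->mem_upper + 1024u) * 1024u;
    }
    return probed_bytes;
}
```

The point of the ordering is that the boot loader's figure comes from the BIOS, which already accounts for RAM claimed by devices such as a shared-memory video controller, while a write-and-read-back probe cannot tell that such memory is off limits.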

I compared the two pieces of code and quickly came to the conclusion that updates had broken this code in a subtle way. A little restructuring and cleanup and the problem was resolved. But it had been a long and circuitous path from initial report to resolution.

What are the lessons to learn?

In this case, there are two lessons to take away. First, a problem may not be what it first seems to be. My first impression was that this was an Ada-related issue, not a very low-level, BSP-specific issue.

The second lesson is that divide and conquer is a very good strategy. You need to narrow down the problem space so it is possible to track down the specific issue.
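The binary search over configuration settings that cracked this bug can be sketched in a few lines of C. This is only an illustration of the idea, not something I ran at the time: bisect_culprit() and culprit_in_range() are hypothetical names, and the sketch assumes the failure is deterministic and caused by exactly one setting out of a non-empty list. In real life the "does it fail?" test is a rebuild and a run on the target board.

```c
#include <stddef.h>

/*
 * Caller-supplied test: returns nonzero if the program FAILS when only
 * the settings with index in [first, last) are enabled.
 */
typedef int (*fails_fn)(size_t first, size_t last, void *ctx);

/*
 * Return the index of the single offending setting in [0, count).
 * Assumes count >= 1 and that exactly one setting triggers the failure.
 */
size_t bisect_culprit(size_t count, fails_fn fails, void *ctx)
{
    size_t lo = 0;
    size_t hi = count;

    /* Invariant: the culprit's index is always in [lo, hi). */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (fails(lo, mid, ctx)) {
            hi = mid; /* culprit is in the lower half */
        } else {
            lo = mid; /* lower half is clean, so it is in the upper half */
        }
    }
    return lo;
}

/*
 * Stand-in for "build and run on the board": ctx points at the index of
 * the setting that is secretly broken.
 */
int culprit_in_range(size_t first, size_t last, void *ctx)
{
    size_t culprit = *(const size_t *)ctx;
    return culprit >= first && culprit < last;
}
```

Each pass halves the number of suspects, so even a long list of configuration settings takes only a handful of build-and-run cycles to narrow down to one.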
