Monday, December 28, 2009

RTEMS Turns 21

The RTEMS ticker test (a.k.a. sp01) was the first test created during the development of RTEMS. It sets the date and time to 9:00 pm on 31 December 1988. This coincides with the earliest part of the RTEMS Project. Since there is no birth certificate, we are treating this as RTEMS' birth date.

Please toast RTEMS turning 21 wherever you are.

Please don't be online -- share the New Year's Eve with loved ones. :)

Sunday, December 13, 2009

Violating the Rule of Least Surprise


When you mention user interfaces, people usually think of a graphical interface with mouse and keyboard or a touchscreen. But user interfaces do not exist solely in the world of computer displays. In the world of embedded systems, a user interface can take on any number of variations. One guiding principle that particularly applies in this domain is the "Rule of Least Surprise." An example I use is that if you put someone behind the steering wheel of a vehicle of any type, they would expect that turning the wheel to the left directs the vehicle in that direction. To do anything else would surprise even a three-year-old who has ridden a tricycle.

Even though I love my current car, it has two areas which violate this rule. As the picture here shows (source), the instrument panel is in the center -- not directly in front of the driver. Both interface surprises stem from this placement.

The first violation is that when the lights are on at night, there is no light at all in the driver's direct field of vision. It is very easy to forget to turn the lights on at night when in a well-lit area because you have to look to the middle to realize that the instruments are unlit. This has actually resulted in me getting pulled over by a friendly police officer to remind me to turn my lights on.

The second violation occurs when you turn on the turn signal to turn left. The blinking light indicating you have done so is on the right-hand side of the driver's field of vision. In other vehicles, it would be on the left-hand side of your field of vision.

Neither of these is a major issue and I have grown accustomed to both. But both show how humans have expectations about how common devices operate. When your clever design changes this, users are surprised. In the end there may be a benefit, but often it is a difference just to be different. Is there a surprise in a system you are familiar with? Is it an improvement or just a difference? Share.

Saturday, December 12, 2009

Fedora 12 Upgrade Experience

A few weeks ago, I upgraded the GNU/Linux computers I am personally responsible for from Fedora 10. This includes a work computer (Dell Latitude D830), a home computer (Dell Inspiron 2400), and the home server (upgraded Dell Dimension 2100). The home server is the host of an Elvis Costello Fan Forum phpBB site as well as the Elvis Costello Wiki so it is not just my family that is impacted by down time. I upgraded from Fedora 10 to 11, ran for about a week to make sure things were OK, and then upgraded to Fedora 12. In the end, all of the computers are up and stable performing their assigned tasks but there were issues along the way which I will describe here.

I upgraded via DVD and all of the computers handled that OK once I remembered the magic required to boot from DVD. But after the upgrade was complete, two of the computers did not boot. As happens not so infrequently during GNU/Linux upgrades, GRUB had not been updated correctly on them and I had to go into rescue mode to address that. This just required mounting the root partition, changing root, and running grub-install. Scary the first time it is required, but not after a few times.

One of the laptops booted nicely to the GUI login prompt but was clearly at the wrong resolution. I did a little net research and learned that deleting the /etc/X11/xorg.conf file and restarting X11 would result in a probe and very likely the correct video settings. This worked.

The only issue shared by all of the machines was that yum was broken after the upgrade from Fedora 10 to 11. Before I could load further updates using yum upgrade, I had to manually fetch a current yum rpm file and install it using rpm.

The server had three very strange issues during the upgrade.

First, I couldn't log in. The GUI login screen was flashing. I remembered the magic key sequence to switch to a console display, logged in, and checked out the logs. I couldn't tell exactly what was wrong but I had a guess and tried it. Apparently the graphical login program GDM had trouble parsing my email address in the name field of /etc/passwd. I think something was trying to use the user name field in an XML file because my entry was something like "Joel Sherrill ". I suspect that the presence of my email address looking like an XML tag confused it. I deleted that and moved along. GDM was happy.
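If that guess is right, the failure mode is text being copied into XML without escaping. A minimal sketch in C of what escaping such a field looks like follows. It assumes nothing about how GDM actually handles the name field; xml_escape is a hypothetical helper written for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy src into dst, replacing the characters that
 * are special in XML text with entity references. Returns 0 on success
 * or -1 if dst (dst_size bytes) is too small for the escaped string.
 */
int xml_escape(const char *src, char *dst, size_t dst_size)
{
    size_t used = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++) {
        char single[2] = { *src, '\0' };   /* default: copy unchanged */
        const char *rep = single;

        switch (*src) {
        case '<': rep = "&lt;";   break;
        case '>': rep = "&gt;";   break;
        case '&': rep = "&amp;";  break;
        case '"': rep = "&quot;"; break;
        default:                  break;
        }

        size_t len = strlen(rep);
        if (used + len + 1 > dst_size)     /* keep room for the '\0' */
            return -1;
        memcpy(dst + used, rep, len);
        used += len;
    }

    if (used + 1 > dst_size)
        return -1;
    dst[used] = '\0';
    return 0;
}
```

With this, a made-up name field such as "Jane Doe <jane@example.org>" becomes "Jane Doe &lt;jane@example.org&gt;", which an XML parser reads as plain text rather than as the start of a tag.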

Next on the list was that something changed in MySQL and the user name table was reported as being corrupt. I found reports on the web where others had suffered this and decided the easiest solution was to stop MySQL, delete the old table, and restart MySQL. When restarted, MySQL would create the table with default contents. From there, I could easily add back in the handful of accounts that were needed. The actual database tables were not impacted. So I felt lucky and moved on to the next problem.

At this point, I started looking at the actual web content served from the server. I first looked at the phpBB-based Elvis Costello Fan Forum. A bit of the page header was displayed along with an error message. I don't remember why, but somehow I couldn't see the complete error on my side. I enlisted some help on the #rtems IRC channel to see if others could see what the message said. Someone cut and pasted me the error message. I Googled the message and learned that the new php 5.3 requires the /etc/php.ini file to explicitly set the timezone. It does not trust the operating system's setting. Without the timezone being set for php, php programs reported nasty error messages. This meant that phpbb3 would not run correctly and the Fan Forum was not available. Adding the following to /etc/php.ini resolved this.


[Date]
date.timezone = America/Chicago


I inspected the pages served from the Elvis Costello Wiki and there were some minor error messages in the logs. Apparently, Mediawiki had some minor mistakes in its coding which php 5.3 detected. These resulted in a plethora of errors in the web logs.

I am a daily GNU/Linux user who rarely works in MS-Windows. As a long time user of various UNIX desktop environments, I prefer mousing over a window to move focus. I do not like to click on a window to move the focus. In previous Fedora distributions, GNOME included an applet to switch the preference under "System > Preferences > Windows". This applet is no longer included by default. I don't know why it isn't part of the default install anymore. Is 59K really that critical? Anyway, the following installed it.


yum install control-center-extra


At this point, I have been using Fedora 12 daily on two desktops and a server. I use the two desktops for development, email, and word processing using Open Office. I have encountered a few issues which I will share in a future post.

Thursday, December 10, 2009

Debugging Data Corruption #1

I recently helped someone debug a data corruption problem on an embedded system that was not RTEMS based and that I had not written any code for. Just to be sure this is perfectly clear, I knew nothing about the source code to this application or its design when I arrived. They had a global variable whose value was becoming invalid. For the purposes of this discussion, the corrupted variable will be referred to as CorruptedVariable. They had a way to reproduce the corruption and could tell from the behaviour of the program that at least CorruptedVariable was being clobbered.

When I arrived, I listened for a few minutes and watched them reproduce the problem. I then asked if they knew how to produce a numerically sorted symbol table for their application. They did not, so I explained that any GNU toolset using binutils includes a utility named nm which produces a symbol table from an executable. I instructed them to use a command similar to the following:


avr-nm -g -n application >application.num


The -g option requests that the symbol table include only global symbols and the -n option requests that the symbol table be sorted by address. We then looked at application.num in a text editor and I had them search for CorruptedVariable.
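The point of sorting by address is that whatever sits just below a clobbered variable becomes the first suspect. That lookup can be sketched in C. The table passed in stands in for the contents of application.num, and symbol_below is a hypothetical helper, not part of binutils.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One line of nm output, reduced to the two fields used here. */
struct symbol {
    unsigned long address;
    const char   *name;
};

/* qsort comparator: ascending address, the order nm -n produces. */
static int by_address(const void *a, const void *b)
{
    const struct symbol *sa = a;
    const struct symbol *sb = b;

    return (sa->address > sb->address) - (sa->address < sb->address);
}

/*
 * Sort the table by address and return the symbol located immediately
 * below `name`, or NULL if `name` is missing or is the lowest symbol.
 */
const struct symbol *symbol_below(struct symbol *table, size_t count,
                                  const char *name)
{
    assert(table != NULL && name != NULL);

    qsort(table, count, sizeof table[0], by_address);

    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].name, name) == 0)
            return i > 0 ? &table[i - 1] : NULL;
    }
    return NULL;
}
```

Calling symbol_below repeatedly walks down through memory one neighbour at a time, which is the same thing as reading upward from the match in the text editor.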

I then looked at the variables immediately before (i.e. at the lower addresses immediately preceding) CorruptedVariable in memory. The variable immediately preceding CorruptedVariable was a 32-bit integer (RandomVariable), so it was unlikely that accesses to it were the culprit. But the variable before it had buffer in the name. That was a big hint and I asked to see the corresponding source code. The C source code where the CorruptedVariable was declared was similar to the following:


float TooShortBuffer[8];
int RandomVariable;
int CorruptedVariable;


So TooShortBuffer was indeed an array and not a structure. For the purposes of this bug, that was awesome. It meant that using an index greater than 7 (the last valid index of an 8-element array) would overwrite the variables immediately after it in memory. And we had confirmed which variables would be overwritten.
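To make the arithmetic concrete, here is a small sketch in C that computes where an out-of-range element would land without ever performing the bad write. It models the three globals as members of one struct, which is an assumption: a linker is free to place real globals differently, which is why the symbol table had to be checked first. The struct, the enum, and victim_of_index are illustrative names only, and the results in the note after the code assume 4-byte float and int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUFFER_ELEMENTS 8

/* Stand-in for the three globals, laid out in declaration order. */
struct globals {
    float TooShortBuffer[BUFFER_ELEMENTS];
    int   RandomVariable;
    int   CorruptedVariable;
};

enum victim {
    IN_BUFFER,               /* index is valid, nothing is clobbered */
    HITS_RANDOM_VARIABLE,
    HITS_CORRUPTED_VARIABLE,
    PAST_ALL_THREE
};

/*
 * Report which member the first byte of TooShortBuffer[index] falls on.
 * Only offsets are computed; nothing is written out of bounds.
 */
enum victim victim_of_index(size_t index)
{
    /* Guard the multiplication below against overflow. */
    assert(index < SIZE_MAX / sizeof(float));

    size_t offset = offsetof(struct globals, TooShortBuffer)
                  + index * sizeof(float);

    if (offset >= sizeof(struct globals))
        return PAST_ALL_THREE;
    if (offset >= offsetof(struct globals, CorruptedVariable))
        return HITS_CORRUPTED_VARIABLE;
    if (offset >= offsetof(struct globals, RandomVariable))
        return HITS_RANDOM_VARIABLE;
    return IN_BUFFER;
}
```

Under those size assumptions, index 8 lands on RandomVariable and index 9 lands on CorruptedVariable, so a loop that runs even two elements too far is enough to produce the corruption described here.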

I asked what TooShortBuffer was used for and one of the team mentioned that they had recently added something and might have forgotten to extend all the arrays in the program. As a quick test, I asked them to add 100 elements to that array. Not surprisingly the test scenario now worked.

I recommended to the team that the array not be sized with a hard-coded number but with a macro defined to the maximum number of elements. In addition, I recommended that any code using an index to access this array first validate that the index is within range. Self-checking programs are so much easier to debug.
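A minimal sketch of that recommendation in C follows. The names SAMPLE_BUFFER_MAX, SampleBuffer, and the two accessors are hypothetical, since the team's real code is not shown here. The idea is that the capacity lives in exactly one macro and every access goes through a check against it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical name: the single place where the capacity is defined. */
#define SAMPLE_BUFFER_MAX 8

static float SampleBuffer[SAMPLE_BUFFER_MAX];

/*
 * Store value at index. Returns false, and writes nothing, when the
 * index is out of range, so a bad index cannot reach the neighbours.
 */
bool sample_buffer_set(size_t index, float value)
{
    if (index >= SAMPLE_BUFFER_MAX)
        return false;
    SampleBuffer[index] = value;
    return true;
}

/* Read an element; the caller must already know the index is valid. */
float sample_buffer_get(size_t index)
{
    assert(index < SAMPLE_BUFFER_MAX);
    return SampleBuffer[index];
}
```

Returning a status from the setter lets production code report the bad index and keep running, while the assert in the getter stops a debug build at the exact point of misuse. Which behaviour is right depends on the system; the sketch only shows both options.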

I spent less than an hour on the customer site and left them with a smile.

In my next post, I will show how gdb can be used to help locate the source of stray writes.

PC386 Memory Size Bug

Another day, another bug. At least this one was interesting. The issue was initially reported as Ada programs compiled for the pc386 BSP running on the Qemu simulator but not on the real embedded PC we were using. Since GNAT/RTEMS toolsets are built from source and users do not get the luxury of using the RPMs, my first suspicion was an Ada-specific issue. But after a few days of thinking on this, it occurred to me that the code was locking up long before any Ada code ran. I stripped the test case down to C only and added a dummy method for gnat_main(). This used the exact RTEMS application configuration and initialization that the broken Ada program did. This program failed in exactly the same way. Feeling the rush of some success, I did a binary search to determine which configuration setting was making this fail. After a few attempts, I discovered that when the application configured zeroing out the RTEMS Work Area, the board locked up. Eureka!
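The binary search over configuration settings can be sketched in C as below. This is an illustration under stated assumptions, not the procedure actually scripted for this bug: it assumes the settings can be enabled cumulatively in a fixed order, that a run with no settings enabled passes, and that exactly one setting causes the failure. bisect_settings and simulated_run_fails are hypothetical names; in practice the callback would rebuild and rerun the test program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Callback: build and run with only the first `enabled` settings on. */
typedef bool (*fails_with_fn)(size_t enabled, void *context);

/*
 * Return the index of the setting that triggers the failure, or `count`
 * if the run passes even with every setting enabled.
 * Assumes one culprit: a run fails exactly when it is among those enabled.
 */
size_t bisect_settings(size_t count, fails_with_fn fails_with, void *context)
{
    assert(fails_with != NULL);

    if (count == 0 || !fails_with(count, context))
        return count;

    size_t passes = 0;      /* known good: first `passes` settings on */
    size_t fails  = count;  /* known bad:  first `fails` settings on  */

    while (fails - passes > 1) {
        size_t mid = passes + (fails - passes) / 2;

        if (fails_with(mid, context))
            fails = mid;
        else
            passes = mid;
    }
    return fails - 1;       /* the last setting added before it broke */
}

/* Stand-in for a real test run: context points at the culprit's index. */
bool simulated_run_fails(size_t enabled, void *context)
{
    size_t culprit = *(const size_t *)context;

    return culprit < enabled;
}
```

Each iteration halves the range between the last known-good and first known-bad configuration, so even a few dozen settings need only a handful of rebuilds.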

With that piece of insight, I began to focus on the code that zeroed memory. I looked at the starting address and length. The BSP initialization claimed there was 1007MB of RAM available on the board. It looked fine given the board's configuration. I decided to try deliberately lowering the amount of RAM RTEMS knew about. Various low values worked and I tried 1004MB. That worked. I suddenly realized that some memory was reserved for some purpose. I went away to think since I had a workaround.

I looked up the manual for the motherboard's chipset and, lo and behold, I had forgotten that the video controller shared RAM from the main RAM. We needed to stay away from that area. I posted to the RTEMS mailing list and it was suggested that we should trust the RAM size provided by the multiboot information, NOT the amount determined by our dynamic sizing probe. Looking at the 4.9 release branch, it turned out that that code did give priority to the multiboot information.
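The priority rule is simple enough to sketch in C. This is not the pc386 BSP code; choose_ram_size_kb and the trimmed-down structure are written for illustration. What is taken from the Multiboot specification is that bit 0 of the flags word marks mem_lower and mem_upper as valid, and that mem_upper is the number of kilobytes of memory starting at the 1 MB mark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MULTIBOOT_INFO_MEMORY 0x00000001u  /* flags bit 0: mem_* valid */

/* Only the leading fields of the Multiboot information structure. */
struct multiboot_memory_info {
    uint32_t flags;
    uint32_t mem_lower;  /* KB of memory below 1 MB */
    uint32_t mem_upper;  /* KB of memory starting at 1 MB */
};

/*
 * Pick the RAM size in KB: the boot loader's figure when it supplied
 * one, otherwise whatever the BSP's own sizing probe found.
 * info may be NULL when the program was not started by a Multiboot loader.
 */
uint32_t choose_ram_size_kb(const struct multiboot_memory_info *info,
                            uint32_t probed_kb)
{
    if (info != NULL && (info->flags & MULTIBOOT_INFO_MEMORY) != 0) {
        /* Guard the addition below against wrapping around. */
        assert(info->mem_upper <= UINT32_MAX - 1024u);
        return 1024u + info->mem_upper;  /* first 1 MB plus upper memory */
    }
    return probed_kb;
}
```

The probe result is used only as a fallback, so memory that the firmware has set aside, such as RAM shared with a video controller, stays out of the range RTEMS will touch as long as the boot loader reports it correctly.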

I compared the two pieces of code and quickly came to the conclusion that updates had broken this code in a subtle way. A little restructuring and clean up and the problem was resolved. But it had been a long and circuitous path from initial report to problem resolved.

What are the lessons to learn?

In this case, there are two lessons to take away. First, a problem may not be what it first seems to be. My first impression was that this was an Ada-related issue, not a very low-level BSP-specific issue.

The second lesson is that divide and conquer is a very good strategy. You need to narrow down the problem space so it is possible to track down the specific issue.

First Blog Post

Every blog has to have a first post and this blog is no different.

I have considered having a blog for a while but never got around to it. I don't know if anyone cares to read it or not. I am the maintainer of the free real-time operating system RTEMS (http://www.rtems.org) and a member of the GNU Compiler Collection (GCC) Steering Committee (http://gcc.gnu.org). I plan to ramble on free software activities that I am involved in, interesting bugs, RTEMS improvements, etc. I may also ramble about design patterns for use in embedded systems, performance analysis, and GNU/Linux.

This is a modest beginning to what I hope will be a blog containing interesting posts.