Monday, May 14, 2012

Using sed to Remove CVS Ids from RTEMS

Over the Christmas break, the RTEMS Project converted from CVS to git. We have all made mistakes during the transition and will admit that we are still learning, but our workflow is improving and we are making fewer mistakes. However, there are still a number of outstanding tasks left over from the transition:

  • Remove CVS $Id$ strings
  • Move away from ChangeLog files and define a new "history" file which lists major changes
  • Convert the release procedure to git

Since it has been well over four months since we initiated the conversion to git, I decided it was time to make a push at removing the Ids. Being comfortable with bash and sed, I decided to see how many of these I could remove without editing a file by hand. This turned out to be harder than I expected.

First, there are many types of files in RTEMS, and the comment structure varies accordingly. There are C-style comment blocks, C++ one-line comments, "# to end of line", "; to end of line", etc.

Second, even within a single language, there were differing CVS Id string formats. For example, I expected most Id strings in C code to be at the end of a comment block like this:


 *
 *  $Id$
 */



But sometimes a file would have something like this:


 *  end of a comment paragraph
 *
 *  $Id$
 *
 */



In the above case, removing the $Id$ line and the preceding line would leave an undesired blank line at the end of the comment block.

I implemented the removal as a set of sed transformations. They were applied in order, so that the longest sequence I had identified was matched first, followed by the transformations which matched shorter sequences. On top of that, there were limits on how many transformations could occur inside a single invocation of sed. It was conceptually easier to take the output of one set of sed transformations and then apply another set to it.

The following sed command file dealt with removing the $Id$ and the preceding comment line:

# Remove CVS Ids which are more or less like this (embedded C, preceding)
# ^ *
# ^ *  $Id
/^ \* *$/{
 N
 #
 /^.*\n \* *\$Id.*/d
 /^.*\n \* *@(#) \$Id.*/d
}

Note that the code sometimes had " * @(#) $Id$" instead, and this single sed command file properly removes both patterns.
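
To see the command file in action, here is a minimal sketch that writes it to a temporary file and runs it over a small sample comment block. The file name "remove_id_preceding.sed" and the function name "strip_id_preceding" are made up for this illustration and are not the names used in the actual cleanup. The sketch also assumes a sed (such as GNU sed) that accepts \n in a regular expression to match the newline added by N.

```shell
#!/bin/bash
# Illustration only: the file and function names here are hypothetical.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# The command file from the post, reproduced without its comment lines.
cat > "$workdir/remove_id_preceding.sed" <<'EOF'
/^ \* *$/{
 N
 /^.*\n \* *\$Id.*/d
 /^.*\n \* *@(#) \$Id.*/d
}
EOF

# Filter standard input through the command file.
strip_id_preceding() {
  sed -f "$workdir/remove_id_preceding.sed"
}

# Sample input: an empty comment line followed by the Id line.
printf '%s\n' '/*' ' *  Sample comment text' ' *' ' *  $Id$' ' */' |
  strip_id_preceding
```

For this sample, the empty comment line and the Id line are both dropped, leaving the opening line, the sample text line, and the closing " */".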

The solution turned out to be a set of sed command files and a shell script. The shell script invoked sed multiple times in a pipeline. Each stage in the pipeline performed a different transformation, and its output was fed into another invocation of sed with a different command file.

In the final step, if the first line of the file had ended up as a blank line because of the transformations, I removed it with this sed command file:


# Remove the first line when it is blank
1{
 /^$/d
}
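
Putting the pieces together, the staged approach can be sketched as a three-stage pipeline. This is not the actual script, which had 8 stages and 28 transformations. The stage file names and the function name "strip_ids" are hypothetical, and stage 1 is only a guess at how the longer sequence (empty comment line, Id line, empty comment line) might be matched. Stages 2 and 3 reuse the two command files shown in this post.

```shell
#!/bin/bash
# Sketch of a multi-stage sed pipeline. Stage file names are hypothetical,
# and stage 1 is an assumed pattern, not one taken from the real script.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Stage 1 (assumed): empty comment line, Id line, empty comment line.
cat > "$workdir/stage1.sed" <<'EOF'
/^ \* *$/{
 N
 /\n \* *\$Id/{
  N
  /\n \* *$/d
 }
}
EOF

# Stage 2 (from the post): empty comment line followed by the Id line.
cat > "$workdir/stage2.sed" <<'EOF'
/^ \* *$/{
 N
 /^.*\n \* *\$Id.*/d
 /^.*\n \* *@(#) \$Id.*/d
}
EOF

# Stage 3 (from the post): remove the first line when it is blank.
cat > "$workdir/stage3.sed" <<'EOF'
1{
 /^$/d
}
EOF

# Longest sequence first; each stage reads the previous stage's output.
strip_ids() {
  sed -f "$workdir/stage1.sed" |
    sed -f "$workdir/stage2.sed" |
    sed -f "$workdir/stage3.sed"
}

# Sample input: the longer form with a trailing empty comment line.
printf '%s\n' '/*' ' *  end of a comment paragraph' ' *' ' *  $Id$' ' *' ' */' |
  strip_ids
```

When stage 1 does not find the trailing empty comment line, it passes the text through unchanged, so the shorter sequence is still there for stage 2 to remove. That is the practical benefit of feeding one invocation of sed into the next.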

In the end, I had 28 sed transformations in an 8-stage pipeline. The scripted edits produced a 2.7 megabyte diff which modified 6376 files. I was left with 117 files to edit by hand and only one file mishandled by the script. The changes can be viewed at http://git.rtems.org.

Needless to say, I compiled a lot during this effort, because I wanted to make sure that every file I had touched was compiled. In doing this, I learned that a number of odd configurations in RTEMS had not been built recently and were, in fact, broken before I started.
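
Alongside compiling, a quick way to check that no Id strings were left behind is to search the tree for them. The post does not describe how the leftover files were found, so the helper below is only one plausible approach. The function name "list_remaining_ids" is hypothetical, and the sketch assumes a grep that supports --exclude-dir, as GNU grep does.

```shell
#!/bin/bash
# Hypothetical helper: list files under a directory that still contain "$Id".
list_remaining_ids() {
  # -r recurses, -l prints only file names; git's own metadata is skipped.
  grep -rl --exclude-dir=.git -e '\$Id' "$1" | sort
}
```

Run from the top of a source tree as "list_remaining_ids ." to get the list, or pipe the result through "wc -l" to get a count of files that still need attention.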

If anyone is interested, I can make the shell script and the set of sed scripts available, with the disclaimer that they were not written to be classroom sed examples. They are understandable but probably neither optimal nor perfect. They were written for a one-time massive edit and will only be reused to remove the CVS Id strings from other modules owned by RTEMS.

This could certainly have been done with other tools, including awk and Python. But I knew sed and bash, and using what you know is often the most effective way to do the job.
