chris blogs

May 2005

24may2005 · Parsing Vooly in C

Yesterday I started writing a Vooly parser in C, mainly for reasons of speed and to provide bindings for lots of other languages without needing to reimplement the exact (yet unfinished) specification.

Before I did that, I extended Vooly with a new mechanism for escaping. Until now, it was not possible to write << and >> as text in Vooly, because those sequences were delimiters. With the new variable length terminators, you can easily do that by just using more angle brackets, for example:

<<note
To bitshift in C, use code like:
  <<<code foo = bar >> 2 >>>
or, to shift left:
  <<<code foo = bar << 2 >>>.
Pretty easy, no?
>>

Since the code blocks start with <<<, you need >>> to close them again (and at least <<< to open a new tag inside).

The “fun” thing about this addition was that it totally crushed by old Ruby parser that was based on StringScanner, and about 30 lines long. It turns out that the syntax is not regular anymore, and therefore is hard to parse using ordinary regular expressions.

All my Vooly parsers are pull parsers because these are comparatively easy to implement (push can be easier, sometimes), but also pleasant to use and quick.

Now, this is the way I wrote the C implementation: First, I rewrote the Ruby parser until it could run the old test-suite I had for the previous parser. Then, I added tests for variable length terminators and… I created a monster. I actually wrote a parser that would dynamically generate and compile regular expressions, keep them on a stack, and match them against the input string.

Seriously, I never wrote such a weird parser. Look at some snippets:

TEXT_REGEXP = Hash.new { |hash, key|
  hash[key] = /.+?(?=<{#{key.size}}|(>{#{key.size}}))/m
}
    ...
    elsif @scanner.scan(@openrx.last)
      @textrx << TEXT_REGEXP[@scanner[1]]
      ...
    elsif @scanner.scan(@closerx.last)
      ...
      @textrx.pop
    elsif @scanner.scan(@textrx.last)
      ...
      @text = @scanner.matched
    ...

It was among the most craziest code ever, but it told me my idea worked as I wanted. It was about 60 lines of Ruby. Still, if I were to implement it like that in C, I’d probably have gone crazy. You maybe should know that I didn’t do (serious) C in a long time, and I’m totally spoiled by Ruby’s string handling, garbage-collection etc.

Another, even more serious problem was that my nice parser only read from Strings (due to StringScanner), but I also wanted to read from streams (IO). Therefore, rewrite! Just too good I have a fairly complete test suite…

I reimplemented the parser in Ruby rather quickly; according to SLOCCount, it is whole 108 lines of Ruby as now, fully featured. I tried hard to use not very Ruby-specific features; the code doesn’t use regular expressions for parsing at all, it’s just a simple state machine (oh I love those :-)) with five states. Also, I don’t use a lot of OO features, only instance variables. The rest is written in a plain old procedural style. While it surely isn’t the most elegant Ruby I ever wrote, this is extremely helpful for the next step.

Let’s see what features and methods of Ruby we use:

  • String, Array, Integer, Symbols, true, false, nil
  • One user-defined class with instance variables
  • String#strip!, String#[], String#empty?, String#<<, one trivial substitution (@text.sub! /\s$/, '')
  • IO#getc, IO#eof?, IO#ungetc
  • Array#last, Array#pop, Array#size, Array#<<

That was it. And, believe me, that part of Ruby was very easy to write in C. I basically could go over the source and translate it line by line. Symbols would get enums, my class would get a struct, the instance variables its fields. String#strip! can be written in 5 lines of code, String#<< needs a bit help because of the memory management, but it’s not too bad. The IO methods map directly to the C standard library, and the Array stuff can be written using a statically allocated array and a “head pointer”. Just as you’d usually implement stacks in C.

I think I wrote the guts of the C parser in about one and a half hour. I ported some test cases over, and they ran fine. I estimate the total time to port the second Ruby version to C to about five hours, including documenting the API. The C version is, as of now, 275 lines of code according to David A. Wheeler’s SLOCCount. Not very much, in the end.

If I were to write the C parser directly, I fear I would have lost track early, have no chance to test it easily, and if my idea wouldn’t have worked, it would have wasted a lot of time. Clearly, I wrote lots of “superfluous” code during the port, but it helped me getting a good understanding of the code and its working.

Using a more flexible language just allows for much shorter turnaround times and makes development a lot more fun. I do rather see 30 “undefined method foo” exceptions than three “segmentation faults” per hour. Debugging Ruby is just a lot easier. I debugged the Ruby versions using p and testcases only, but I would have wasted a lot of time if I didn’t use gdb for checking some things in the C version. Lucky me, I didn’t need to open gdb at lot.

All in all, I wrote a piece of non-trivial C code for a new idea by

  1. Implementing/extending a test suite in Ruby.
  2. Implementing a new parser using complex Ruby without limitations.
  3. Testing the implementation and extending the testcases for the new feature.
  4. Implementing the new feature.
  5. Rewriting the parser in “low-level” Ruby until it passes the test suite.
  6. Porting the low-level parser to C.
  7. Testing and documenting the C parser.

Every step is clearly defined, and the list “just” can be followed sequentially. Should any unexpected things happen, feel free to backtrack until you get everything working.

I am very confident with this way of developing efficient and short C code, and will likely do it that way in the future again, should (there will…) be need for it.

NP: Dan Bern—Mars

Copyright © 2004–2013