leah blogs: Pulling from Streams

It is interesting that in the “real world”, most things go easier into one direction than in the other. For example, it is easier to break a coffee cup than to glue it together again. I think this has something to do with entropy.

Computer programs are not unaffected by this fact, take for example the two styles of on-the-fly XML parsing, Push vs. Pull. Translating a Pull API into a Pull API is done trivially, such code usually looks like that:

while parser.pull
  case parser.state
    when :open
      receiver.open parser.text
    when :text
      receiver.text parser.text
    when :close
      receiver.close
  end
end

Doing the reverse—converting a Push API into a Pull one—looks impossible at first, but that is not true. It can be done (albeit only in limited ways in Ruby, see below) using our beloved continuations.

I have written a small class, Stream2Pull that does this crazy task:

class Stream2Pull
  attr_reader :data

  def initialize(&block)
    @data = nil
    @continuation = lambda { instance_eval &block }
  end

  def method_missing(*arguments)
    @data = arguments
    @again = true
    switch
  end

  def pull
    @again = false
    switch
    @again
  end

  def switch
    save = @continuation
    callcc { |@continuation| save.call }
    save = nil
    GC.start
  end
end

Stream2Pull takes a block it will call on itself on the first execution of pull. This code is expected to send undefined messages to the instance. Then the “magic” happens: With the call of switch, the context of the program changes from method_missing to the call of pull, which then returns. Now the caller can examine the undefined message—the event—by looking at data. Let’s wrap REXML’s StreamParser that way:

s_to_p = Stream2Pull.new {
  REXML::Parsers::StreamParser.new(<<XML, self).parse
<?xml version="1.0"?>
<info>abc<em>hehe</em></info>
}

Now, we can use our new object just like a pull parser:

while s_to_p.pull
  p s_to_p.data
end

We just repeatedly pull from the parser, thereby always switching contexts between the inner loop of the StreamParser and ours. These are the events that are successfully emitted:

[:xmldecl, "1.0", nil, nil]
[:tag_start, "info", {}]
[:text, "abc"]
[:tag_start, "em", {}]
[:text, "hehe"]
[:tag_end, "em"]
[:tag_end, "info"]
[:text, "\n"]

We just have converted a Push API into a Pull API!

Now, we can talk about the dark side of the medal. Ruby’s continuations are very slow and they eat loads of memory. When I tried parsing a 1.5 megabyte file—the largest one I could find on my disk, it is actually the XSL specification—I didn’t yet have the GC.start inside switch. In about 2 minutes, my Ruby gobbled 400 megabytes of memory! After enforcing garbage-collection, it took less memory but memory usage was still raising constantly. After 15 minutes and 100 megabytes used I gave up.

These are the things I really want to see in (I hope not too far away) future Ruby 2 implementations: Efficient tail calls, continuation passing style, and possibly support for partial continuations. If you implement CPS, you can get the others almost for free. Those features would make above code run at about the speed of a “native” pull parser. Alternatively, please port the Ruby standard library to Scheme. }}:-) Just kidding.

If you just want to parse XML, it’s all not too bad, actually. This is because REXML’s StreamParser *is* actually a wrapper around a PullParser (REXML::BaseParser)! (Of course, I knew that in advance.) Therefore, we can easily write:

bp = REXML::Parsers::BaseParser.new(ABOVE_XML)
while bp.has_next?
  p bp.pull
end

And it will generate exactly the same output as above. :-) The main difference is, well, it uses constant memory and only needs 15 seconds to parse the big XML document. And that’s not too bad for a pure Ruby implementation.

NP: Bob Marley—Three Little Birds

leah blogs

26may2005 · Pulling from Streams