leah blogs

December 2004

28dec2004 · Push vs. Pull

Push me, pull me… or just pull me out.
— Pearl Jam, Push Me, Pull Me

Currently, there are three popular ways of parsing XML: The DOM, the SAX and Pull APIs.

DOM is probably the most well known, and it did a great job showing off XML’s slowness. This is because DOM reads the full XML into the core as a tree structure. The nodes of the tree mainly consist of Element nodes and Text nodes. DOM is very easy to use, but also very inefficient. Ruby’s REXML is mainly used in the DOM mode; although it’s API is rubyfied and made a lot nicer, the core idea is the same: provide a tree structure to XML documents.

The downsides of DOM (and the complexity of implementation) turned out pretty quickly, and soon the Simple API for XML (SAX) was implemented in Java. SAX is a streaming API, the SAX parser scans over the document and emits events. For example, the simple document

<doc>This is a <em>very</em> simple XML document.</doc>

emits these events (simplified):

start_element "doc"
text "This is a "
start_element "em"
text "very"
end_element "em"
text " simple XML document."
end_element "doc"

Usually, the program defines a new class whose methods get called by the SAX parser.

SAX is very efficient (the API is quite like the Expat one, which is well-known being the fastest XML parser on earth), but can be amazingly complex to use. (Personally, I don’t have a hard time with it—at least for easy stuff—, but there are people that don’t like state machines.)

REXML also provides a SAX interface, which is quite fast, but of course not like the real thing. If you want absolute performance, I still can only recommend the Ruby Expat bindings. Though the binding was not updated recently (my version is from 2002), it runs like hell.

The last, and the IMO most interesting way to parse XML is a pull API. As usual, the first implementation was written in Java (XPP1, I think) and the API was only informally (i.e. by implementation) defined. The XmlPull API changed that. For Java, there now exist a bunch of XML pull libraries that can all be accessed using the same way.

REXML provides a Pull API, too; but has been marked experimental and they don’t recommend production use. Besides, it has problems with namepaces (which is true for REXML in general. :-/).

Although being technically one step lower than SAX, Pull APIs are much more intuitive to program. The Pull API requires you to fetch each event by yourself, therefore you don’t need a state machine as you can simply get the next events when you need them, not when they arrive. This can be shown best at a simple example:

My old homepage provided an index facility, where words marked in idx and hidx (for hidden index) are aggregated into a special page. I implemented the program first using REXML and its DOM API. The page to link to can be found out getting the name attribute of the page root node. I basically needed an index word to page hash. Here is the parsing code:

doc = Document.new(File.open(f))
htmlfile = XPath.first(doc, "/page/@name").to_s

["//idx", "//hidx"].each { |xp|
  XPath.each(doc, xp) { |child|
    word = child.text.squeeze(" ").strip
    index[word] ||= []
    index[word] << htmlfile
  }
}

This program was a total performance hog, and intolerable to run (for me, at least). I rewrote it using expat:

class IdxParser < XML::Parser
  attr_accessor :index

  def initialize(encoding = nil, nssep = nil)
    @save = false
    @word = ""
    @htmlfile = ""
  end

  def startElement(nsname, attr)
    ns, name = nsname.split(";", 2)

    if ns == $IDXNS && (name == "idx" || name == "hidx")
      @save = true
      @word = ""
    elsif name == "page" && attr["name"]
      @htmlfile = attr["name"]
    end
  end

  def endElement(nsname)
    ns, name = nsname.split(";", 2)

    if ns == $IDXNS && (name == "idx" || name == "hidx")
      @save = false
      @word = @word.squeeze(" ").strip
      @index[@word] ||= []
      index[@word] << @htmlfile
    end
  end

  def character(t)
    @word << t if @save
  end
end

This program works very fast, but it was a bitch to write and lots longer, of course. For complex formats, SAX gets darn complicated, trust me.

Finally, the elegant pull API (this implementation uses REXML’s Pull API which has a bit weird API):

parser = REXML::Parsers::PullParser.new(io)
htmlfile = nil

parser.each { |event|
  if event.start_element? && event[0] == "page"
    htmlfile = event[1]["name"]
  end

  if event.start_element? && event[0] =~ /idx:h?idx/
    word = parser.squeeze(' ').strip
    (index[word] ||= []) << htmlfile
  end
}

Since the REXML Pull API didn’t fully satisfy my needs, I decided to rewrite a pull parser for ruby from scratch (Heck, I even spent a day hacking it in pure Ruby, but soon gave up.), based on the Pull API exposed by libxml2. I think above example will read something like this when it’s done:

parser = FastPull.new(io)
htmlfile = nil
while parser.pull
  if parser.start_element? && parser.name == "page"
    htmlfile = parser.attributes["name"]
  end

  if parser.start_element? && parser.name =~ /{#{$IDXNS}}h?idx/
    word = ""
    while parser.pull.text?
      word << parser.text
    end

    (index[word.squeeze(' ').strip] ||= []) << htmlfile
  end
}

Now, you have the easiness of DOM and the speed of SAX. :-) (And I can look at XML again.)

NP: Pearl Jam—Push Me, Pull Me

Copyright © 2004–2022