leah blogs: Learning Regexp Optimization the Hard Way

Playing a bit with Nukumi2, I discovered a very interesting thing: If I disable BlueCloth, my pages suddenly render a lot slower. “This ain’t possible”, I stumbled. BlueCloth is fairly fast, but still the slowest component of the rendering stage. “What the hell is going on?”

I timed the rendering of a few entries by themselves, they were all faster than with BlueCloth. Well, all except one. Yesterday’s. I trimmed it a bit, and finally found out that if I removed that part, it would run at full speed again:

class Class
  def attr_inject(attribute)
    (@__attributes_to_inject ||= []) << attribute
  end
end

I ran a profiler (ruby -r profile), and it turned out String#scan was taking quite long… Where is it used? In RubyPants, look at the code:

def tokenize
  tag_soup = /([^<]*)(<[^>]*>)/

  tokens = []

  prev_end = 0
  scan(tag_soup) {
    tokens << [:text, $1]  if $1 != ""
    tokens << [:tag, $2]

    prev_end = $~.end(0)
  }

  if prev_end < size
    tokens << [:text, self[prev_end..-1]]
  end

  tokens
end

Ok, it’s a part of the HTML tokenizer… The only reason scan can be slow is the regular expression, let’s have a closer look:

tag_soup = /([^<]*)(<[^>]*>)/

Well, it looks alright: First match any preceding text and then a tag. After toying around for fifteen minutes with that regexp, trying various tweaks (e.g. making it non-greedy), I built a proper test case. That string took very long, ~6 seconds to parse it 100 times:

&lt;hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh...

Even more h to come. I remembered a scary issue about regular expressions: Under certain circumstances, backtracking can make Regexps run in exponential time (various string sizes showed that) because they try every possible match.

It really looked like that was the case… But where the hell does it backtrack? There are no alternations, and the match should run linearly. Well, it should. But it didn’t. It turned out that it really tries every possible match, because the match isn’t anchored.

D’oh! Of course, there is no anchor. Luckily, I remembered \G from back when I used Perl. man perlre says:

       \G  Match only at pos() (e.g. at the end-of-match position
           of prior m//g)

Hey, cool. That should be exactly what we need. Make that line simply

  tag_soup = /\G([^<]*)(<[^>]*>)/

and off it goes! Blazingly fast. :-)

Who would have thought two characters can make such a big difference?

(Of course, I’d have never discovered that “bug” if I only ran it on correct input (as >> is not valid HTML), so test your code with “invalid” data too!)

NP: Sum 41—We’re All To Blame

leah blogs

17nov2004 · Learning Regexp Optimization the Hard Way