The Making of Entrez Gene parsers in Perl using
Parse::RecDescent, Parse::Yapp, Perl-byacc and Regex

Purpose: Make an Entrez Gene parser in Perl

The National Center for Biotechnology Information (NCBI) completes the transition from Locuslink to Entrez Gene on March 1st, 2005. For Locuslink users this poses several problems:

There is currently no publicly available Perl parser for Entrez Gene annotations.
NCBI does not provide XML-formatted Entrez Gene file for download.
NCBI's Eutils provides XML-formatted Entrez Gene records but the format is both huge and awkward.
More importantly, the web-only eutils is not a feasible solution for processing whole genomes.
The only free Entrez Gene parser is in the NCBI toolbox, but it has a steep learning curve and is available only in C/C++.
Entrez Gene record is in ASN.1 text format, which has practically no support in cpan modules or anywhere.

Solution: Well, we'll just have to build our own Entrez Gene parsers in Perl. Why in Perl? Because:

Perl is the most popular programming language among Bioinformatics researchers.
An object-oriented Perl parser would be easy to use and maintain
Perl parsers could interface well with the vast number of public Bioinformatics tools (e.g. Bioperl and EnsEMBL) and any in-house tools developed in Perl.
Pure Perl parsers support all the operating systems that Perl has a virtual machine for and require no effort of porting.
If we put strong emphasis on performance optimizations the parser should work out well

Tools: Regular Expression or YACC(-like) Perl modules

So we decided to basically build a parser that abstract a data structure from a given Entrez Gene record, and users of the parser can decide what to do/extract from the data structure. For this purpose, we consider several automatic parser generators as well as the venerable Perl5 regular expressions. The tools we considered are: Parse::RecDescent, Parse::Yapp, Perl-byacc, Text::Balanced and Regex. Text::Balanced was eliminated since any approach using this module would still do the bulk of the work in regex and it is also somewhat redundant since Parse::RecDescent uses this module as well.

Gory details of using the tools in a practical project

Parse::RecDescent

I chose this module to write my first EntrezGene parser because of a simple reason: I've heard of it before, and popularity almost always comes from good traits such as easy to use and debug, good documentation, etc.

Turns out that it's a good choice - it has by far the most detailed documenation among the tools. While it takes time to read and assimilate the dozens of pages of documentation, once you're done with the documenation, you can almost immediately write your parser, which was what I did.

First grammar for my first parser:

startrule: 'Entrezgene' '::=' block { $item[3] }
block: '{' row(s) '}' { ::convert(\@item) }
row: unit ',' { $item[1] } | unit ...'}' { $item[1] }
unit: name value { $return = {$item[1] => $item[2]} } |
      value { $item[1] }
name: identifier { $item[1] }
value: block { $item[1] } |
       '"' /[^"]*/ '"' { $item[2] } |
       name value { $return = {$item[1] => $item[2]} } |
       identifier { $item[1] }
identifier: /[A-Za-z0-9_-]+/ { $item[1] }

Not a bad first try. It works, and builds a nice data structure for me, thanks to the convert() function I write (can be found in Util.pm). The only problem: the performance was bad - it takes 72 seconds to parse Entrez Gene record #1, which is a moderately size entry. Hmm, how do we optimize it?

Simplify the grammar through reduction in rules and productions probably would help. Indeed, the 3rd version yield a parser that takes only 30 seconds to parse record #1:

startrule: 'Entrezgene' '::=' block { $item[3] }
block: '{' row(s?) endrow '}' { ::convert(\@item) }
row: unit ',' { $item[1] }
endrow: unit ...'}' { $item[1] }
unit: /[A-Za-z0-9_-]+/ value { $return = {$item[1] => $item[2]} } |
      value { $item[1] }
value: block { $item[1] } |
       '"' /[^"]*/ '"' { $item[2] } |
       /[A-Za-z0-9_-]+/ .../[,}]/ { $item[1] } |
       /[A-Za-z0-9_-]+/ value { $return = {$item[1] => $item[2]} }

After a few more productions were squeezed out, it takes only about 22 seconds to parse record #1:

startrule: 'Entrezgene' '::=' value { $item[3] }
value: '{' row(s?) unit '}' { ::convert(\@item) } |
       /[A-Za-z0-9_-]+/ .../[,}]/ { $item[1] } |
       '"' /[^"]*/ '"' { $item[2] } |
       /[A-Za-z0-9_-]+/ value { $return = {$item[1] => $item[2]} }
row: unit ',' { $item[1] }
unit: /[A-Za-z0-9_-]+/ value { $return = {$item[1] => $item[2]} } |
      value { $item[1] }

Now this optimization step gave a huge gain of about 15 fold from the version above! Now it takes only about 1.5 seconds to parse record #1. The reason for such big improvement? By moving ',' into rule (unit) the rule (row) is gone, reducing unexpectedly large number of backtracking during the recursive descent execution of the LL-grammar based Parse::RecDescent. Of course, my moving '"' /[^"]*/ '"' up helped speed as well because Parse::RecDescent check productions one by one in this order. So move the faster productions up won't hurt

startrule: 'Entrezgene' '::=' value { $item[3] }
value: '{' unit(s) '}' { ::convert(\@item) } |
       '"' /[^"]*/ '"' { $item[2] } |
       /[A-Za-z0-9_-]+/ .../[,}]/ { $item[1] } |
       /[A-Za-z0-9_-]+/ value { $return = {$item[1] => $item[2]} }
unit: /[A-Za-z0-9_-]+/ value /,?/ { $return = {$item[1] => $item[2]} } |
      value /,?/ { $item[1] }

Note that one could really overdo the "reduction of rules/production" rule though - the version below which practically has only one rule (the startrule can be eliminated) actually runs about 20% slower than the version above.

startrule: 'Entrezgene' '::=' value { $item[3] }
value: '{' value(s) '}' /,?/ { GI::Parser::EntrezGene::convert(\@item) } |
       /[A-Za-z0-9_-]+/ value /,?/ { $return = {$item[1] => $item[2]} } |
       '"' /([^"]|"")*/ '"' /,?/ { $item[2] } |
       /[A-Za-z0-9_-]+/ /,?/ { $item[1] }

Again, the gain/loss is related to gain/loss in backtracking during execution. So lessions learned from the optimization process? Get rid of as many rules/productions as you can, as long as the reduction would lead to less backtracking, the major speed bottleneck for Parse::RecDescent or other LL-grammar parsers. It could be hard to judge if one particular reduction of rule/production would lead to less backtracking or not (unless you draw parser tree out and judge, I guess), so I would recommend benchmarking in such situations as an easy test.

Conclusion: Although it was my first time to use Parse::RecDescent, it was not painful at all (well, admittedly, reading the doc was a bit painful, but one'd appreciate the long documentation when really starting to write grammar/actions). The time consumed: one day for documentation and web research, one day for writing grammar, debugging, optimizations. I'm particularly impressed with the savings achieved through optimizations and how fast/easy it was to use Parse::RecDescent for the first time in a practical project, despite that I did not find real practical examples.

Parse::Yapp

My thrill with the working parser based on Parse::RecDescent did not last long - Soon I discovered that some long Entrez Gene records like entry #4539, took 20 minutes for my parser to parse! What was startling was that #4539 was only about 26 fold bigger than #1, yet it took 800 fold longer to parse!

After plenty of web searches, trying other optimization possibilities, I reached the conclusion that my grammar was near optimal already (Parse::RecDescent gurus probably can do it faster, but not by a lot). The problem with long records was mentioned a lot of times on web, and plenty of users mentioned that the author of Parse::RecDescent (Damian Conway) promised big performance improvement during X'mas 2003 or so. Yet version 2.0 of Parse::RecDescent never arrived, so the performance problem cannot be cured. That's when I took the advice of Damian, who said in a mailing list archive that if speed is a primary concern, check out Parse::Yapp and Perl-byacc.

My choice of going Parse::Yapp next is again simple: Parse::Yapp has some documentation while Perl-byacc almost had none (in fact, it doesn't even have a web site, not even one page dedicated to it on cpan!). Parse::Yapp is also pure Perl and generates object-oriented parsers.

Using Parse::Yapp is easy, but learning is not. I wasted a lot of time searching the web or fumbling through the scant documenation trying to find out how to do things. Even with a mature LL-grammar ready, convert() function ready, it still took me nearly 1.5 days to finish my parse using Parse::Yapp (and frankly, most of the time were purely wasted). There's certainly a lot of room for author to improve the documentation, especially for first time users. For people familar with YACC parsers and/or used bison or byacc before, this tool should be relatively easy to use (still, some doc needs to be present to lead users through the transition from C to Perl).

Another disadvantage it has vs. Parse::RecDescent is that it needs user to supply a lexer function (tokenizer, the function that feeds parser with tokens, which would then be used to match with rules and productions). This is rather inconvenient, even if one could resort to some other C-based tools like Lex or Flex to generate lexers automatically and translate into Perl (I didn't find Perl bindings), which is of course a hassle in its own right.

What's more, the debugging and optimization is tough. I frequently get shift/reduction errors with attempts at optimization. These are definitely not very informative error messages, especially when they're all you get whatever you try :) Well, my own ineptitude aside (to be fair, I did not generate a debug version which would generate somewhat more reasonable error messages), it is still rather easy to see how Parse::RecDescent beats this module hands down in this aspect with its built in customizable trace facility. Keep in mind though, that I already have a mature LL-grammar to start with after dealing with Parse::RecDescent. The LL-to-LR conversion in this case was easy. Theoretically the optimization I've done for Parse::RecDescent's LL grammar helps in the speed of the converted LR grammar too since reduction of rules help in both cases.

In spite of all the negative things I have to say about Parse::Yapp so far, I would still rate the module rather high - simply because its parsers are so much faster than Parse::RecDescent parsers! I was able to, after the non-exhustive optimizations (again, most were done when I optimized for Parse::RecDescent), take the time down to about 1.5 minutes to parse entry #4539, as compared to 20 minutes using the Parse::RecDescent parser. (and after one more optimization, the parsing time was down to about 6 seconds! Read on for a not-so-fresh tip on how such speedup was accomplished)

The final grammar used:

rows: row { (defined $_[1])? [ $_[1] ] : [] } |
      rows row { push(@{$_[1]},$_[2]); $_[1] }
;
row: '\n' { undef } |
     value ',' '\n' { $_[1] } |
     value '\n' { $_[1] }
;
value: '{' rows '}' { convert($_[2]) } |
       STATEMENT { $_[1] } |
       IDENT value { my $tmp = {$_[1] => $_[2]} } |
       IDENT { $_[1] }
;

Conclusion: Although it was my first time to use Parse::Yapp, it did not take me too long to produce a parser. However, I would say that Parse::RecDescent helped tremedously in preparing me for using Parse::Yapp, both because of the similarities shared by the modules and the lack of detailed documentation in the latter module. If a total novice starts coding in Parse::Yapp first, it'd probably take at least several days to produce a working parser. Once familiar with Parse::Yapp, however, I would not go back to Parse::RecDescent unless I absolutely have to, despite all of its advantages 'cause, yeah, the speed difference was that big.

Perl-byacc

Just for fun, I checked out perl-byacc too. The lack of documentation for this module is beyond belief - there's virtually none. The author's company seemed to have been purchased by another company, so the module's web site and official download site were all gone. Without experience with Parse::Yapp and reading documentation of bison and byacc (which has little doc) (and trying desperately to imagine how their Perl counterpart should be used), it'd be very hard to use this module.

As with Parse::Yapp, Perl-byacc comes with a couple example grammar files, which contain lexer function as well. However, as with Parse::Yapp, the authors left few comments in the code for users to understand it and certainly the code wasn't written in best style (not that I have better coding style, but I would've added comments if the scripts are written as examples for an otherwise no-documentation-whatsoever module). What's more, as with Parse::Yapp, the lexer functions were written in a very inefficient way due to the hidden assumption that input for parsers would be short strings. (read further why)

Perl-byacc and Parse::Yapp both take YACC grammar. They are similar in every way (even the debugging works similarly by requiring a special debugging version, a trait of their C descent I'd guess)., therefore despite the total lack of documentation, I was able to write a working parser within half a day. And quite frankly, it'd have taken me (or most people) 30 minutes max to adapt from Parse::Yapp to Perl-byacc had there been any usable documentation. Still, I had to use a simple way to trick produce a object-oriented module using Perl-byacc 1.8.2 (although later I found a patch (I did not test) that could enable the generation of OO modules using Perl-byacc, and Perl-byacc 2.0 is said to support OO by default, although again, no doc or comment).

The final grammar used:

rows: row { $$ = (defined $1)? [ $1 ] : []; } |
      rows row { $$ = $1; push(@{$$}, $2); }
;
row: '\n' { $$ = undef; } |
     value ',' '\n' { $$ = $1; } |
     value '\n' { $$ = $1; }
;
value: '{' rows '}' { $$ = convert($2); } |
       STATEMENT { $$ = $1; } |
       IDENT value { $$ = {$1 => $2}; } |
       IDENT { $$ = $1; }
;

Conclusion? Perl-byacc is very hard to use for entire novice users, yet same with Parse::Yapp, once one gets familiar with it, it is not hard to use at all. Its parsers are even faster than the ones generated by Parse::Yapp. Still, documentation-wise (and even example-wise), this module needs tremendous improvement.

Regular expressions

Despite the much more acceptable performance of the parsers generated by Parse::Yapp and Perl-byacc, it still lagged far behind (30-40 fold) the regular expression based parser a colleague in our collaborating company wrote. Granted, using these YACC(-like) tools saved a lot of development time, yet the big speed difference suggests that it might be worth it to bite the bullet and write regex-based parsers for performance. There was just one strange thing - the auto-generated parsers also are mostly regex-based, especially in lexer functions, it just cannot explain such a huge difference in performance.

Regardless, I started out to make a regex-based parser for Entrez Gene. It was easier than I thought - I spent 3 hours to write and debug and got a working version out. The experience of writing lexer functions, thinking about the grammars, and optimizing performance of other parsers definitely helped greatly in writing the regex-based parser. Again, there was just one problem - the regex-based parser ran only one fold faster than YACC parsers. After spending hours to collecting statistics, optimizing and sacrificing code clarity, the parser still needs about 27 seconds to parse Entrez Gene #4539, a mere 4 fold improvements over the YACC parsers.

I decided to profile the performance of the regex parser. The profiling soon made me zero in on one particular performance hit - the string replacement in the lexer functions! Since I started writing lexer for Parse::Yapp, I followed its calc.yp's lexer function style, which chews up the input string when processing it, each time returns the next token to the parser. While this is OK for parsing short string, when parsing long input strings, the s/// string replacement becomes incredibly costly because each time a very short token gets chewed off an input string hundred of KB in size, the allocation of memory, copying etc. is very costly.

OK, the culprit is identified. How about cure? Well, we just need to replace the string replacement statements with string matching statements. But how to keep the next-token position on the string? Naturally, we'd have to use the \G + /g combination. Still, we need to throw in /c as well so that a failed match does not move the next-token position. Therefore the s/^$token// becomes m/\G$token/cg, and voila! - the speed of the regex-based parser was improved nearly 50 fold - now my regex-based parser only needs 0.5 second to parse record #4539! Although the m/\G/cg is not new trick (well documented in "Programming Perl"), it certainly did magic in this case.

After seeing the huge improvements, I immediately changed my lexer functions for the Parse::Yapp and Perl-byacc based parsers to use m/\G/cg, and saw similar improvements. It now takes about 3-5 seconds to parse #4539 even for the YACC parsers!

One thing that rained a little on my parade was that I later found out in fact Parse::Yapp used m/\G/cg in its YappParse.yp file, which is the grammar file for Parse::Yapp itself. However, its calc.yp, a standard example used by pretty much all YACC, used s///. Perl-byacc, on the other hand, always used s/// in its examples. Not good examples. :)

I also tried to improve the performance of Parse::RecDescent (PRD) since PRD also used s/// instead of m/\G/cg in its auto-generated lexers (yet interestingly it, like Parse::Yapp, used m/\G/cg in parsing grammar files, even though grammar files are likely quite small). It took half of my president day holiday and I still failed to improve the speed. In fact, m/\G/cg actually made the code run a bit slower. One possibility is that I misunderstood some code in the huge PRD module, but my modifications did not result in a dysfunctional PRD. Yet since Damian Conway definitely used m/\G/cg in PRD, and he also failed to improve the performance of PRD for years (as I tested various versions of PRD, they performed very similarly over the years), there must be some other reason in the code that's hard to find. So I hastily ended my quest to improve PRD, which was just in time to enjoy the latter half of my holiday. :)

Conclusion: Regular expression based parsers offer the best performance, yet it is not easy to write it as one has to code both the lexer and parser logic. For complex grammars, this could become very hard. But for Entrez Gene's ASN format, which can be described by simple grammar, this worked out perfectly. My regex-based grammar could finish processing human genome in about 11 minutes, mouse genome in less than 9 minutes while rat genome in slightly more than 3 minutes. No bad for a pure Perl parser working on nearly 1 GB of data with over 146000 Entrez Gene records!

Final thoughts on the parsers

After using Parse::RecDescent, Parse::Yapp, Perl-byacc in a practical project and comparing them with a regex-based parser, I could offer the following suggestions:

For data with simple grammars - consider using regex directly; if it's hard, trying using Perl-byacc (once you get familiar with it).
For data with complex grammars - consider Perl-byacc (once you get familiar with it).
For data with context-sensitive grammar - consider Parse::RecDescent (yet if performance is a real concern, one could consider something like ANTLR, which is said to be of good performance, although it's in Java; or SLK, which seems to have very impressive speed)
For any data that can be defined by a context-sensitive grammar where performance is not an issue - use Parse::RecDescent

Also note that the EntrezGene.y and EntrezGene.yp included in my "Entrez Gene Parsers in Perl" sourceforge project could add the the comment-less examples for Parse::Yapp and Perl-byacc ;) Yet I hope my modules could help clearing things up a bit.

The regex-based parser is without a doubt the speed champ. Its code is also very short and easy to maintain and debug. Therefore it is recommended for use with parsing Entrez Gene records in practical projects.

Attempts at explaining the performance differences

Well, there are some interest in understanding the reasons for the speed differences observed for the parsers I created using these various tools. The most valuable conclusion of such attempt would be whether the performance differences observed with here with my 4 parsers are typical of the parsers generated using these tools. In other words, can the benchmark be extrapolated to other practical projects?

My thinking is: Yes, the benchmark here can be extrapolated. Now the answer lies in my attempt to explaining the performance differences between these parsers. It really can be transformed into a few questions & answers:

Why is Parse::Yapp slower than Perl-byacc, although they're both LR-based YACC type of parser generators? Upon inspection of the code, it was highly suspicious that a few things in the implementation of Parse::Yapp could have led to the speed difference. The handling of states and actions are all different. Most importantly, and possibly the main reason for the difference in speed, is that Parse::Yapp handles actions through anonymous functions, while perl-byacc does it in-line. With large files with hundrends of thousands of tokens, this definitely could make a difference. Yet due to the speed of the function calling on modern hardware, I'm almost sure other parts of the coding of Parse::Yapp had something to do with its slightly slower speed too. One thing's for sure, however - it's likely that the parsers generated by Parse::Yapp will always be slower than one based on perl-byacc.
Why are Parse::Yapp, Perl-byacc, and regex parsers having linear performance? It's easy to see why if you check the regex-based code. In the main parse function, there are only a fixed number of regex/rules to check/generate tokens. Since the process of generating token take proportional time vs. input size, each round of execution in the parse function have constant execution time boundaries (which are directly related to the complexity of grammar, that's why optimization of grammar's important), even after throwing in a bit recursion cost, the overall result is linear. Perl-byacc does away with recursion & function calls too, but its larger auto-generated lookup table, state transitions and function calls between parsers/lexers are not as efficient as the handcrafted regex-based parser.
Why is Parse::RecDescent parser non-linear? This is harder to explain. Theoretically, LL-grammar-based parser should also behave linearly. Indeed, for longer input, the number of backtracking/recursion in Parse::RecDescent does not increase for the same input, yet the processing time for EXACTLY the same tokens/productions/rules were much longer! My discovery of Parse::RecDescent's usage of s/// in its logic suggests that this is the reason, yet after replacing all of them with /\G/cg the parsing speed was even slowed a bit further! So although I still strongly suspect s/// has something to do with it, my own hasty and failed attempt at improving Parse::RecDescent cannot confirm my thoughts. Thus it is also possible that some other part of the code in Parse::RecDescent caused the speed slow down. (BTW, granted the function calls, code evaluations in Parse::RecDescent for each production slows things down, but they cannot explain an O(n3) behavior). Yet, once again I believe this performance problem with Parse::RecDescent can be extrapolated to all parsers based on it, which is unfortunate, especially since it's so convience to use.
For a somewhat distantly related question, is LL-grammar parser slower than LR-based grammar? Nope. I don't think so. AFAIK, there's no such theoretical basis for it, the performance problem of Parse::RecDescent cannot be extrapolated to LL parsers. In fact, the LL parser generator SLK even got rid of recursion and uses a lookup table, which makes it really very impressively fast (the fact that it is probably benchmarked in C won't hurt either). The claimed speed was that total program execution time for parsing 5 MB DB-language (very likely more complicated than my Entrez Gene parser) took under 2 seconds on an old 133 MHz computer! Too bad it doesn't have a Perl binding yet. The LL-based ANTLR is also pretty impressive, and I wish it was available in Perl (sigh). Guess some day they will.

Note: This document is written for my project Entrez Gene parsers in Perl. All project files can be found here.

Page created and maintained by: Dr. Mingyi Liu (GPC Biotech)

Project hosted by:

The Making of Entrez Gene parsers in Perl using Parse::RecDescent, Parse::Yapp, Perl-byacc and Regex

The Making of Entrez Gene parsers in Perl using
Parse::RecDescent, Parse::Yapp, Perl-byacc and Regex