Motivation
It wasn’t until a few years ago that I had the opportunity to build a parser for a new search syntax we planned to embed in Driven, a contextual monitoring tool for Cascading and other data frameworks.
One of the requirements was that the grammar be dynamic, that is, generated at runtime based on configuration or available capabilities.
Most everyone is familiar with ANTLR, a parser generator. Give ANTLR a grammar, and out comes code that, once compiled, can parse your syntax. This all happens at build time, not at runtime, thus ruling ANTLR out as an option.
During my research for alternatives I came across the notion of a “parsing expression grammar” (PEG) and was pleased by how quickly I could get my head around it. PEGs are a close alternative to “context-free grammars” (CFGs).
I also found quite a few implementations, one in particular in Java.
This is not a tutorial on parsers, so I won’t go into the details. But I do want to introduce two parser implementations I’ve used.
Again, the first was for the Driven search syntax. Driven relied on Elasticsearch for storage and search. It had a data/document model that allowed for hierarchical (or contextual) relationships. That is, except for the root document, every document was a child of its parent.
Being a monitoring tool, Driven should be able to answer the question “how many applications were running on Thursday that had a Hadoop task that ran for more than 20 minutes?”
Without going into too much detail, an Application was a parent to a Unit of Work, in turn the parent of a Job, and in turn the parent of a Task. To translate the original query, “find all Applications that had at least one child Task that had a runtime of 20 minutes or greater”.
There were a couple of problems here. First, Elasticsearch doesn’t really do joins, and second, it inherits the Lucene search syntax, which doesn’t like ‘units of measure’, think GB or ms.
So we needed a syntax that allowed a user to construct hierarchical queries that could be translated into one or more Elasticsearch queries, and also could translate ‘minutes’ and other units into primitive scalar values within those queries.
Borrowing heavily from the existing Lucene syntax, we came up with:
app:( task:( duration:>=20min ) )
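
To give a flavor of the unit handling, here is a sketch of the idea, not Driven’s actual implementation; the unit table and method names are hypothetical. It translates a suffixed scalar like 20min into the millisecond value Elasticsearch would ultimately store:

import java.util.Map;

// A sketch of unit normalization; the unit table and names are
// hypothetical, not Driven's actual code.
public class Units {
    private static final Map<String, Long> TO_MILLIS = Map.of(
        "ms", 1L,
        "sec", 1_000L,
        "min", 60_000L,
        "hr", 3_600_000L);

    static long normalize(String value) {
        // Split the numeric prefix from the unit suffix, e.g. "20" + "min".
        int i = 0;
        while (i < value.length() && Character.isDigit(value.charAt(i))) i++;
        long scalar = Long.parseLong(value.substring(0, i));
        return scalar * TO_MILLIS.get(value.substring(i));
    }

    public static void main(String[] args) {
        System.out.println(normalize("20min")); // 1200000
    }
}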
Parboiled
Enter the Parboiled project, which offers both a Java and a Scala API. At the time I had some confusion about the state of the project. It was clear the author wished to continue his work on the Scala API via the new project Parboiled2.
But I persisted with the original Parboiled Java API, hoping I would find any major issues early enough to avoid an unrecoverably huge commitment. Turns out I found no notable issues with the implementation, though we did have to re-package it to shade ASM, as other libraries we were using caused version conflicts.
As we had a well-defined syntax, anything opaque would be contained in quotes. But even then we wanted to support relative time frames like “4 years ago”, “this year”, or “yesterday”. These had to be well structured, otherwise finding a PEG grammar would be difficult.
At runtime, when generating the parser, it was very natural to detect that a search term was a date-time value (startTime, finishTime, etc.), making it trivial to have the parser allow for both absolute dates (“2015-02-10”) and relative dates (“yesterday”) via an OR operator.
Subsequently, we had to create new PEG grammars for absolute and relative dates/times. Thus for *Time fields, the sub-parser would kick in, parse the quoted values, and resolve them into usable values (the actual date of “yesterday”). This also had the effect of syntax checking date/time values, which required us to codify a large number of formats we would encounter.
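
As a concrete illustration, here is a minimal sketch in the Parboiled Java API of what such a date sub-parser looks like; the rule names and grammar are hypothetical, not the actual Driven grammar:

import org.parboiled.BaseParser;
import org.parboiled.Rule;

// A minimal sketch of a date sub-parser; the rules are hypothetical.
class DateParser extends BaseParser<Object> {

    // Accept either form via PEG ordered choice, i.e. the OR operator.
    public Rule DateValue() {
        return FirstOf(AbsoluteDate(), RelativeDate());
    }

    // e.g. "2015-02-10"
    public Rule AbsoluteDate() {
        return Sequence(NTimes(4, Digit()), '-', NTimes(2, Digit()), '-', NTimes(2, Digit()));
    }

    // e.g. "yesterday" or "4 years ago"
    public Rule RelativeDate() {
        return FirstOf("yesterday", "today",
            Sequence(OneOrMore(Digit()), ' ', FirstOf("years", "days"), " ago"));
    }

    public Rule Digit() {
        return CharRange('0', '9');
    }
}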
Parboiled was a pleasure to work with, though I suspect the Scala API is light years beyond what the Java API can do naturally. At the time of this writing, the Java API still shows recent commits.
It is worth pointing out that the Parboiled Java API leveraged ASM in an interesting way.
Parboiled is both a Java API and a DSL for declaring rules. Even though you use Java to define rules, you must run your parser class through the Parboiled.createParser() method, which applies its “parser extension logic” (see the link for more details on how this works). This results in very maintainable PEG rule logic, and also makes it obvious why the author has been focusing on Scala: to allow for even more succinct and maintainable rule declarations leveraging native Scala language features.
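
For example, instantiating and running the hypothetical DateParser sketched above looks like this (Parboiled 1.x as published):

import org.parboiled.Parboiled;
import org.parboiled.parserunners.ReportingParseRunner;
import org.parboiled.support.ParsingResult;

public class DateDemo {
    public static void main(String[] args) {
        // createParser() applies the ASM-based "parser extension logic"
        // to the rule methods before instantiating the class.
        DateParser parser = Parboiled.createParser(DateParser.class);

        ParsingResult<Object> result =
            new ReportingParseRunner<Object>(parser.DateValue()).run("4 years ago");
        System.out.println(result.matched); // true
    }
}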
Another Motivation
A few years later I was presented with the problem of parsing unstructured text and JSON documents with embedded values in order to redact sensitive information. That is, I needed to remove personally identifying information (names, phone numbers, addresses, etc.) from emails or other records in JSON.
What’s interesting about this problem is that we had structure within structure. It made no sense to write a JSON parser when we were simply looking for phone numbers. Nor did it make sense to write a comprehensive phone number parser when Google’s libphonenumber library already exists.
Again, we had a runtime requirement: we only needed to remove the current name in question, but we wanted all permutations of that name, even misspellings if possible. That is, for the current record, find the name we are redacting and redact it; not redact all possible names from all documents. The latter would move us from redaction into privacy territory, which isn’t achievable in a reasonable time frame with limited compute.
Turns out this was a search and replace problem; that is, we wanted to find distinguished values and replace them with XXXX and/or other meta-data (such as the reason the value was redacted, for debugging or quality issues).
Further, we didn’t need one large parser, but the ability to mix and match various match and replace functions. That is, if scanning JSON, we only need to find the ‘phone-number’ JSON field and apply the find/replace for a phone number to the value of that field. But if we were parsing a full text document, we could join (via an OR operator) the phone number find/replace with the “fuzzy name” find/replace along with the postal address find/replace function. This dramatically improves performance.
For whatever reason, Parboiled didn’t seem to fit.
jparsec
So we landed on jparsec, “a recursive-descent parser combinator framework” and a port of Haskell’s Parsec. This notion of being a “combinator” was quite handy.
In practice it felt a lot like a PEG parser, as the combinators themselves matched the PEG operators. I’m sure someone will point out they are the same thing, then someone else will note otherwise. Cheers!
This library allowed us to build a framework for complex search and replace functions where we could leverage existing “sub-parsers”; more importantly, we could combine these sub-parsers into larger parser combinations.
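
A minimal sketch of the idea, assuming the current org.jparsec packaging; the two matchers below are toy stand-ins for the real sub-parsers:

import org.jparsec.Parser;
import org.jparsec.Parsers;
import org.jparsec.Scanners;

// Hypothetical sub-parsers; stand-ins for the real matchers.
public class CombineSketch {
    static final Parser<String> PHONE = Scanners.string("555-867-5309").source();
    static final Parser<String> NAME = Scanners.string("Jane Doe").source();

    // Combine the sub-parsers into a larger parser: ordered alternatives.
    static final Parser<String> SENSITIVE = Parsers.or(PHONE, NAME);

    public static void main(String[] args) {
        System.out.println(SENSITIVE.parse("Jane Doe")); // matches via the NAME branch
    }
}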
libphonenumber was definitely one library used, obviously for U.S. and international phone numbers.
And the ability to embed regular expressions as match operators allowed us to quickly inject work from other non-Java projects. For matching postal addresses via regular expression, see the projects http://search.cpan.org/~timb/Geo-StreetAddress-US-1.04/US.pm and https://github.com/jamesrcounts/usaddress.
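
Embedding a regex as a match operator is straightforward; here is a sketch where the pattern is a deliberately naive placeholder, not the Geo-StreetAddress-US grammar:

import org.jparsec.Parser;
import org.jparsec.Scanners;
import org.jparsec.pattern.Patterns;

public class RegexSketch {
    // A naive street-address pattern, used only to show the mechanics.
    static final Parser<String> STREET =
        Scanners.pattern(Patterns.regex("\\d+ [A-Za-z]+ (St|Ave|Rd)"), "street").source();

    public static void main(String[] args) {
        System.out.println(STREET.parse("123 Main St"));
    }
}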
For name matching, the FuzzyWuzzy algorithm was quite handy as a sub-parser.
I also highly recommend jparsec.