Recommend a parsing library for LibPQ.jl?

I’m looking to improve the parsing code in LibPQ.jl and expand support for more data types. Regex matching through PCRE2 is expensive and heavyweight, but writing each parser from scratch is a challenge and I don’t want to reinvent the wheel. The stdlib Dates parser also can’t handle basic timestamps that combine a time zone with optional fractional seconds, so I’d like to fix that and improve date-parsing performance by writing a new parser for PostgreSQL’s ISO 8601 format.
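To make the target format concrete, here is a rough sketch of a byte-level parser for timestamps shaped like `2019-03-01 12:34:56.789+00`. It is not LibPQ.jl code: the names `digits_at` and `parse_pg_timestamptz` are made up for illustration, it assumes the offset is always present and is either `±HH` or `±HH:MM`, and it lets the `DateTime` constructor throw on out-of-range fields rather than validating them itself.

```julia
using Dates

# Read exactly `n` ASCII digits starting at byte `pos`; returns an Int or nothing.
function digits_at(b::AbstractVector{UInt8}, pos::Int, n::Int)
    pos + n - 1 <= length(b) || return nothing
    v = 0
    for i in pos:(pos + n - 1)
        UInt8('0') <= b[i] <= UInt8('9') || return nothing
        v = 10 * v + Int(b[i] - UInt8('0'))
    end
    return v
end

# Hypothetical sketch: parse "YYYY-MM-DD HH:MM:SS[.f{1,6}]±HH[:MM]".
# Returns (DateTime with millisecond precision, leftover microseconds,
# UTC offset in seconds), or nothing if the text is malformed.
function parse_pg_timestamptz(s::AbstractString)
    b = codeunits(s)
    n = length(b)
    n >= 22 || return nothing            # shortest input: 19 bytes + "+00"
    y  = digits_at(b, 1, 4);  mo = digits_at(b, 6, 2);  d   = digits_at(b, 9, 2)
    h  = digits_at(b, 12, 2); mi = digits_at(b, 15, 2); sec = digits_at(b, 18, 2)
    any(isnothing, (y, mo, d, h, mi, sec)) && return nothing
    (b[5] == UInt8('-') && b[8] == UInt8('-') && b[11] == UInt8(' ') &&
     b[14] == UInt8(':') && b[17] == UInt8(':')) || return nothing
    pos = 20
    # Optional fractional seconds (1 to 6 digits), scaled to microseconds.
    micros = 0
    if b[pos] == UInt8('.')
        pos += 1
        nfrac = 0
        while pos <= n && UInt8('0') <= b[pos] <= UInt8('9')
            nfrac += 1
            nfrac > 6 && return nothing
            micros = 10 * micros + Int(b[pos] - UInt8('0'))
            pos += 1
        end
        nfrac == 0 && return nothing
        micros *= 10^(6 - nfrac)
    end
    # Offset: sign, two-digit hours, optional ":MM".
    pos <= n || return nothing
    sgn = if b[pos] == UInt8('+')
        1
    elseif b[pos] == UInt8('-')
        -1
    else
        return nothing
    end
    oh = digits_at(b, pos + 1, 2)
    oh === nothing && return nothing
    pos += 3
    om = 0
    if pos <= n
        b[pos] == UInt8(':') || return nothing
        om = digits_at(b, pos + 1, 2)
        om === nothing && return nothing
        pos += 3
    end
    pos == n + 1 || return nothing       # reject trailing bytes
    dt = DateTime(y, mo, d, h, mi, sec, micros ÷ 1000)
    return (dt, micros % 1000, sgn * (oh * 3600 + om * 60))
end

parse_pg_timestamptz("2019-03-01 12:34:56.123456+05:30")
# → (DateTime(2019, 3, 1, 12, 34, 56, 123), 456, 19800)
```

The point is only that the format is small and regular enough to parse in a single pass over the bytes with no regex, which is the behaviour I’d like a library to give me.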

My requirements are:

  • less memory and execution time than regex
  • performs well parsing small strings one at a time (i.e. doesn’t require parsing a whole file at a time to reach good amortized performance)
  • handles escaping and quoting (like you might expect to encounter in CSV files)

My ideal features (not necessary):

  • composable (if I have a Foo parser I can use it with an Array parser to parse Arrays of Foo)
  • can parse UTF-8 AbstractStrings as well as arbitrary bytes
  • can parse regions in a string (from one position to either a second position or a sentinel)
  • documented, maintained, tested
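
To illustrate what I mean by composable and region-based, here is a small sketch in plain Julia. Both `parse_int` and `parse_array` are hypothetical names, not functions from any existing package; each takes a byte vector plus a start and stop position and returns `(value, next_position)` or `nothing`. The sketch ignores integer overflow, whitespace, NULLs, and nested or quoted elements.

```julia
# Hypothetical element parser: base-10 Int from bytes[pos:stop].
function parse_int(bytes::AbstractVector{UInt8}, pos::Int, stop::Int)
    pos > stop && return nothing
    neg = bytes[pos] == UInt8('-')
    neg && (pos += 1)
    start = pos
    val = 0
    while pos <= stop && UInt8('0') <= bytes[pos] <= UInt8('9')
        val = 10 * val + Int(bytes[pos] - UInt8('0'))
        pos += 1
    end
    pos == start && return nothing        # no digits were consumed
    return (neg ? -val : val, pos)
end

# Hypothetical array parser: handles a flat literal such as "{1,2,3}" and
# delegates every element to `elem`, so any element parser can be plugged in.
function parse_array(elem, ::Type{T}, bytes::AbstractVector{UInt8},
                     pos::Int, stop::Int) where {T}
    (pos <= stop && bytes[pos] == UInt8('{')) || return nothing
    pos += 1
    out = T[]
    if pos <= stop && bytes[pos] == UInt8('}')
        return (out, pos + 1)             # empty array "{}"
    end
    while true
        r = elem(bytes, pos, stop)
        r === nothing && return nothing
        val, pos = r
        push!(out, val)
        pos > stop && return nothing      # input ended before '}'
        b = bytes[pos]
        pos += 1
        b == UInt8('}') && return (out, pos)
        b == UInt8(',') || return nothing
    end
end

# Parse only the region starting at byte 4 of a larger buffer.
bytes = codeunits("id={10,-2,33};")
parse_array(parse_int, Int, bytes, 4, length(bytes))
# → ([10, -2, 33], 14)
```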

Does anyone have recommendations? I can see there are lots of parsing libraries but I’m not sure which are still active and which would be the best for my use case.

2 Likes

I’d be willing to help you use Parsers.jl; it sounds like it fits most of what you’re looking for, both requirements and ideal features. It has always been a bit hard-coded towards CSV.jl, but over the last little while, we’ve been improving the internal APIs to be more robust and less susceptible to accidental breaking. In particular, it can:

  • Take a string or byte vector plus a starting and an ending position, and parse a requested type from that region
  • Compose internal “layers” that handle things like whether quotes are accounted for, stripping whitespace, checking for delimiters, etc. It wouldn’t be too hard to add additional layers that compose together, not unlike a server-side middleware stack
  • Return a Parsers.Result{T} object from Parsers.xparse, which holds the value if parsing succeeded, an Int16 “result code” bit field, and the number of bytes consumed from the input while parsing
  • Parse Dates with a completely stand-alone implementation that operates on bytes instead of characters. By default it can also parse TimeZones if the TimeZones.jl package is loaded. We could probably massage the precision parsing as well, since that has definitely been a pain point with the Dates stdlib.
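
To give a feel for the “layers” and result-code ideas without depending on the package, here is a toy model in plain Julia. Everything in it (`MiniResult`, the `OK`/`QUOTED`/`INVALID` flags, `parse_digits`, `quoted`) is made up for illustration and is not Parsers.jl’s actual definitions; the real flag values and layer types differ. It also skips escape characters inside quotes.

```julia
# Toy flags combined into an Int16 bit field (hypothetical values).
const OK      = Int16(0b001)   # a value was parsed
const QUOTED  = Int16(0b010)   # the field was wrapped in quotes
const INVALID = Int16(0b100)   # the bytes did not form a valid value

# Toy result: value (or nothing), flag bit field, bytes consumed.
struct MiniResult{T}
    val::Union{T,Nothing}
    code::Int16
    tlen::Int
end

# Innermost parser: unsigned decimal digits from bytes[pos:stop].
function parse_digits(bytes, pos, stop)
    start = pos
    val = 0
    while pos <= stop && UInt8('0') <= bytes[pos] <= UInt8('9')
        val = 10 * val + Int(bytes[pos] - UInt8('0'))
        pos += 1
    end
    pos == start && return MiniResult{Int}(nothing, INVALID, 0)
    return MiniResult{Int}(val, OK, pos - start)
end

# A "layer": wraps any inner parser and adds optional double-quote handling,
# setting the QUOTED bit and counting the quote bytes as consumed.
function quoted(inner)
    return function (bytes, pos, stop)
        isquoted = pos <= stop && bytes[pos] == UInt8('"')
        isquoted || return inner(bytes, pos, stop)
        r = inner(bytes, pos + 1, stop)
        closepos = pos + 1 + r.tlen
        if (r.code & OK) == 0 || closepos > stop || bytes[closepos] != UInt8('"')
            return typeof(r)(nothing, INVALID | QUOTED, r.tlen + 1)
        end
        return typeof(r)(r.val, r.code | QUOTED, r.tlen + 2)
    end
end

parse_field = quoted(parse_digits)
r = parse_field(codeunits("\"42\",7"), 1, 6)
# r.val == 42, r.tlen == 4, and (r.code & QUOTED) != 0
```

Because the caller gets back both the flags and the number of bytes consumed, it can decide for itself how to treat a quoted or invalid field and where to resume parsing, which is the same division of labour the real Result object is meant to provide.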

Anyway, happy to do a call and walk you through how the APIs work if you think it would be a good fit. @drvi and @nickrobinson251 have been helping a lot to document and improve these “next layer down” APIs.

4 Likes