Recommend a parsing library for LibPQ.jl?

iamed2 · March 13, 2023, 6:38pm

I’m looking to improve the parsing code in LibPQ and expand support for more data types. PCRE2 is expensive and heavyweight, but it’s a challenge to write each parser from scratch, and I don’t want to reinvent the wheel. The stdlib Dates parser also can’t handle basic timestamps that have time zones and optional fractional seconds, so I’d like to write a new parser for PostgreSQL’s ISO 8601 format, which should improve date parsing performance as well.

My requirements are:

- uses less memory and execution time than regex
- performs well when parsing small strings one at a time (i.e. doesn’t require parsing a whole file at a time to reach good amortized performance)
- handles escaping and quoting (like you might expect to encounter in CSV files)

My ideal features (not necessary):

- composable (if I have a Foo parser I can use it with an Array parser to parse Arrays of Foo)
- can parse UTF-8 AbstractStrings as well as arbitrary bytes
- can parse regions in a string (from one position to either a second position or a sentinel)
- documented, maintained, tested
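
To make the "composable" and "regions" items concrete, here is a minimal sketch in plain Julia. It is not the API of any existing package: `parse_int` and `parse_array` are hypothetical names, and the convention that a parser is a function `(bytes, pos, stop)` returning `(value, next_pos)` or `nothing` is an assumption made for illustration. Quoting and escaping are left out.

```julia
# Hypothetical sketch: all names and conventions here are invented for illustration.
# A "parser" is any function (bytes, pos, stop) -> (value, next_pos), or nothing on failure.

# Parse a base-10 integer from the region bytes[pos:stop].
function parse_int(bytes::AbstractVector{UInt8}, pos::Int, stop::Int)
    neg = false
    if pos <= stop && bytes[pos] == UInt8('-')
        neg = true
        pos += 1
    end
    digits_start = pos
    n = 0
    while pos <= stop && UInt8('0') <= bytes[pos] <= UInt8('9')
        n = 10 * n + Int(bytes[pos] - UInt8('0'))
        pos += 1
    end
    pos == digits_start && return nothing   # no digits found
    return (neg ? -n : n, pos)
end

# Parse "{a,b,c}" where each element is handled by `parse_elem`.
function parse_array(::Type{T}, parse_elem, bytes::AbstractVector{UInt8},
                     pos::Int, stop::Int) where {T}
    (pos <= stop && bytes[pos] == UInt8('{')) || return nothing
    pos += 1
    values = T[]
    if pos <= stop && bytes[pos] == UInt8('}')   # empty array "{}"
        return (values, pos + 1)
    end
    while true
        result = parse_elem(bytes, pos, stop)
        result === nothing && return nothing
        value, pos = result
        push!(values, value)
        pos <= stop || return nothing            # input ended before '}'
        if bytes[pos] == UInt8(',')
            pos += 1
        elseif bytes[pos] == UInt8('}')
            return (values, pos + 1)
        else
            return nothing
        end
    end
end

# Parse only a region of a larger buffer: the array starts at byte 5.
bytes = codeunits("ids={1,-22,333};")
flat = parse_array(Int, parse_int, bytes, 5, length(bytes))

# Compose the array parser with itself to get arrays of arrays.
inner = (b, p, s) -> parse_array(Int, parse_int, b, p, s)
nested_src = codeunits("{{1,2},{3}}")
nested = parse_array(Vector{Int}, inner, nested_src, 1, length(nested_src))
```

Because the element parser is just an argument, the same `parse_array` works for any element type, and returning the next position lets a caller continue parsing the rest of the buffer without copying substrings.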

Does anyone have recommendations? I can see there are lots of parsing libraries but I’m not sure which are still active and which would be the best for my use case.

quinnj · March 13, 2023, 10:14pm

I’d be willing to help you use Parsers.jl; it sounds like it fits most of what you’re looking for, both requirements and ideal features. It has always been a bit hard-coded towards CSV.jl, but over the last little while, we’ve been improving the internal APIs to be more robust and less susceptible to accidental breakage. In particular, it can:

- Take a string or byte vector, a starting position, and an ending position, and parse a requested type
- Compose internal “layers” for things like accounting for quotes, stripping whitespace, checking for delimiters, etc. It wouldn’t be too hard to add additional layers that compose together, not unlike a server-side middleware stack
- Return a Parsers.Result{T} object from Parsers.xparse, which holds a value if parsing succeeded, an Int16 “result code” bit field, and the number of bytes consumed from the input while parsing
- Parse Dates with a completely stand-alone implementation that operates on bytes instead of characters. By default it can parse TimeZones as well if the TimeZones.jl package is loaded. We could probably massage the precision parsing too, since that has definitely been a pain point with Dates stdlib support.
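
For readers who want a feel for what byte-level timestamp parsing with a "value, result code, bytes consumed" result looks like, here is a small stand-alone sketch. It is not Parsers.jl's implementation or API: `SketchResult`, the `CODE_*` flags, `fixed_digits`, and `parse_timestamp` are hypothetical names, the flag values are made up, and the accepted format (`yyyy-mm-dd HH:MM:SS`, optional fractional seconds, optional `+HH` or `+HH:MM` offset) is an assumption about typical PostgreSQL output. The value is normalized to UTC and truncated to millisecond precision.

```julia
using Dates

# Hypothetical result codes combined as bit flags (not Parsers.jl's actual values).
const CODE_OK      = Int16(1)
const CODE_INVALID = Int16(2)
const CODE_EOF     = Int16(4)   # parsing reached the end of the region

struct SketchResult{T}
    code::Int16
    consumed::Int               # bytes consumed from the input
    value::Union{T,Nothing}
end

# Read exactly `n` ASCII digits starting at `pos`; returns (value, next_pos) or nothing.
function fixed_digits(bytes, pos, stop, n)
    pos + n - 1 <= stop || return nothing
    v = 0
    for i in pos:(pos + n - 1)
        b = bytes[i]
        UInt8('0') <= b <= UInt8('9') || return nothing
        v = 10 * v + Int(b - UInt8('0'))
    end
    return (v, pos + n)
end

function parse_timestamp(bytes::AbstractVector{UInt8}, pos::Int=1, stop::Int=length(bytes))
    start = pos
    fail(p) = SketchResult{DateTime}(CODE_INVALID, p - start, nothing)
    fields = Int[]
    # Field widths for yyyy-mm-dd HH:MM:SS and the separator expected after each field.
    for (width, sep) in ((4, '-'), (2, '-'), (2, ' '), (2, ':'), (2, ':'), (2, nothing))
        r = fixed_digits(bytes, pos, stop, width)
        r === nothing && return fail(pos)
        v, pos = r
        push!(fields, v)
        if sep !== nothing
            (pos <= stop && bytes[pos] == UInt8(sep)) || return fail(pos)
            pos += 1
        end
    end
    # Optional fractional seconds: keep the first three digits, drop the rest.
    millis = 0
    if pos <= stop && bytes[pos] == UInt8('.')
        pos += 1
        nfrac = 0
        while pos <= stop && UInt8('0') <= bytes[pos] <= UInt8('9')
            nfrac < 3 && (millis = 10 * millis + Int(bytes[pos] - UInt8('0')))
            nfrac += 1
            pos += 1
        end
        nfrac == 0 && return fail(pos)
        millis *= 10^max(0, 3 - nfrac)   # e.g. ".5" means 500 ms
    end
    # Optional UTC offset: +HH or +HH:MM (or with a leading '-').
    offset_minutes = 0
    if pos <= stop && (bytes[pos] == UInt8('+') || bytes[pos] == UInt8('-'))
        sgn = bytes[pos] == UInt8('-') ? -1 : 1
        r = fixed_digits(bytes, pos + 1, stop, 2)
        r === nothing && return fail(pos)
        hours, pos = r
        mins = 0
        if pos <= stop && bytes[pos] == UInt8(':')
            r = fixed_digits(bytes, pos + 1, stop, 2)
            r === nothing && return fail(pos)
            mins, pos = r
        end
        offset_minutes = sgn * (60 * hours + mins)
    end
    # DateTime throws ArgumentError for out-of-range fields such as month 13.
    value = try
        DateTime(fields..., millis) - Minute(offset_minutes)
    catch err
        err isa ArgumentError || rethrow()
        return fail(pos)
    end
    code = CODE_OK
    pos > stop && (code |= CODE_EOF)
    return SketchResult{DateTime}(code, pos - start, value)
end

with_zone = parse_timestamp(codeunits("2023-03-13 18:38:00.5+01"))
plain = parse_timestamp(codeunits("2023-03-13 18:38:00"))
bad = parse_timestamp(codeunits("2023-13-13 18:38:00"))
```

A caller would check `(result.code & CODE_OK) != 0` before using `result.value`, and can use `result.consumed` to continue parsing from the next byte of a larger buffer.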

Anyway, happy to do a call and walk you through how the APIs work if you think it sounds like it would work. @drvi and @nickrobinson251 have been helping a lot to document and improve these “next layer down” APIs.
