I am excited to announce DLMReader, a Julia package for reading delimited files.
Introduction
DLMReader.jl is a multithreaded package for reading delimited files, designed for Julia 1.6+ (64-bit OS). The package performance is a trade-off between:
- Flexibility
- Memory efficiency
- Low compilation time
- High performance
The package scans the entire file (when `limit` is set, it may scan the input file twice) to gather information about the structure of the delimited file. This helps DLMReader scale well for huge files, but it is less effective for small files. Once the information about the file structure is obtained, it is sent to an internal distributor. The distributor starts multiple threads (it uses one thread when `threads = false` or the file size is small) and distributes the input file among them. Each thread allocates `buffsize` bytes of memory and starts processing each line of its chunk. Each line is stored as a customised type, called `LineBuffer`, and is sent to the `line_informat` to be pre-processed for parsing. Each thread then searches for each field in the line being processed and sends the raw text of each field to its value `informat`, and the result is passed to the main parser for parsing and storing. DLMReader uses the Julia base functions for parsing `Integer`s, `Real`s, and `String`s, the Dates package for parsing `DateTime`s, and the UUIDs package for parsing `UUID`s.
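
The chunking step described above can be illustrated in plain Julia. This is only a sketch under stated assumptions, not DLMReader's actual implementation: the names `chunk_ranges`, `count_fields`, and `fields_per_line` are hypothetical, the input is an in-memory byte vector rather than a file read through `buffsize`-byte buffers, quoted fields are ignored, and the per-chunk "parsing" is reduced to counting fields.

```julia
# Hypothetical sketch: split a byte buffer into chunks that end on line
# boundaries and process each chunk in its own task.

const NEWLINE = UInt8('\n')

# Return index ranges covering `data`, each ending right after a newline
# (or at the end of the buffer), so no line is split between two chunks.
function chunk_ranges(data::Vector{UInt8}, nchunks::Int)
    ranges = UnitRange{Int}[]
    n = length(data)
    n == 0 && return ranges
    approx = max(1, cld(n, nchunks))
    lo = 1
    while lo <= n
        hi = min(lo + approx - 1, n)
        # Extend the chunk to the next newline so the last line stays whole.
        while hi < n && data[hi] != NEWLINE
            hi += 1
        end
        push!(ranges, lo:hi)
        lo = hi + 1
    end
    return ranges
end

# Count the fields in every line of one chunk (a stand-in for real parsing).
function count_fields(data::Vector{UInt8}, r::UnitRange{Int}, delim::UInt8)
    counts = Int[]
    nfields = 1
    for i in r
        b = data[i]
        if b == delim
            nfields += 1
        elseif b == NEWLINE
            push!(counts, nfields)
            nfields = 1
        end
    end
    # A final line without a trailing newline still counts as a line.
    if data[last(r)] != NEWLINE
        push!(counts, nfields)
    end
    return counts
end

# Distribute the chunks among tasks and concatenate the results in order.
function fields_per_line(text::AbstractString; nchunks::Int = Threads.nthreads(), delim::Char = ',')
    data = Vector{UInt8}(codeunits(text))
    tasks = [Threads.@spawn count_fields(data, r, UInt8(delim)) for r in chunk_ranges(data, nchunks)]
    return reduce(vcat, fetch.(tasks); init = Int[])
end

sample = "a,b,c\n1,2,3\n4,5,6\n7,8,9"
println(fields_per_line(sample; nchunks = 3))  # prints [3, 3, 3, 3]
```

Because every chunk is extended to the next newline, the number of chunks actually produced can be smaller than `nchunks`; in the sample above, three requested chunks collapse into two.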
Features
DLMReader.jl has some interesting features that distinguish it from other packages for reading delimited files. In what follows, I go through some of them:
- Informats: The DLMReader package uses `informats` to call a class of functions on the raw text before parsing its value(s). This provides a flexible and extendable approach to parsing values with special patterns. For instance, the predefined informat `COMMA!` allows users to read a numeric column with a thousands separator and/or the dollar sign, e.g. with this informat, raw text like "$12,000.00" is parsed as "12000.00". Moreover, `informat`s support function composition, e.g. `COMMA! ∘ ACC!` parses "$(12,000.00)" as "-12000.00", i.e. `ACC!` is applied first and then `COMMA!` is applied to its result.
  - Additionally, `informats` can be applied to the whole line before processing individual values.
- Fixed-width text: If users pass the column locations via the `fixed` keyword argument, the package reads those columns in fixed-width format. For instance, passing `fixed = Dict(1=>1:1, 2=>2:2)` parses "10" as "[1,0]". Mixing fixed-width and delimited formats is also allowed.
- Multiple observations per line: The package allows reading more than one observation per line by passing the `multiple_obs = true` keyword argument. The multithreading feature (plus some other features) is switched off when this option is set.
- Fast file writer: The DLMReader package exploits the `byrow` function from InMemoryDatasets.jl to write delimited files to disk. This enables DLMReader to convert values to strings using multiple threads.
- Alternative delimiters: Users can pass a vector of delimiters to the function. In this case, `filereader` treats any of the passed delimiters as a field delimiter.
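
To make the ideas behind informats, fixed-width columns, and alternative delimiters concrete, here is a small stand-alone sketch in plain Julia. It deliberately does not call DLMReader: the functions `strip_comma`, `accounting`, `read_fixed`, and `split_any` are hypothetical stand-ins that work on ordinary ASCII strings, whereas the package's own informats (`COMMA!`, `ACC!`) operate on its internal line type. The sketch only mirrors the behaviour described in the list above.

```julia
# Hypothetical stand-ins for informats: plain String -> String functions.
# (DLMReader's own COMMA! and ACC! are not used here.)

# Drop the dollar sign and thousands separators: "\$12,000.00" -> "12000.00".
strip_comma(s::AbstractString) = replace(s, r"[$,]" => "")

# Accounting notation: a value wrapped in parentheses is negative,
# e.g. "\$(12,000.00)" -> "-\$12,000.00".
function accounting(s::AbstractString)
    if occursin('(', s) && endswith(s, ')')
        return "-" * replace(s, r"[()]" => "")
    end
    return String(s)
end

# Function composition: the right-hand function runs first,
# so `accounting` is applied before `strip_comma`.
comma_acc = strip_comma ∘ accounting

# Fixed-width fields: cut each column out of the line by character range
# (assumes ASCII input, so byte ranges and character ranges coincide).
function read_fixed(line::AbstractString, cols::Dict{Int,UnitRange{Int}})
    return [parse(Int, line[cols[k]]) for k in sort!(collect(keys(cols)))]
end

# Alternative delimiters: any character in `delims` ends a field.
split_any(line::AbstractString, delims::Vector{Char}) = split(line, delims)

println(parse(Float64, strip_comma("\$12,000.00")))   # prints 12000.0
println(parse(Float64, comma_acc("\$(12,000.00)")))   # prints -12000.0
println(read_fixed("10", Dict(1 => 1:1, 2 => 2:2)))   # prints [1, 0]
println(split_any("a,b;c|d", [',', ';', '|']))
```

Keeping each clean-up step as a small function and combining them with `∘` is what makes this style extendable: a new pattern only needs one more function in the chain.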
Benchmarks
The following benchmarks present preliminary results for DLMReader.jl's performance compared to the polars and data.table packages. The benchmarks are based on the db-benchmark repository, i.e. they report the time that each package spends reading each file in that repository.
Some remarks about the presented benchmarks (see the InMemoryDatasets announcement):
- Each read is done once (this includes compilation time for DLMReader.jl).
- The reported times do not include converting columns to pooled vectors (the same for all solutions).
- I report the total time, and use `fail` when a solution cannot complete a task.
- I use a Linux RHEL 7 machine with 16 cores and 128 GB of memory.
- The results are based on the latest version of the benchmarked packages plus the latest PRs submitted to the db-benchmark project.
- The OS file cache is freed for polars. The reason is that polars exploits the OS file cache, which significantly improves its reading performance; this does not affect the other solutions.
The `groupby` task timing in seconds (smaller is better)

| Data | DLMReader | polars | data.table |
|---|---|---|---|
| 1e7 - 2e0 | 18 | 15 | 11 |
| 1e7 - 1e1 | 19 | 13 | 5 |
| 1e7 - 1e2 | 19 | 15 | 3 |
| 1e8 - 2e0 | 29 | 168 | 132 |
| 1e8 - 1e1 | 31 | 167 | 83 |
| 1e8 - 1e2 | 32 | 169 | 53 |
| 1e9 - 2e0 | 173 | 1624 | 1766 |
| 1e9 - 1e1 | 174 | 1605 | 1137 |
| 1e9 - 1e2 | 153 | 1640 | 727 |
The `join` task timing in seconds (smaller is better)

| Data | DLMReader | polars | data.table |
|---|---|---|---|
| 1e7 | 26 | 28 | 32 |
| 1e8 | 61 | 304 | 441 |
| 1e9 | 269 | fail | fail |