[ANN] DLMReader: the most versatile Julia package for reading delimited files yet

sl-solution · May 30, 2022, 6:02am

I am excited to announce DLMReader, a Julia package for reading delimited files.

Introduction

DLMReader.jl is a multithreaded package for reading delimited files, designed for Julia 1.6+ on 64-bit operating systems. The package's performance is a trade-off between:

Flexibility
Memory efficiency
Low compilation time
High performance

The package scans the entire file (when limit is set, it may scan the input file twice) to gather information about the structure of the delimited file. This helps DLMReader scale well for huge files, although it is less effective for small files.

Once the information about the file structure is obtained, it is sent to an internal distributor. The distributor starts multiple threads (it uses one thread when threads = false or the file is small) and distributes the input file among them. Each thread allocates buffsize bytes of memory and starts processing each line of its chunk. Each line is stored as a custom type, called “LineBuffer”, and is sent to the line_informat to be pre-processed for parsing. Each thread then searches for each field in the line being processed and sends the raw text of each field to its value informater; the result is passed to the main parser for parsing and storing.

DLMReader uses the Julia base functions for parsing Integers, Reals, and Strings, the Dates package for parsing DateTimes, and the UUIDs package for parsing UUIDs.
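The chunk distribution described above can be illustrated with a minimal sketch in plain Julia. This is not DLMReader's internal code: `chunk_ranges` and `count_lines` are hypothetical names, and the sketch only assumes that each thread's chunk should end on a line boundary so that no line is split between threads.

```julia
# Illustrative only: split a byte buffer into chunks that end on '\n'.
function chunk_ranges(buf::Vector{UInt8}, nchunks::Int)
    n = length(buf)
    ranges = UnitRange{Int}[]
    start = 1
    for k in 1:nchunks
        start > n && break
        # Tentative end of chunk k; the last chunk always runs to the end.
        stop = k == nchunks ? n : min(n, max(start, (k * n) ÷ nchunks))
        # Extend the chunk so that it ends on a line boundary.
        while stop < n && buf[stop] != UInt8('\n')
            stop += 1
        end
        push!(ranges, start:stop)
        start = stop + 1
    end
    return ranges
end

# Each task handles its own chunk; here "processing" is just counting lines.
function count_lines(buf::Vector{UInt8}, nchunks::Int = Threads.nthreads())
    tasks = map(chunk_ranges(buf, nchunks)) do r
        Threads.@spawn count(==(UInt8('\n')), view(buf, r))
    end
    return mapreduce(fetch, +, tasks; init = 0)
end

buf = Vector{UInt8}("a,b\n1,2\n3,4\n")
chunk_ranges(buf, 3)   # three ranges, each ending on a newline
count_lines(buf, 3)    # 3
```

Because every chunk ends on a newline, each task can parse its lines independently, which is the property a multithreaded reader needs before it can hand chunks to separate threads.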

Features

DLMReader.jl has some interesting features that distinguish it from other packages for reading delimited files. In what follows, I go through some of them:

Informats: The DLMReader package uses informats to call a class of functions on the raw text before parsing its value(s). This provides a flexible and extensible approach to parsing values with special patterns. For instance, the predefined informat COMMA! allows users to read a numeric column containing a “thousands separator” and/or the dollar sign, e.g. with this informat, raw text like “$12,000.00” is parsed as “12000.00”. Moreover, informats support function composition, e.g. COMMA! ∘ ACC! parses “$(12,000.00)” as “-12000.00”, i.e. ACC! is applied first and then COMMA! is applied to its result.
- Additionally, informats can be applied to the whole line before individual values are processed.
Fixed-width text: If users pass the column locations via the fixed keyword argument, the package reads those columns in fixed-width format. For instance, passing fixed = Dict(1=>1:1, 2=>2:2) parses “10” as “[1,0]”. Mixing fixed-width and delimited formats is also allowed.
Multiple observations per line: The package allows reading more than one observation per line. This can be done by passing the multiple_obs = true keyword argument. Multithreading (and some other features) is switched off when this option is set.
Fast file writer: The DLMReader package exploits the byrow function from InMemoryDatasets.jl to write delimited files to disk. This enables DLMReader to convert values to strings using multiple threads.
Alternative delimiters: Users can pass a vector of delimiters to the function. In this case, filereader treats any of the passed delimiters as a field delimiter.
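To make the behaviour of these features concrete, here is a small sketch in plain Julia. It does not call DLMReader: `strip_comma`, `strip_acc`, `parse_fixed`, and `split_observations` are hypothetical helpers that only mimic, under my own assumptions, the transformations described in the list above.

```julia
# Illustrative only: plain functions that mimic what the feature list describes.

# Mimics COMMA!: drop the dollar sign and thousands separators.
strip_comma(s::AbstractString) = replace(s, r"[$,]" => "")

# Mimics ACC!: accounting-style parentheses denote a negative number.
function strip_acc(s::AbstractString)
    occursin('(', s) && occursin(')', s) || return s
    return "-" * replace(s, r"[()]" => "")
end

# Composition applies the right-hand function first, as in COMMA! ∘ ACC!.
to_number = strip_comma ∘ strip_acc
parse(Float64, to_number("\$(12,000.00)"))   # -12000.0

# Fixed-width: column k is read from the character range fixed[k].
function parse_fixed(line::AbstractString, fixed::Dict{Int,UnitRange{Int}})
    return [parse(Int, line[fixed[k]]) for k in sort!(collect(keys(fixed)))]
end
parse_fixed("10", Dict(1 => 1:1, 2 => 2:2))   # [1, 0]

# Multiple observations per line (assumption: a line holds a whole number
# of observations, each with ncols fields).
function split_observations(line::AbstractString, ncols::Int; delim = ',')
    fields = split(line, delim)
    length(fields) % ncols == 0 ||
        throw(ArgumentError("line does not hold a whole number of observations"))
    return [fields[i:i+ncols-1] for i in 1:ncols:length(fields)]
end
split_observations("1,2,3,4", 2)   # two observations: ["1", "2"] and ["3", "4"]

# Alternative delimiters: any character in the vector separates fields.
split("a,b;c", [',', ';'])   # ["a", "b", "c"]
```

In DLMReader itself these behaviours are selected through the informats and keyword arguments named above rather than through hand-written helpers like these.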

Benchmarks

The following benchmarks present preliminary results for DLMReader.jl’s performance compared to the polars and data.table packages. The benchmarks are based on the db-benchmark repository, i.e. they are the time that each package spends reading each file in the aforementioned repository.

Some remarks about the presented benchmarks (see the InMemoryDatasets announcement):

Each read is done once (this includes compilation time for DLMReader.jl).
The reported times do not include converting columns to pooled vectors (the same for all solutions).
I report the total time, and use fail when a solution cannot complete a task.
I use a Linux RHEL7 machine with 16 cores and 128GB of memory.
The results are based on the latest version of the benchmarked packages plus the latest PRs submitted to the db-benchmark project.
The OS file cache is freed for polars. The reason is that polars exploits the OS file cache, which significantly improves its reading performance, whereas the cache does not affect the other solutions.

The groupby task timing in seconds - smaller is better

Data	DLMReader	polars	data.table
1e7 - 2e0	18	15	11
1e7 - 1e1	19	13	5
1e7 - 1e2	19	15	3

1e8 - 2e0	29	168	132
1e8 - 1e1	31	167	83
1e8 - 1e2	32	169	53

1e9 - 2e0	173	1624	1766
1e9 - 1e1	174	1605	1137
1e9 - 1e2	153	1640	727

The join task timing in seconds - smaller is better

Data	DLMReader	polars	data.table
1e7	26	28	32
1e8	61	304	441
1e9	269	fail	fail

Storopoli · May 30, 2022, 4:18pm

I don’t get it.
This is a package that reads delimited files, like CSV.jl.
But your benchmarks are for data manipulation stuff: join and groupby.

What am I getting wrong?

vtomar · May 31, 2022, 3:22am

Any reason for not benchmarking against CSV.jl?

xinchin · May 31, 2022, 11:22pm

Why must user-defined informats be registered? Why not use the function itself directly?

Juan · June 1, 2022, 12:25am

Same here,
I was expecting to view reading and writing times.
Maybe he means the time to read the databases used for the different tasks, i.e. with different sizes and missings, without any further processing.

sl-solution · June 1, 2022, 8:02am

The benchmark is only about reading files. The db-benchmark repository contains multiple files that need to be read and analysed, and the benchmarks in this thread focus only on reading the CSV files.

sl-solution · June 1, 2022, 8:23am

I had included CSV.jl at some point, but I dropped it because:

CSV.jl is single-threaded in the db-benchmark scripts, so the comparison may not be fair to CSV.jl.
And it is not a good idea to enable multi-threading in CSV.jl.

nilshg · June 1, 2022, 8:27am

Why is it not a good idea to enable multi-threading in CSV.jl?

sl-solution · June 1, 2022, 8:28am

Because DLMReader does not use the user-defined function directly. It internally creates a new object from the passed function and sends that particular object to the informater. The register_informat function performs this task.

sl-solution · June 1, 2022, 8:37am

You are right;

The presented benchmarks come from db-benchmark. There have been discussions about including a reading task in those benchmarks, and since I had already used them for InMemoryDatasets, I decided to reuse them for this announcement.
There are no missing values in the benchmarked files. This would not change anything for DLMReader, since the package allocates a union with Missing on every read.

sl-solution · June 1, 2022, 9:59am

CSV.jl uses chained vectors from SentinelArrays.jl, and random access in chained vectors is expensive, so it would slow down subsequent operations on data sets.
Enabling multi-threading for CSV.jl significantly increases memory usage.
Additionally, it significantly slows down importing the files for the benchmarks mentioned in this post.
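To illustrate why random access into a chained vector costs more than into a plain Vector, here is a minimal sketch. It is not the SentinelArrays.jl implementation: `Chained` is a hypothetical type, and the only assumption is that the data lives in several separate chunks, so every lookup must first locate the chunk that holds the requested index.

```julia
# Illustrative only: a vector stored as several chunks.
struct Chained{T} <: AbstractVector{T}
    chunks::Vector{Vector{T}}
    ends::Vector{Int}   # cumulative end index of each chunk
end

Chained(chunks::Vector{Vector{T}}) where {T} =
    Chained{T}(chunks, cumsum(length.(chunks)))

Base.size(c::Chained) = (isempty(c.ends) ? 0 : c.ends[end],)

function Base.getindex(c::Chained, i::Int)
    @boundscheck checkbounds(c, i)
    # Extra work compared with a plain Vector: find the chunk first.
    k = searchsortedfirst(c.ends, i)
    offset = k == 1 ? 0 : c.ends[k - 1]
    return c.chunks[k][i - offset]
end

c = Chained([[1, 2], [3], [4, 5, 6]])
c[3]         # 3, found in the second chunk
collect(c)   # [1, 2, 3, 4, 5, 6]
```

Sequential iteration can walk chunk by chunk, but operations that index in arbitrary order pay the chunk lookup on every access, and that cost matters most on large data sets.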

Juan · June 1, 2022, 11:25pm

DLMReader could be used together with InMemoryDatasets.jl to try to beat other solutions (Polars) on the full Database-like ops benchmark

sl-solution · June 2, 2022, 7:32am

One interesting thing I noticed while running these benchmarks is that, for the 1e9 join task, the combination of InMemoryDatasets and DLMReader finishes reading and processing the data long before the other packages realise that they are failing in the reading part!

xinchin · June 2, 2022, 10:22pm

I’m sure there are good reasons for registering informats, but what I meant was: why doesn’t filereader do it automatically?

xinchin · June 2, 2022, 10:28pm

I have never been a fan of CSV.jl’s special array type for reading CSV files, but I never thought it degraded performance!

Juan · June 4, 2022, 12:55am

it’s related with this:

xinchin · June 6, 2022, 12:05am

I’m confused. Should I not use InlineStrings because they’re slow? How is that related to chained vectors and random access?

sl-solution · June 7, 2022, 9:37am

Registering an informat requires compilation, so if the filereader function did this, it would trigger compilation on every call (which is not ideal). However, this also means that redefining a function does not change the definition of an already registered informat.

sl-solution · June 7, 2022, 9:49am

To be precise, the issue I raised about chained vectors is not a problem for small data sets. It becomes a problem when users work with large data sets (or in benchmarks), and mostly for operations that need random getindex. So using InlineStrings is OK; in fact, DLMReader supports InlineStrings out of the box.

sl-solution · June 9, 2022, 7:32am

Due to this issue we are using Parsers for parsing Float64 and Float32.
