[ANN] DLMReader: the most versatile Julia package for reading delimited files yet

I am excited to announce DLMReader, a Julia package for reading delimited files.

Introduction

DLMReader.jl is a multithreaded package for reading delimited files, and it is designed for Julia 1.6+ (64-bit OS). The package is designed as a trade-off between:

  1. Flexibility
  2. Memory efficiency
  3. Low compilation time
  4. High performance

The package scans the entire file (when limit is set, it may scan the input file twice) to gather information about the structure of the delimited file. This helps DLMReader scale well to huge files, although it is less effective for small files. Once the information about the file structure has been obtained, it is sent to an internal distributor. The distributor starts multiple threads (it uses one thread when threads = false or when the file is small) and distributes the input file among them.

Each thread allocates buffsize bytes of memory and starts processing the lines of its chunk. Each line is stored in a custom type called “LineBuffer” and is sent to the line_informat to be pre-processed for parsing. Each thread then locates each field in the line being processed and sends the raw text of the field to its value informater; the result is passed to the main parser for parsing and storing. DLMReader uses the Julia Base functions for parsing Integers, Reals, and Strings, the Dates package for parsing DateTimes, and the UUIDs package for parsing UUIDs.
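The chunk distribution described above can be pictured with a small sketch in plain Julia. Everything here is an assumption made for illustration: the helper names `line_aligned_chunks`, `count_fields`, and `process` are hypothetical, and the code is not DLMReader's internal distributor. It only shows the general idea of cutting a byte buffer at line boundaries and handing each piece to its own task.

```julia
# Hypothetical sketch, not DLMReader's code: split a byte buffer into
# line-aligned chunks and process every chunk on its own task.

# Return index ranges that cover `data`; each range ends on a newline
# or at the end of the data.
function line_aligned_chunks(data::Vector{UInt8}, nchunks::Int)
    n = length(data)
    ranges = UnitRange{Int}[]
    approx = max(1, cld(n, nchunks))
    start = 1
    while start <= n
        stop = min(start + approx - 1, n)
        # Extend the chunk until it ends on a newline.
        while stop < n && data[stop] != UInt8('\n')
            stop += 1
        end
        push!(ranges, start:stop)
        start = stop + 1
    end
    return ranges
end

# Count the delimited fields in every line of one chunk.
function count_fields(data::Vector{UInt8}, r::UnitRange{Int}, delim::UInt8)
    counts = Int[]
    nfields = 1
    for i in r
        b = data[i]
        if b == delim
            nfields += 1
        elseif b == UInt8('\n')
            push!(counts, nfields)
            nfields = 1
        end
    end
    # Handle a final line that has no trailing newline.
    if !isempty(r) && data[last(r)] != UInt8('\n')
        push!(counts, nfields)
    end
    return counts
end

# Spawn one task per chunk and concatenate the per-chunk results in order.
function process(data::Vector{UInt8}; nchunks::Int = Threads.nthreads(),
                 delim::UInt8 = UInt8(','))
    ranges = line_aligned_chunks(data, nchunks)
    tasks = [Threads.@spawn count_fields(data, r, delim) for r in ranges]
    return reduce(vcat, fetch.(tasks); init = Int[])
end

data = Vector{UInt8}("a,b,c\n1,2,3\n4,5,6\n7,8,9")
field_counts = process(data; nchunks = 2)
```

Because every chunk ends on a line boundary, no task ever sees half a line, which is what lets the chunks be processed independently.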

Features

DLMReader.jl has some interesting features that distinguish it from other packages for reading delimited files. In what follows, I will go through some of those features:

  • Informats: The DLMReader package uses informats to call a class of functions on the raw text before parsing its value(s). This provides a flexible and extensible approach to parsing values with special patterns. For instance, the predefined informat COMMA! allows users to read a numeric column with a “thousands separator” and/or the dollar sign, e.g. using this informat, raw text like “$12,000.00” will be parsed as “12000.00”. Moreover, informats support function composition, e.g. COMMA! ∘ ACC! parses “$(12,000.00)” as “-12000.00”, i.e. ACC! is applied first and then COMMA! is applied to its result.

    • Additionally, informats can be applied to the whole line before individual values are processed.

  • Fixed-width text: If users pass the column locations via the fixed keyword argument, the package reads those columns in fixed-width format. For instance, passing fixed = Dict(1=>1:1, 2=>2:2) parses “10” as “[1,0]”. Mixing fixed-width format and delimited format is also allowed.

  • Multiple observations per line: The package allows reading more than one observation per line. This can be done by passing the multiple_obs = true keyword argument. The multithreading feature (plus some other features) will be switched off if this option is set.

  • Fast file writer: The DLMReader package exploits the byrow function from InMemoryDatasets.jl to write delimited files to disk. This enables DLMReader to convert values to strings using multiple threads.

  • Alternative delimiters: Users can pass a vector of delimiters to the function. In this case, filereader treats any of the passed delimiters as a field delimiter.
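To make three of these features concrete, here is a minimal sketch in plain Julia. It does not call DLMReader at all: `strip_money` and `accounting_negative` are hypothetical stand-ins for what COMMA! and ACC! are described as doing, and the fixed-width and multi-delimiter parts only mimic the described behaviour with Base functions.

```julia
# Illustrative stand-ins only: these are NOT DLMReader's COMMA! and ACC!.

# Drop dollar signs and thousands separators, e.g. "12,000.00" -> "12000.00".
strip_money(s::AbstractString) = replace(s, r"[$,]" => "")

# Accounting notation: move the parentheses into a leading minus sign.
function accounting_negative(s::AbstractString)
    m = match(r"^(\$?)\((.*)\)$", s)
    return m === nothing ? String(s) : string("-", m.captures[1], m.captures[2])
end

# Composition applies the right-hand function first, as described above.
money_informat = strip_money ∘ accounting_negative
cleaned = money_informat(raw"$(12,000.00)")   # raw"" avoids `$` interpolation
value = parse(Float64, cleaned)

# Fixed-width fields: slice the line by column ranges before parsing.
fixed = Dict(1 => 1:1, 2 => 2:2)
line = "10"
fixed_values = [parse(Int, line[fixed[i]]) for i in 1:2]

# Alternative delimiters: any character in the vector separates fields.
fields = split("a,b;c", [',', ';'])
```

The point of the composition is that each step stays a small text-to-text function, so new patterns can be handled by chaining existing steps rather than by writing a new parser.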

Benchmarks

The following benchmarks present preliminary results for DLMReader.jl’s performance compared to the polars and data.table packages. The benchmarks are based on the db-benchmark repository, i.e. they report the time that each package spends reading each file in that repository.

Some remarks about the presented benchmarks (see the InMemoryDatasets announcement):

  • Each read is done once (this includes compilation time for DLMReader.jl).

  • The reported times do not include converting columns to pooled vectors (the same for all solutions).

  • I report the total time, and use “fail” when a solution cannot complete a task.

  • I use a Linux RHEL7 machine with 16 cores and 128GB of memory.

  • The results are based on the latest version of the benchmarked packages + the latest PRs submitted to the db-benchmark project.

  • The OS file cache is freed for polars. The reason is that polars exploits the OS file cache, which significantly improves its reading performance; freeing the cache does not affect the other solutions.

The groupby task timing in seconds - smaller is better

Data         DLMReader  polars    DT
1e7 - 2e0           18      15    11
1e7 - 1e1           19      13     5
1e7 - 1e2           19      15     3
1e8 - 2e0           29     168   132
1e8 - 1e1           31     167    83
1e8 - 1e2           32     169    53
1e9 - 2e0          173    1624  1766
1e9 - 1e1          174    1605  1137
1e9 - 1e2          153    1640   727

The join task timing in seconds - smaller is better

Data  DLMReader  polars    DT
1e7          26      28    32
1e8          61     304   441
1e9         269    fail  fail

I don’t get it.
This is a package that reads delimited files, like CSV.jl.
But your benchmarks are for data manipulation stuff: join and groupby.

What am I getting wrong?


Any reason for not benchmarking against CSV.jl?


Why must user-defined informats be registered? Why not use the function itself directly?

Same here,
I was expecting to view reading and writing times.
Maybe he means the time to read the databases used for the different tasks, i.e. with different sizes and missings, without any further processing.

The benchmark is just about reading files. The db-benchmark repository contains multiple files that need to be read and analysed, and the benchmarks in this thread focus only on reading the csv files.

I had included CSV.jl at some point, but I dropped it because:

  • CSV.jl is single-threaded in the db-benchmark scripts, so the comparison may not be fair to CSV.jl.
  • And it is not a good idea to enable multi-threading in CSV.jl.

Why is it not a good idea to enable multi-threading in CSV.jl?

Because DLMReader does not use the user-defined function directly. It internally creates a new object from the passed function and sends that particular object to the informater. The register_informat function does this task.


You are right:

  • The presented benchmarks are related to db-benchmark: there have been some discussions about including a reading task in those benchmarks, and since I had already used them for InMemoryDatasets, I decided to reuse them for this announcement.
  • There are no missing values in the benchmarked files. This would not change anything for DLMReader, since the package allocates a Union with Missing on every read.
  • CSV.jl uses chained vectors from SentinelArrays.jl, and random access in chained vectors is expensive; thus, it would slow down subsequent operations on data sets.
  • Enabling multi-threading for CSV.jl significantly increases the memory usage.
  • Additionally, it significantly slows down importing the files for the benchmarks mentioned in this post.

DLMReader could be used together with InMemoryDatasets.jl to try to beat other solutions (Polars) on the full Database-like ops benchmark


One interesting thing that I noticed during these benchmarks is that, for the 1e9 join task, the combination of InMemoryDatasets and DLMReader finishes reading and processing the data long before the other packages realise that they are failing in the reading part!


I’m sure there are good reasons for registering informats, but I meant: why doesn’t filereader do it automatically?

I’ve never been a fan of CSV.jl’s special array type for reading CSV files, but I never thought it degrades performance!!! :open_mouth:

It’s related to this:

I’m confused?? :confused: Should I not use InlineStrings because they’re slow?? How is that related to chained vectors and random access?

Registering an informat needs compilation; thus, if the filereader function did this, it would trigger compilation every time (which is not ideal). However, this also means that redefining a function does not change the definition of an already registered informat.


To be precise, my comment about chained vectors is not an issue for small data sets. It becomes a problem when users work with large data sets (or in benchmarks), and mostly for operations that need random getindex. So using InlineStrings is OK; in fact, DLMReader supports InlineStrings out of the box.
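A rough way to see where the extra cost of random getindex comes from is the sketch below. `ChunkedVector` is a hypothetical, minimal type written for this post; it is loosely modelled on the idea of a chained vector and is not the ChainedVector implementation from SentinelArrays.jl. The only point it makes is that each indexed read must first find the chunk that holds the element.

```julia
# Hypothetical minimal chained vector; NOT the SentinelArrays.jl implementation.
struct ChunkedVector{T} <: AbstractVector{T}
    chunks::Vector{Vector{T}}
    ends::Vector{Int}   # cumulative end index of every chunk
end

# Build the cumulative end indices once, at construction time.
ChunkedVector(chunks::Vector{Vector{T}}) where {T} =
    ChunkedVector{T}(chunks, cumsum(length.(chunks)))

Base.size(v::ChunkedVector) = (isempty(v.ends) ? 0 : v.ends[end],)

function Base.getindex(v::ChunkedVector, i::Int)
    @boundscheck checkbounds(v, i)
    # Extra work compared with a plain Vector: locate the chunk first.
    k = searchsortedfirst(v.ends, i)
    offset = k == 1 ? 0 : v.ends[k - 1]
    return v.chunks[k][i - offset]
end

v = ChunkedVector([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
fourth = v[4]   # found in the second chunk
```

For a handful of reads the lookup is negligible, which matches the remark that small data sets are unaffected; it only adds up when an operation performs very many scattered reads.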

Due to this issue we are using Parsers for parsing Float64 and Float32.