[ANN] DLMReader 0.4.5 with one Big Enhancement

sl-solution · July 4, 2022, 8:22am

I am excited to announce that DLMReader version 0.4.5 has been released. The new release includes several performance enhancements and bug fixes. For instance, reading multiple observations per line and type detection now allocate less and perform better.

However, the biggest enhancement in the 0.4.5 release is a significant reduction in the time to first read. On macOS and Linux, I have managed to cut the package latency by more than a factor of 4 when reading small files with homogeneous columns. On Windows we use Parsers.jl for parsing floats (see issue #5), which adds a little lag compared to other OSs, but Windows users will still notice a significant reduction in the time to first read.

To achieve this, I have implemented a new algorithm for parsing small files. It consumes more memory than the existing high-performance algorithm, but since it is only used for small files, the extra memory usage is unobtrusive. By default, the filereader function calls the new algorithm for files smaller than 64 MiB and automatically switches to the high-performance algorithm for larger files.
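
The size-based switch described above can be pictured with a minimal sketch. This is not DLMReader's actual code: the names `SMALL_FILE_LIMIT`, `read_small`, `read_large`, `choose_reader`, and `read_delimited` are hypothetical, and the two readers are stand-ins that only report which path was taken. The only detail taken from the announcement is the 64 MiB default threshold.

```julia
# Hypothetical sketch: none of these names come from DLMReader itself.
const SMALL_FILE_LIMIT = 64 * 2^20   # 64 MiB in bytes (the default threshold)

# Stand-ins for the two code paths; each just reports which path was taken.
read_small(path) = :low_latency        # more memory, less compilation latency
read_large(path) = :high_performance   # less memory, tuned for throughput

# Pick a strategy from the file size alone.
choose_reader(nbytes::Integer; limit::Integer = SMALL_FILE_LIMIT) =
    nbytes < limit ? read_small : read_large

function read_delimited(path::AbstractString; limit::Integer = SMALL_FILE_LIMIT)
    reader = choose_reader(filesize(path); limit = limit)
    return reader(path)
end

# Usage: a tiny temporary file falls below the limit and takes the small-file path.
path, io = mktemp()
write(io, "x1,x2\n1,2\n")
close(io)
println(read_delimited(path))
```

Keeping the decision in a separate function like `choose_reader` makes the threshold easy to test without creating large files.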

The following code demonstrates the latency of the new release in a fresh Julia session (macOS, Julia 1.7.3). Note that more than 20% of the reported time is spent compiling/inferring non-DLMReader functions, so in real-life scenarios users can expect even less latency.

julia> jcmd = Base.julia_cmd();

julia> cmd = """
       start_time = time()
       using DLMReader
       filereader(IOBuffer("x1,x2\n1,2\n"))
       println(time() - start_time)
       """;

julia> run(`$jcmd -e $cmd`); # 2.135 sec (homogeneous types)

julia> cmd = """
       start_time = time()
       using DLMReader
       filereader(IOBuffer("x1,x2\n1,2.0\n"))
       println(time() - start_time)
       """;

julia> run(`$jcmd -e $cmd`); # 2.644 sec (heterogeneous types)

julia> cmd = """
       using InMemoryDatasets
       ds = Dataset(rand(10,2), :auto)
       start_time = time()
       using DLMReader
       filereader(IOBuffer("x1,x2\n1,2\n"))
       println(time() - start_time)
       """;

julia> run(`$jcmd -e $cmd`); # 1.699 sec (homogeneous types, real-life scenario)

aplavin · July 4, 2022, 8:57am

That looks great! A major issue of CSV.jl is long time-to-first-read, and DLMReader seems to fare much better on this front.

xinchin · July 9, 2022, 12:23am

But why does it still take 1.699 sec for such a simple function?

sl-solution · July 11, 2022, 8:43am

I agree that it is not yet perfect (and I plan to work on this further to see how much I can improve it). However, note that the function is not as simple as it looks. The filereader function consists of multiple functions, and Julia needs to compile/infer all of them into a rather big function even for the simplest case.
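
To illustrate the point about compilation, here is a minimal sketch that times the first and second call of a small composed function. The functions `split_fields`, `parse_fields`, and `parse_line` are made up for this example and have nothing to do with DLMReader's internals. The first call pays for compiling every function in the chain, while the second reuses the compiled code.

```julia
# Illustrative only: a tiny stand-in pipeline, not DLMReader's code.
split_fields(line) = split(line, ',')
parse_fields(fields) = [parse(Float64, f) for f in fields]
parse_line(line) = parse_fields(split_fields(line))

# The first call includes compiling/inferring all three functions.
t_first = @elapsed parse_line("1,2.5,3")

# The second call with the same argument types reuses the compiled code.
t_second = @elapsed parse_line("1,2.5,3")

println("first call: ", t_first, " s; second call: ", t_second, " s")
```

The exact numbers depend on the machine and Julia version, so none are quoted here.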
