[ANN] DLMReader 0.4.5 with one Big Enhancement

I am excited to announce that DLMReader version 0.4.5 has been released. The new release includes several performance enhancements and bug fixes. For instance, reading multiple observations per line and type detection now allocate less and perform better.

However, the biggest enhancement of the 0.4.5 release is the significant reduction in the time to the first read. On macOS and Linux, I have managed to reduce the package latency by more than a factor of 4 when reading small files with homogeneous columns. On Windows we use Parsers.jl for parsing floats (see issue #5), which causes a little lag compared to other OSs, but Windows users will still notice a significant reduction in the time to the first read.

To achieve this, I have implemented a new algorithm for parsing small files. The new algorithm consumes more memory than the current high performance algorithm; however, since it is only used for small files, the extra memory usage is unobtrusive. By default, the filereader function calls the new algorithm for files smaller than 64 MiB and automatically switches to the high performance algorithm for larger files.
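As a rough illustration of the size-based switch described above, here is a minimal sketch. The names `SMALL_FILE_LIMIT` and `choose_algorithm` are hypothetical and only mirror the "smaller than 64 MiB" rule; they are not DLMReader internals.

```julia
# Hypothetical sketch of a size-based switch; not DLMReader's actual code.
const SMALL_FILE_LIMIT = 64 * 1024^2  # 64 MiB expressed in bytes

# Pick an algorithm label from a size in bytes.
function choose_algorithm(nbytes::Integer; limit::Integer = SMALL_FILE_LIMIT)
    return nbytes < limit ? :low_latency : :high_performance
end

# Convenience method: look up the size of a file on disk first.
function choose_algorithm(path::AbstractString; limit::Integer = SMALL_FILE_LIMIT)
    return choose_algorithm(filesize(path); limit = limit)
end

println(choose_algorithm(10 * 1024^2))   # a 10 MiB input
println(choose_algorithm(200 * 1024^2))  # a 200 MiB input
```

In this sketch the comparison is strict, so an input of exactly 64 MiB already takes the high performance path.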

The following code demonstrates the latency of the new release in a fresh Julia session (macOS - Julia 1.7.3). Note that more than 20% of the reported time is spent compiling/inferring non-DLMReader functions, so in real life scenarios users can expect even less latency.

julia> jcmd = Base.julia_cmd();

julia> cmd = """
       start_time = time()
       using DLMReader
       filereader(IOBuffer("x1,x2\n1,2\n"))
       println(time() - start_time)
       """;
julia> run(`$jcmd -e $cmd`); # 2.135 sec (homogeneous types)

julia> cmd = """
       start_time = time()
       using DLMReader
       filereader(IOBuffer("x1,x2\n1,2.0\n"))
       println(time() - start_time)
       """;
julia> run(`$jcmd -e $cmd`); # 2.644 sec (heterogeneous types)

julia> cmd = """
       using InMemoryDatasets
       ds = Dataset(rand(10,2), :auto)
       start_time = time()
       using DLMReader
       filereader(IOBuffer("x1,x2\n1,2\n"))
       println(time() - start_time)
       """;
julia> run(`$jcmd -e $cmd`); # 1.699 sec (homogeneous types/real life scenarios)

That looks great! A major issue of CSV.jl is long time-to-first-read, and DLMReader seems to fare much better on this front.


:clap: but why still 1.699 sec for such a simple function?


I agree that it is not yet perfect[1]; however, note that the function is not as simple as it looks. The filereader function consists of multiple functions, and Julia needs to compile/infer all of them into a rather big function even for the simplest case.
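To make the compile/infer cost concrete, here is a small sketch that times the first and second call of a function in the same session. The functions `split_fields`, `parse_fields` and `read_line` are hypothetical stand-ins for a reader composed of several smaller functions; they are not part of DLMReader.

```julia
# Hypothetical stand-ins for a reader built from several smaller functions.
split_fields(line) = split(line, ',')
parse_fields(fields) = [parse(Float64, f) for f in fields]
read_line(line) = parse_fields(split_fields(line))

# The first call includes compilation/inference of everything it touches.
first_call = @elapsed read_line("1,2")
# The second call reuses the already compiled code.
second_call = @elapsed read_line("1,2")

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

The first timing is typically much larger than the second, and the gap grows with the number of functions that have to be compiled.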


  1. And I plan to work on this further to see how much I can improve it.