Package to read/process lines without new allocations

Hey all!

I would love to ask for some feedback on the tiny package ViewReader that we wrote (basic documentation is on GitHub). It uses a buffered reader in combination with the amazing StringViews (thanks @stevengj!) to do basic file processing (reading lines, splitting lines, and parsing numbers) without making new allocations. It’s super basic, but it worked very well for our big data.

add https://github.com/rickbeeloo/ViewReader

I asked a question here before about reading a 7 TB file, where the allocations from eachline significantly slowed down my code.

For now we have three functions as drop-in replacements for their Base equivalents (see the sketch after this list):

  • eachlineV, to iterate over the lines in a file
  • splitV, to split lines, although only Char delimiters are supported for now
  • parseV, to parse an integer from a string (or actually a StringView)
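
To illustrate the drop-in nature, here is a minimal sketch (assuming the API above; file.txt is a hypothetical input):

using ViewReader

# Base: allocates a new String for every line
for line in eachline("file.txt")
    # process line...
end

# ViewReader: yields StringViews into a reusable buffer instead
for line in eachlineV("file.txt")
    # process line...
end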

Here is a figure comparing Base eachline vs eachlineV:

  • X-axis: number of lines in the test file
  • Y-axis: runtime in seconds

[figure: reader_benchmark]

And here is a benchmark of the other functions on my PC:

Reading lines
Base eachline:   1.437 ms (40028 allocations: 1.30 MiB)
View eachline:   296.062 μs (13 allocations: 20.30 KiB)

Splitting lines
Base split:   6.174 ms (120028 allocations: 11.68 MiB)
View split:   1.073 ms (13 allocations: 20.30 KiB)

Number parse
Base parse:   6.114 ms (90016 allocations: 8.62 MiB)
View parse:   1.924 ms (13 allocations: 20.32 KiB)

(here is the file with the code we ran)

The speed improvement mainly depends on the buffer size used: the gain holds as long as the time spent allocating the buffer does not exceed the time needed to read the file. On average we see a 5-8x speedup. For our big data, going from 8 hours to 1 hour is a huge gain.
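
For anyone curious, here is a rough sketch of the underlying idea (not ViewReader’s actual code, just the technique): read chunks into one reusable byte buffer and hand out StringViews into it.

using StringViews

# Sketch of the buffered-view idea: one reusable buffer, each line exposed
# as a zero-copy StringView. For brevity this skips lines that straddle a
# chunk boundary and a final line without a trailing newline; a real
# reader has to carry those over.
function count_comment_lines(path::String; bufsize::Int = 1 << 16)
    buf = Vector{UInt8}(undef, bufsize)
    nfound = 0
    open(path) do io
        while !eof(io)
            n = readbytes!(io, buf)
            start = 1
            for i in 1:n
                if buf[i] == UInt8('\n')
                    line = StringView(view(buf, start:i-1))  # no String allocated
                    startswith(line, '#') && (nfound += 1)
                    start = i + 1
                end
            end
        end
    end
    return nfound
end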

We wonder, for example:

  • Are there existing packages that implement this that we missed?
  • Does somebody think this would be useful to add to the Julia package ecosystem / to develop further?
  • Anything else that might help to improve it 🙂

Thanks all!

Have you tried the parsers from Parsers.jl? Also, there is CSV.jl if you are reading delimited files.

Hey!

I didn’t know about Parsers.jl, cool! It also takes UInt8 vectors, so that would indeed be much better than reinventing the wheel.

We didn’t really intend to improve on any of the CSV readers. In our field, bioinformatics, we very often have to filter the data first and then parse something from specific lines, like:

output = Int64[]
for line in eachline(file)
    if startswith(line, "X")
        data = split(line, '\t')
        push!(output, parse(Int64, data[1]))
    end
end

I think CSV.jl would indeed be much more convenient for pure CSV files. I’m not familiar with the best syntax, but as a comparison:

# Just loading the numbs.txt file in CSV (without summing the column)
@btime file = CSV.File(open("../data/numbs.txt"), buffer_in_memory=true, delim='\t', silencewarnings=true)
5.201 ms (401 allocations: 997.03 KiB)

function viewParse(f::String) 
    c = 0
    for line in eachlineV(f)
        for item in splitV(line, '\t')
            c += parseV(UInt32, item)
        end
    end
    return c
end

@btime viewParse("../data/numbs.txt")
 1.730 ms (13 allocations: 20.32 KiB)

Did you try reading with the CSV keyword argument:

types=UInt32
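
That is, presumably something like this, adapting the CSV.File call above:

@btime file = CSV.File(open("../data/numbs.txt"), buffer_in_memory=true, delim='\t', silencewarnings=true, types=UInt32)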

CSV.jl provides a CSV.Rows iterator that you can use to perform filtering, either with your own loop or via a package like Query.jl. See also this discussion: CSV-row filtering when reading · Issue #503 · JuliaData/CSV.jl · GitHub
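
For example, a streaming filter with CSV.Rows might look roughly like this (a sketch; it assumes a header-less tab-delimited file, and the column positions are hypothetical):

using CSV

# CSV.Rows streams the file and parses columns as String by default
total = 0
for row in CSV.Rows("data.txt"; delim='\t', header=false)
    if startswith(row[1], "B")          # filter on the first column
        total += parse(UInt32, row[3])  # parse only the rows we keep
    end
end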

That slightly reduces the runtime, to 4.670 ms (366 allocations: 819.64 KiB).

I didn’t mean filtering a CSV; I meant, as in the example I gave, things like if startswith(line, "X"). So we have files like:

A abcacbbacbabcbabcbabc
B abcbabcbacbacb 10 103i 1212
C  ixiixixixixixixixixix

And then we want to say: “if the line starts with B, parse the second element to an Int and add it to our sum”. That’s of course just a single example.

I think using a buffered reader with your StringViews is a very minimal change to regular operations using eachline that saves a lot of allocations/time.

I just added a getindex for splitV so it works the same as Base’s split; however, it uses the iterator underneath to avoid allocating the array. So this can now be done with:

c = 0
for line in eachlineV("file.txt")
    if startswith(line, 'B')
        data = splitV(line, '\t')
        c += Parsers.parse(UInt32, data[2])
    end
end
c

Note that this is something of an abuse of the implicit contract of getindex that it is O(1).

The generic way to get the n-th element of an Iterator x is something like first(Iterators.drop(x, n-1)). (I feel like we should have an Iterators.nth(x, n) API for this?)
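
As a sketch, such a helper could be written as follows (nth here is a hypothetical name, not a Base function):

# n-th element of any iterator; O(n) by necessity, since a generic
# iterator has no random access
nth(itr, n) = first(Iterators.drop(itr, n - 1))

nth((10, 20, 30), 2)  # returns 20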

Hmm, I was also thinking: this operation would only make sense if

    1. you are interested in only one element in a line (or iterator, for that matter), or
    2. you could remember the state of the last searched index.

Because it would be quite a “waste” to ask for .nth(x, 100) and then .nth(x, 101) if you have to start from state 0 again.
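
Incidentally, Base’s Iterators.Stateful already gives you that kind of resumable lookup, since it consumes elements as they are taken:

it = Iterators.Stateful(split("a\tb\tc\td\te", '\t'))
first(Iterators.drop(it, 1))  # skips "a", returns "b"
first(it)                     # resumes where we left off: returns "c"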

Some discussion about that here: Iterator slicing/indexing · Issue #26 · JuliaCollections/IterTools.jl · GitHub

Hi,

This is a very nice and useful package. I wrote a PR so that your package can work with compressed files.

As said in the PR: I’m dealing with very large files as well, which are almost always compressed with gzip, and very often I can’t afford to load the whole file content at once before processing the data, so I’m always looking for the fastest way to read files line by line.
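
For reference, the same streaming pattern with plain Base IO over a gzipped file looks like this (using CodecZlib; the filename is hypothetical):

using CodecZlib  # provides GzipDecompressorStream

# Stream a gzipped file line by line, without decompressing it to disk
# or loading it fully into memory
open("data.txt.gz") do io
    for line in eachline(GzipDecompressorStream(io))
        # process line...
    end
end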

I got inspired by the eachline code in Base; maybe the whole package could be fully integrated by overloading all the Base functions concerned with strings (eachline, split, etc.) to play nicely with StringViews.

Amaël

Oh, thanks. So nice to see that such a useful feature will be natively part of Julia.