I would love some feedback on ViewReader, a tiny package we wrote (basic documentation is on GitHub): https://github.com/rickbeeloo/ViewReader
It uses a buffered reader in combination with the amazing StringViews (thanks @stevengj!) to do basic file processing (reading lines, splitting lines, and parsing numbers) without making new allocations. It's super basic, but it worked very well for our big data.
I asked a question here before about reading a 7 TB file, where the allocations from using eachline significantly slowed down my code.
For now we have only three functions that serve as drop-in replacements for their Base equivalents:
- eachlineV, to iterate over the lines in a file
- splitV, to split lines (only Char delimiters are supported for now)
- parseV, to parse an integer from a string (or actually a StringView)
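To make the idea concrete, here is a minimal sketch of the core trick (not the package's actual internals): once a byte buffer has been filled, hand out zero-copy StringViews into it instead of allocating a new String per line. The helper name `line_views` is hypothetical, for illustration only.

```julia
using StringViews  # StringView wraps an AbstractVector{UInt8} as an AbstractString

# Sketch, assuming a byte buffer already filled (e.g. by readbytes!):
# return zero-copy views of each line instead of allocating Strings.
function line_views(bytes::AbstractVector{UInt8})
    lines = StringView[]
    start = firstindex(bytes)
    for i in eachindex(bytes)
        if bytes[i] == UInt8('\n')
            push!(lines, StringView(@view bytes[start:i-1]))  # no String allocation
            start = i + 1
        end
    end
    return lines
end

vs = line_views(codeunits("12\t34\n56\t78\n"))
```

Because a StringView is an AbstractString, the resulting views compare equal to ordinary strings (`vs[1] == "12\t34"`) and work with functions like startswith, while still pointing into the original buffer.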
Here is a figure comparing Base eachline vs eachlineV (X-axis: number of lines in the test file; Y-axis: runtime in seconds):
And here is a benchmark of the other functions on my PC:

```
Reading lines
Base eachline: 1.437 ms   (40028 allocations: 1.30 MiB)
View eachline: 296.062 μs (13 allocations: 20.30 KiB)

Splitting lines
Base split:    6.174 ms   (120028 allocations: 11.68 MiB)
View split:    1.073 ms   (13 allocations: 20.30 KiB)

Number parse
Base parse:    6.114 ms   (90016 allocations: 8.62 MiB)
View parse:    1.924 ms   (13 allocations: 20.32 KiB)
```
The speed improvement mainly depends on the buffer size used: it pays off as long as the time spent allocating the buffer does not exceed the time needed to read the file. On average we see a speedup of 5-8x. For our big data, going from 8 hours to 1 hour is a huge gain.
We wonder, for example:
- Are there already existing packages that implement this, which we missed?
- Does anybody think this would be useful to add to the Julia package ecosystem and develop further?
I didn’t know about Parsers.jl, cool! It also takes UInt8 vectors, so that would indeed be much better than reinventing the wheel.
We didn’t really intend to improve any of the CSV readers. In our field, bioinformatics, we very often have to filter the data first and then parse something from specific lines, like:
```julia
output = Int64[]
for line in eachline(file)
    if startswith(line, "X")
        data = split(line, '\t')
        push!(output, parse(Int64, data[1]))
    end
end
```
I think CSV.jl would indeed be much more convenient for pure CSV files. I'm not familiar with what the best syntax is, but as a comparison:
```julia
# Just loading the numbs.txt file with CSV (without summing the column)
@btime file = CSV.File(open("../data/numbs.txt"), buffer_in_memory=true, delim='\t', silencewarnings=true)
#   5.201 ms (401 allocations: 997.03 KiB)
```
```julia
function viewParse(f::String)
    c = 0
    for line in eachlineV(f)
        for item in splitV(line, '\t')
            c += parseV(UInt32, item)
        end
    end
    return c
end

@btime viewParse("../data/numbs.txt")
#   1.730 ms (13 allocations: 20.32 KiB)
```
I didn’t mean filtering a CSV; I mean things like the if startswith(line, "X") in the example I gave. So we have files like:
```
A abcacbbacbabcbabcbabc
B abcbabcbacbacb 10 103i 1212
C ixiixixixixixixixixix
```
And then we want to say: "if the line starts with B, parse the second element to an Int and add it to our sum". That's of course just a single example.
I think using a buffered reader with your StringViews is a very minimal change to regular operations using eachline that saves a lot of allocations/time.
I just added a getindex for splitV so it works the same as Base split; however, it uses the iterator underneath so it doesn't have to allocate the array. So this can now be done with:
```julia
function sumB(path)  # wrapped in a function so the update of `c` inside the loop is valid
    c = 0
    for line in eachlineV(path)
        if startswith(line, 'B')
            data = splitV(line, '\t')
            c += Parsers.parse(UInt32, data[2])
        end
    end
    return c
end
```
Note that this is something of an abuse of the implicit contract of getindex that it is O(1).
The generic way to get the n-th element of an Iterator x is something like first(Iterators.drop(x, n-1)). (I feel like we should have an Iterators.nth(x, n) API for this?)
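For illustration, such a helper is a one-liner over Base iterators (the name `nth` is hypothetical; it is not in Base):

```julia
# Hypothetical helper: n-th element of any iterator, without collecting
# it into an array first.
nth(itr, n) = first(Iterators.drop(itr, n - 1))

nth((x^2 for x in 1:10), 3)  # third element of a lazy generator → 9
```

Unlike a getindex, this makes the O(n) cost of reaching the n-th element explicit at the call site.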
This is a very nice and useful package. I wrote a PR so that your package can work with compressed files.
As I said in the PR: I'm dealing with very large files as well, which are almost always compressed with gzip, and very often I can't afford to load the whole file's content at once before processing the data, so I'm always looking for the fastest way to read files line by line.
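As a sketch of that streaming pattern (assuming CodecZlib.jl; the small file written here is only so the example is self-contained):

```julia
using CodecZlib  # GzipCompressorStream / GzipDecompressorStream

# Write a small gzipped file so the example is self-contained.
path = tempname() * ".gz"
open(path, "w") do io
    gz = GzipCompressorStream(io)
    write(gz, "A 1\nB 2\n")
    close(gz)  # flushes and finalizes the gzip stream
end

# The streaming pattern: wrap the file IO in a decompressing stream and
# iterate lines lazily, so the whole file is never decompressed in memory.
stream = GzipDecompressorStream(open(path))
lines = collect(eachline(stream))
close(stream)
```

In real use you would of course process each line inside the eachline loop rather than collect them.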
I was inspired by the eachline code in Base; maybe the whole package could be fully integrated by overloading all the Base functions concerned with strings (eachline, split, etc.) to play nicely with StringViews.