I would love some feedback on ViewReader, a tiny package we wrote (basic documentation is on GitHub): https://github.com/rickbeeloo/ViewReader
It uses a buffered reader in combination with the amazing StringViews (thanks @stevengj!) to do basic file processing (reading lines, splitting lines, and parsing numbers) without making new allocations. It's super basic, but it worked very well for our big data.
I asked a question here before about reading a 7 TB file, where the allocations from using eachline significantly slowed down my code.
For now we have only three functions that serve as drop-in replacements for their Base equivalents:
- eachlineV, to iterate over the lines in a file
- splitV, to split lines (only Char delimiters are supported for now)
- parseV, to parse an integer from a string (or actually a StringView)
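To make the idea concrete, here is a minimal sketch of the core trick (not the package's actual internals): once a byte buffer has been filled, hand out zero-copy StringViews into it instead of allocating a new String per line. The helper name `line_views` is hypothetical, for illustration only.

```julia
using StringViews  # StringView wraps an AbstractVector{UInt8} as an AbstractString

# Sketch, assuming a byte buffer already filled (e.g. by readbytes!):
# return zero-copy views of each line instead of allocating Strings.
function line_views(bytes::AbstractVector{UInt8})
    lines = StringView[]
    start = firstindex(bytes)
    for i in eachindex(bytes)
        if bytes[i] == UInt8('\n')
            push!(lines, StringView(@view bytes[start:i-1]))  # no String allocation
            start = i + 1
        end
    end
    return lines
end

vs = line_views(codeunits("12\t34\n56\t78\n"))
```

Because a StringView is an AbstractString, the resulting views compare equal to ordinary strings (`vs[1] == "12\t34"`) and work with functions like startswith, while still pointing into the original buffer.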
Here is a figure comparing Base eachline vs eachlineV (X-axis: number of lines in the test file; Y-axis: runtime in seconds):
And here is a benchmark of the other functions on my PC:

```
Reading lines
Base eachline: 1.437 ms   (40028 allocations: 1.30 MiB)
View eachline: 296.062 μs (13 allocations: 20.30 KiB)

Splitting lines
Base split:    6.174 ms   (120028 allocations: 11.68 MiB)
View split:    1.073 ms   (13 allocations: 20.30 KiB)

Number parse
Base parse:    6.114 ms   (90016 allocations: 8.62 MiB)
View parse:    1.924 ms   (13 allocations: 20.32 KiB)
```
The speed improvement mainly depends on the buffer size used: it pays off as long as the time spent allocating the buffer does not exceed the time needed to read the file. On average we see a speedup of 5-8x. For our big data, going from 8 hours to 1 hour is a huge gain.
We wonder, for example:
- Are there already existing packages that implement this, which we missed?
- Does anybody think this would be useful to add to the Julia package ecosystem and develop further?
I didn’t know about Parsers.jl, cool! It also takes UInt8 vectors, so that would indeed be much better than reinventing the wheel.
We didn’t really intend to improve any of the CSV readers. In our field, bioinformatics, we very often have to filter the data first and then parse something from specific lines, like:
```julia
output = Int64[]
for line in eachline(file)
    if startswith(line, "X")
        data = split(line, '\t')
        push!(output, parse(Int64, data[1]))
    end
end
```
I think CSV.jl would indeed be much more convenient for pure CSV files. I'm not familiar with what the best syntax is, but as a comparison:
```julia
# Just loading the numbs.txt file with CSV (without summing the column)
@btime file = CSV.File(open("../data/numbs.txt"), buffer_in_memory=true, delim='\t', silencewarnings=true)
#   5.201 ms (401 allocations: 997.03 KiB)
```
```julia
function viewParse(f::String)
    c = 0
    for line in eachlineV(f)
        for item in splitV(line, '\t')
            c += parseV(UInt32, item)
        end
    end
    return c
end

@btime viewParse("../data/numbs.txt")
#   1.730 ms (13 allocations: 20.32 KiB)
```
I didn’t mean filtering a CSV; I mean things like the if startswith(line, "X") in the example I gave. So we have files like:
```
A abcacbbacbabcbabcbabc
B abcbabcbacbacb 10 103i 1212
C ixiixixixixixixixixix
```
And then we want to say: "if the line starts with B, parse the second element to an Int and add it to our sum". That's of course just a single example.
I think using a buffered reader with your StringViews is a very minimal change to regular operations using eachline that saves a lot of allocations/time.
I just added a getindex for splitV so it works the same as Base split; however, it uses the iterator underneath so it doesn't have to allocate the array. So this can now be done with:
```julia
function sumB(path)  # wrapped in a function so the update of `c` inside the loop is valid
    c = 0
    for line in eachlineV(path)
        if startswith(line, 'B')
            data = splitV(line, '\t')
            c += Parsers.parse(UInt32, data[2])
        end
    end
    return c
end
```
Note that this is something of an abuse of the implicit contract of getindex that it is O(1).
The generic way to get the n-th element of an Iterator x is something like first(Iterators.drop(x, n-1)). (I feel like we should have an Iterators.nth(x, n) API for this?)
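For illustration, such a helper is a one-liner over Base iterators (the name `nth` is hypothetical; it is not in Base):

```julia
# Hypothetical helper: n-th element of any iterator, without collecting
# it into an array first.
nth(itr, n) = first(Iterators.drop(itr, n - 1))

nth((x^2 for x in 1:10), 3)  # third element of a lazy generator → 9
```

Unlike a getindex, this makes the O(n) cost of reaching the n-th element explicit at the call site.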
This is a very nice and useful package. I wrote a PR so that your package can work with compressed files.
As I said in the PR: I'm dealing with very large files as well, which are almost always compressed with gzip, and very often I can't afford to load the whole file's content at once before processing the data, so I'm always looking for the fastest way to read files line by line.
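As a sketch of that streaming pattern (assuming CodecZlib.jl; the small file written here is only so the example is self-contained):

```julia
using CodecZlib  # GzipCompressorStream / GzipDecompressorStream

# Write a small gzipped file so the example is self-contained.
path = tempname() * ".gz"
open(path, "w") do io
    gz = GzipCompressorStream(io)
    write(gz, "A 1\nB 2\n")
    close(gz)  # flushes and finalizes the gzip stream
end

# The streaming pattern: wrap the file IO in a decompressing stream and
# iterate lines lazily, so the whole file is never decompressed in memory.
stream = GzipDecompressorStream(open(path))
lines = collect(eachline(stream))
close(stream)
```

In real use you would of course process each line inside the eachline loop rather than collect them.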
I was inspired by the eachline code in Base; maybe the whole package could be fully integrated by overloading all the Base functions concerned with strings (eachline, split, etc.) to play nicely with StringViews.