Basically, I have a very large file with data I need to read, of the form, let's say:
```
AA 1 10.0
B 22 2.0
```
I have to use this data to populate an array of a struct:
```julia
struct Data
    name::String
    index::Int
    value::Float64
end
```
So I’m doing something like:
```julia
function read_data(filename)
    data = Data[]
    open(filename, "r") do f
        for line in eachline(f)
            values = split(line)
            push!(data, Data(values[1], parse(Int, values[2]), parse(Float64, values[3])))
        end
    end
    return data
end
```
The thing is: I have profiled my current implementation of the file reading, and the most expensive step, by far, is `split(line)`, which additionally seems to be causing problems because of the memory it allocates (this I was not expecting).
Anyway, is there a performant way to do the same without explicitly using `split`, and possibly avoid some of the intermediate allocations?
I tried a "manual" version where I loop over the content of the line explicitly to recognize where the fields start and end, but performance degraded.
Yes, it did. It reduces allocations by half, but (at least with what I have tried so far) the performance dropped. It still seems like a path forward, though. Thanks for pointing that out.
I cannot really use it directly, because the file structure is more complicated. But that gave me the idea to try it in a simpler case, and it seems that what CSV.jl does is ~10× faster than what I was able to achieve. So I will definitely take a look at how it reads the file and try to imitate it, or use it indirectly in some way.
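For reference, the simpler case I'm benchmarking against is roughly this (the column names and types here are just my choices for the test):

```julia
using CSV

# Let CSV.File parse the three columns, then build the structs from the rows.
function read_data_csv(filename)
    f = CSV.File(filename;
                 header = [:name, :index, :value],   # the file has no header row
                 types  = [String, Int, Float64],
                 delim  = ' ')
    return [Data(r.name, r.index, r.value) for r in f]
end
```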
Did it now. That improved the performance, and I recovered the time of the `split` version with less memory usage. It is still much slower than CSV.jl for a file of the same size.
No, not really. Actually, I don't know the number of fields in advance, and the fields might be misaligned. At least the number of fields does not change within a given file, so I can do the specialization suggested above when reading one file.
Some other suggestions: CSV.jl is multi-threaded by default, so this might affect timings. It also uses InlineStrings.jl (https://github.com/JuliaStrings/InlineStrings.jl), which helps a lot to avoid allocations (see the sketch after the code below). I tried unrolling the loop; this helps a bit, but maybe the bottleneck is now not in the split logic.
```julia
# Unrolled parsing: find the two space positions once, then slice the fields
# directly, parsing the numeric fields from views to avoid extra copies.
@inbounds function parse_str_unrolled(str)
    ix_beg = findfirst((==)(' '), str)
    ix_end = findnext((==)(' '), str, ix_beg + 1)
    a = str[1:ix_beg-1]
    b = @views parse(Int, str[ix_beg+1:ix_end-1])
    c = @views parse(Float64, str[ix_end+1:end])
    (a, b, c)
end
```
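And here is roughly what I mean about InlineStrings.jl: store the name in a fixed-width string type so that constructing it does not allocate on the heap (`String15` is just my assumption about the maximum name length):

```julia
using InlineStrings

# Fixed-width name field: a String15 holds up to 15 bytes inline, so building
# it does not allocate (assumes no name is longer than 15 bytes).
struct DataInline
    name::String15
    index::Int
    value::Float64
end

# In the unrolled parser above, the name could then be built as
# a = String15(@view str[1:ix_beg-1])
```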
If that helps, CSV.jl can parse directly from byte arrays. So, if the problem is that the file contains extra data besides the table, it may still be faster to mark the beginning and end of the table, read the data in between into a Vector{UInt8}, and pass that to CSV.jl.
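Something along these lines, where `table_start` and `table_end` are hypothetical byte offsets you would have determined for the table region:

```julia
using CSV

# table_start / table_end are hypothetical byte offsets bounding the table.
bytes = open(filename) do io
    seek(io, table_start)
    read(io, table_end - table_start)   # read only that block as a Vector{UInt8}
end
tbl = CSV.File(bytes; header = [:name, :index, :value], delim = ' ')
```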
Yes, basically reading only the table part as a contiguous block and passing it as the argument.
I guess it is possible to avoid multiple copies of the data using mmap, because this is what CSV.jl itself ultimately does. From the docs:
> Any delimited input is ultimately converted to a byte buffer (`Vector{UInt8}`) for parsing/processing, so with that in mind, let's look at the various supported input types:
>
> * File name as a `String` or `FilePath`; parsing will call `Mmap.mmap(string(file))` to get a byte buffer to the file data. For gzip compressed inputs, like `file.gz`, the CodecZlib.jl package will be used to decompress the data to a temporary file first, then mmapped to a byte buffer. Decompression can also be done in memory by passing `buffer_in_memory=true`. Note that only gzip-compressed data is automatically decompressed; for other forms of compressed data, seek out the appropriate package to decompress and pass an `IO` or `Vector{UInt8}` of decompressed data as input.
> * `Vector{UInt8}` or `SubArray{UInt8, 1, Vector{UInt8}}`: if you already have a byte buffer from wherever, you can just pass it in directly. If you have a csv-formatted string, you can pass it like `CSV.File(IOBuffer(str))`.
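So, in principle, the copy can be skipped entirely by mmapping the file yourself and handing `CSV.File` a view of the table region; a sketch, where `table_start` and `table_end` are again hypothetical byte offsets:

```julia
using Mmap, CSV

bytes = Mmap.mmap(filename)                  # whole file as a Vector{UInt8}, no copy
table = @view bytes[table_start:table_end]   # SubArray over just the table region
tbl   = CSV.File(table; header = [:name, :index, :value], delim = ' ')
```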