Performance: read data from ascii file, replace `split`

Basically, I have a very large file with some data I need to read, of the form, let's say:

AA 1 10.0 
B   22 2.0

I have to use this data to populate an array of a struct:

struct Data
    name::String
    index::Int 
    value::Float64
end

So I’m doing something like:

function read_data(filename)
    data = Data[]
    open(filename, "r") do f
        for line in eachline(f)
            values = split(line)
            push!(data, Data(values[1], parse(Int, values[2]), parse(Float64, values[3])))
        end
    end
    return data
end

The thing is: I have profiled my current implementation of the file reading, and by far the most expensive task is `split(line)`, which additionally seems to be causing trouble because of the memory it allocates (this I was not expecting).

Anyway, is there a performant way to do the same thing without explicitly using `split`, and possibly avoid some of the intermediate allocations?

I tried a "manual" version where I loop over the characters of the line explicitly to find where the fields start and end, but the performance was worse.
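
For concreteness, a rough reconstruction of what I mean by the manual version (hypothetical code, assuming plain ASCII lines with at least three fields):

# Hypothetical reconstruction of the manual approach: scan the line once for
# field boundaries instead of calling split (assumes ASCII, >= 3 fields).
function parse_line_manual(line::AbstractString)
    i1 = findfirst(!isspace, line)              # start of the name field
    j1 = findnext(isspace, line, i1) - 1        # end of the name field
    i2 = findnext(!isspace, line, j1 + 1)       # start of the index field
    j2 = findnext(isspace, line, i2) - 1        # end of the index field
    i3 = findnext(!isspace, line, j2 + 1)       # start of the value field
    j3 = something(findnext(isspace, line, i3), lastindex(line) + 1) - 1
    return Data(String(SubString(line, i1, j1)),
                parse(Int, SubString(line, i2, j2)),
                parse(Float64, SubString(line, i3, j3)))
end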

Can Base.eachsplit help with allocations? Also, I think similar things have been discussed previously, for example in Performance of splitting string and parsing numbers.

Update: another interesting idea for eachsplit is Add method `split(str, dlm, ::Val{N})` for allocation-free splitting by jakobnissen · Pull Request #43557 · JuliaLang/julia · GitHub:
"Good news: NTuple{N}(eachsplit(str, delim)) appears to work for this with no allocation 🎉"


Have you tried CSV.jl? You can use it with non-comma files: Examples · CSV.jl
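
For the example data in the first post, something along these lines might work (a sketch; the file name and column names are made up to match the struct fields):

using CSV

# Sketch: space-delimited file, runs of spaces collapsed, no header line.
rows = CSV.File("data.txt";
                delim = ' ',
                ignorerepeated = true,
                header = ["name", "index", "value"],
                types = [String, Int, Float64])

data = [Data(r.name, r.index, r.value) for r in rows]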


Yes, it did. It reduces allocations by half, but (at least with what I have tried so far) the performance dropped. Still, it seems like a path forward. Thanks for pointing that out.

I cannot really use it directly, because the file structure is more complicated. But that gave me the idea to try it in a simpler case, and it seems that what CSV.jl does is ~10x faster than what I was able to achieve. So I will definitely take a look at how it reads the file and try to imitate it, or use it indirectly in some way.

Have you tried something along these lines as well? Since you know the number of fields, you can try specializing on it:

function parse_str(str)
    # no explicit delimiter: eachsplit then treats runs of whitespace as one separator
    (a, b, c) = NTuple{3}(eachsplit(str))
    return (a, parse(Int, b), parse(Float64, c))
end
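
For reference, a sketch of how this could plug into the reading loop from the first post:

function read_data_ntuple(filename)
    data = Data[]
    open(filename, "r") do f
        for line in eachline(f)
            # NTuple{3} materializes the iterator without an intermediate array
            name, idx, val = NTuple{3}(eachsplit(line))
            push!(data, Data(String(name), parse(Int, idx), parse(Float64, val)))
        end
    end
    return data
end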

Can you somehow treat the text file as fixed width?

I ended up using something like this (I simplified/renamed some stuff and did not try running this exact code):

using Parameters, StaticArrays   # for @with_kw and SVector

# Fixed column widths of the 25 fields, plus the derived start/end columns
@with_kw struct Config
    nchar      :: SVector{25, Int64} = SVector{25, Int64}([13; 3; repeat([4], 2); repeat([3], 6); repeat([4], 15)])
    col_end    :: SVector{25, Int64} = cumsum(nchar)
    col_start  :: SVector{25, Int64} = col_end - nchar .+ 1
end

# Skips header and trims blank lines
function readlines2(fname)
    io = open(fname, "r")
    seek(io, 206)
    x = read(io, String)
    close(io)

    return split(x[1:(end - 2)], "\n")
end

function get_intvals(strvec, idx, col_start, col_end)
    res = (parse(Int16, strvec[j][col_start[idx]:col_end[idx]]) for j in eachindex(strvec))
    return res
end

const NROW   = 30
const CONFIG = Config()

# Still allocates, but ~4x less than readlines
strvec = readlines2(fname)

# Does not allocate
@SVector [SVector{NROW}(get_intvals(strvec, idx, CONFIG.col_start, CONFIG.col_end)) for idx in 5:25]

I would love to be able to read/write from file without any allocations but could not figure that one out.

I tried it now. That improved the performance, and I recovered the time of the split version with less memory usage. Still much slower than CSV.jl for the same size, though.

No, not really. Actually, I don't know the number of fields in advance, and the fields might be misaligned. At least the number of fields does not change within a file, so I can do the specialization suggested above when reading each file.
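
A sketch of what that per-file specialization could look like (the helper names and structure are hypothetical, just to illustrate the function-barrier idea): count the fields on the first line, then pass the count through a function barrier so the inner loop is compiled for a fixed N.

# Hypothetical sketch: specialize the inner loop on the per-file field count.
# Assumes every line of the file has the same number of fields.
function read_fields(filename)
    open(filename, "r") do f
        n = length(split(readline(f)))   # one split on the first line is cheap
        seekstart(f)
        return _read_fields(f, Val(n))
    end
end

function _read_fields(f, ::Val{N}) where {N}
    rows = NTuple{N, SubString{String}}[]
    for line in eachline(f)
        push!(rows, NTuple{N}(eachsplit(line)))
    end
    return rows
end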

Some other suggestions: CSV.jl is multi-threaded by default, so this might affect the timings. It also uses InlineStrings.jl (JuliaStrings/InlineStrings.jl on GitHub), fixed-width string types for Julia, which helps a lot to avoid allocations. I also tried unrolling the loop; this helps a bit, but maybe the bottleneck is now not in the split logic.

# Assumes fields are separated by single spaces; @views avoids copying the slices.
@inbounds function parse_str_unrolled(str)
    ix_beg = findfirst((==)(' '), str)              # space after the name field
    ix_end = findnext((==)(' '), str, ix_beg+1)     # space after the index field
    a = str[1:ix_beg-1]
    b = @views parse(Int, str[ix_beg+1:ix_end-1])
    c = @views parse(Float64, str[ix_end+1:end])
    (a, b, c)
end
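
On the InlineStrings point, a minimal sketch of what a fixed-width string field could look like (String15 is an arbitrary choice; pick the smallest InlineString type that fits your names):

using InlineStrings

# Sketch: an inline, fixed-width string type avoids one heap allocation per record.
struct DataInline
    name::String15       # up to 15 bytes stored inline
    index::Int
    value::Float64
end

# usage with the unrolled parser above:
# a, b, c = parse_str_unrolled(line)
# push!(data, DataInline(String15(a), b, c))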

I think my issue now is to make the parsing type-stable.
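
One pattern that might help (just a sketch, not necessarily what I ended up doing; whether it fully infers depends on constant propagation) is to drive the parsing from a tuple of target types:

# Sketch: parse the fields of one line according to a tuple of target types.
parsefield(::Type{String}, s) = String(s)
parsefield(::Type{T}, s) where {T} = parse(T, s)

function parse_line(line, ::Type{T}) where {T<:Tuple}
    parts = NTuple{fieldcount(T)}(eachsplit(line))
    return ntuple(i -> parsefield(fieldtype(T, i), parts[i]), fieldcount(T))
end

parse_line("AA 1 10.0", Tuple{String, Int, Float64})   # ("AA", 1, 10.0)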

If that helps, CSV.jl can parse directly from byte arrays. So, if the problem is that the file contains extra data besides the table, it may still be faster to mark the beginning and end of the table, read the data in between into a Vector{UInt8}, and pass it to CSV.jl.


You mean reading the table as a single block of data?

(I don’t think I can handle two copies of the data in memory)

Have you looked at Automa.jl at all? It's designed for much more complicated formats, but should be usable in this case.

I would also have suggested CSV.jl as a first choice, but if that’s not usable, Automa.jl might be a reasonable second choice


Yes, basically reading only the table part as a contiguous block and passing it as an argument.
I guess it is possible to avoid multiple copies of the data by using mmap, because this is what CSV.jl itself ultimately does. From the docs:

Any delimited input is ultimately converted to a byte buffer (Vector{UInt8}) for parsing/processing, so with that in mind, let’s look at the various supported input types:

  • File name as a String or FilePath; parsing will call Mmap.mmap(string(file)) to get a byte buffer to the file data. For gzip compressed inputs, like file.gz, the CodecZlib.jl package will be used to decompress the data to a temporary file first, then mmapped to a byte buffer. Decompression can also be done in memory by passing buffer_in_memory=true. Note that only gzip-compressed data is automatically decompressed; for other forms of compressed data, seek out the appropriate package to decompress and pass an IO or Vector{UInt8} of decompressed data as input.
  • Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}}: if you already have a byte buffer from wherever, you can just pass it in directly. If you have a csv-formatted string, you can pass it like CSV.File(IOBuffer(str))
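
So a sketch of that idea could look like this (the byte offsets of the table, and how to find them, are assumptions here):

using CSV, Mmap

# Sketch: mmap the whole file and hand only the table's byte range to CSV.jl,
# so the table bytes are never copied into a second buffer. table_start and
# table_stop are assumed to be found elsewhere (e.g. by scanning for markers).
function read_table(fname, table_start::Int, table_stop::Int)
    buf = Mmap.mmap(fname)                     # Vector{UInt8} backed by the file
    table = @view buf[table_start:table_stop]  # SubArray, accepted by CSV.File
    return CSV.File(table; delim = ' ', ignorerepeated = true, header = false)
end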

The solution then had 3 parts (a rough sketch combining them follows the list):

  1. using eachsplit as suggested in Performance: read data from ascii file, replace `split` - #2 by artemsolod

  2. using the solution provided by Mason here to parse and set the fields in a type-stable manner: Unroll setfield! - #3 by Mason

  3. Use InlineStrings as indicated in: Performance: read data from ascii file, replace `split` - #8 by artemsolod to reduce the memory footprint of the data structure being created.
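
Roughly, the combination looks like this (simplified to the three-column example from the first post; the real reader handles all the columns of the CIF file):

using InlineStrings

struct Atom
    name::String7        # inline string, so the struct stays isbits
    index::Int
    value::Float64
end

function read_atoms(filename)
    atoms = Atom[]
    open(filename, "r") do f
        for line in eachline(f)
            # (1) split into a fixed-size tuple without an intermediate array
            name, idx, val = NTuple{3}(eachsplit(line))
            # (2) parse each field to a concrete type; (3) inline string for the name
            push!(atoms, Atom(String7(name), parse(Int, idx), parse(Float64, val)))
        end
    end
    return atoms
end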

The end result is very good. I can now read my 60M data objects in under two minutes:

julia> @time ats = readCIF("./all.cif")
106.647978 seconds (129.37 M allocations: 12.242 GiB, 14.24% gc time, 0.03% compilation time)
   Array{Atoms,1} with 64423983 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1   NP     PRO     7        1        1  171.946  588.581  135.200  1.00  0.00     0                 1
       2   HC     PRO     7        1        1  172.571  588.749  134.422  1.00  0.00     0                 2
       3   HC     PRO     7        1        1  171.019  588.890  134.923  1.00  0.00     0                 3
                                                       ⋮ 
64423981  CLA     CLA     I       50 20452615  104.220  615.013 -331.799  1.00  0.00     0          64423981
64423982  CLA     CLA     I       51 20452616  130.543  586.064 -347.000  1.00  0.00     0          64423982
64423983  CLA     CLA     I       52 20452617   87.912  628.908 -347.424  1.00  0.00     0          64423983
