Reading tab-delimited file & memory allocation

Hi, I have the following code for parsing a tab-delimited file:

function read_file(file::String, chrom::Int64)::Dict
    vld_dict::Dict{String, Vector{String}} = Dict("SNP" => String[], "A1" => String[], "A2" => String[])
    nsnps::Int64 = 0
    open(file) do f
        while ! eof(f)
            ll = readline(f) .|> s -> split(s, "\t")
            # keep only rows whose first column matches the requested chromosome
            if parse(Int64, ll[1]) == chrom
                nsnps += 1
                push!(vld_dict["SNP"], ll[2])
                push!(vld_dict["A1"], ll[5])
                push!(vld_dict["A2"], ll[6])
            end
        end
    end
    return vld_dict
end

Calling it with @time read_file("test.bim", 22) gives
0.001816 seconds (7.52 k allocations: 496.266 KiB) on the second run. Is there a way to optimize this so that it allocates less memory? The file test.bim has only 1000 lines with chrom = 22.


A small optimization is to avoid the pipe, since pipes usually allocate more than writing the call explicitly:

ll = split(readline(f) , "\t")

Thanks! That does improve things a bit (0.001276 seconds, 6.52 k allocations: 480.641 KiB).

Are you using a CSV parser like CSV.jl? Even if you don’t use it, that code can help you understand how to write a fast parser.

One challenge is that Julia makes it harder than it needs to be to avoid creating lots of new strings instead of reusing a fixed byte buffer. Avoiding those allocations is one of the things optimized parsers handle.
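
As a rough illustration of the "fewer intermediate strings" idea, here is a minimal sketch that pulls out only the columns it needs instead of calling split on every line. The names nth_field and read_file_lean are hypothetical helpers made up for this example, not part of Base or CSV.jl, and the function takes an IO rather than a file name so it is easy to try on in-memory data. Whether it actually allocates less on your file is something to check with @time.

```julia
# Hypothetical helper: return the n-th delimiter-separated field of `line`
# as a SubString, without building the vector that `split` would allocate.
function nth_field(line::AbstractString, n::Integer; delim::Char = '\t')
    start = firstindex(line)
    for _ in 1:(n - 1)
        pos = findnext(delim, line, start)
        pos === nothing && throw(ArgumentError("line has fewer than $n fields"))
        start = nextind(line, pos)
    end
    stop = findnext(delim, line, start)
    stop_idx = stop === nothing ? lastindex(line) : prevind(line, stop)
    return SubString(line, start, stop_idx)
end

# Hypothetical variant of read_file that only materializes the kept columns.
function read_file_lean(io::IO, chrom::Integer)
    snp, a1, a2 = String[], String[], String[]
    for line in eachline(io)
        # skip rows for other chromosomes before touching the other columns
        parse(Int, nth_field(line, 1)) == chrom || continue
        push!(snp, String(nth_field(line, 2)))
        push!(a1, String(nth_field(line, 5)))
        push!(a2, String(nth_field(line, 6)))
    end
    return Dict("SNP" => snp, "A1" => a1, "A2" => a2)
end
```

For a file on disk this could be called as open(f -> read_file_lean(f, 22), "test.bim"). Each call to nth_field rescans the line from the start, which keeps the sketch short; a real parser would walk the line once.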


Wouldn’t a split!(buff, string, char) function be a nice addition to base?

(and a readline!(buff, f) as well).
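
To make the suggestion concrete, this is roughly what such a function could look like. It is only a sketch: split! does not exist in Base, and the signature here (a Vector{SubString{String}} buffer, a String, and a single Char delimiter) is an assumption chosen for illustration.

```julia
# Hypothetical `split!`: NOT part of Base. Refills `buff` with the fields of
# `str`, reusing the vector's storage instead of allocating a new one per call.
function split!(buff::Vector{SubString{String}}, str::String, delim::Char)
    empty!(buff)  # drops the old fields but keeps the allocated capacity
    start = firstindex(str)
    while true
        pos = findnext(delim, str, start)
        if pos === nothing
            # last field: everything from `start` to the end of the string
            push!(buff, SubString(str, start, lastindex(str)))
            return buff
        end
        push!(buff, SubString(str, start, prevind(str, pos)))
        start = nextind(str, pos)
    end
end
```

In a read loop the buffer would be created once, e.g. buff = SubString{String}[], and passed to split!(buff, line, '\t') on every iteration. The line itself is still a fresh String each time, which is the part a readline! would have to address.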

Yes, but I would want something with a larger scope: something that lets me use all of the core string functions while operating on a byte buffer that I control and mutate. And it should be in Base, so people don’t forget to add methods to it in the future :)

It would be like string_view in C++ (std::string_view, which was standardized in C++17).
