Reading tab-delimited file & memory allocation

lln · February 18, 2022, 9:51pm

Hi, I have the following code for parsing a tab-delimited file:

function read_file(file::String, chrom::Int64)::Dict
    vld_dict::Dict{String, Vector{String}} = Dict("SNP" => String[], "A1" => String[], "A2" => String[])
    nsnps::Int64 = 0
                
    open(file) do f
        while ! eof(f)
            ll = readline(f) .|> s -> split(s, "\t")
            if parse(Int64, ll[1]) == chrom
                nsnps += 1
                push!(vld_dict["SNP"], ll[2])
                push!(vld_dict["A1"], ll[5])
                push!(vld_dict["A2"], ll[6])
            end
        end
    end
    return vld_dict
end

Calling it with @time read_file("test.bim", 22) gives
0.001816 seconds (7.52 k allocations: 496.266 KiB) on the second run. Is there a way to optimize this so that it allocates less memory? The file test.bim has only 1000 lines with chrom = 22.

Thanks!

rafael.guerra · February 18, 2022, 10:15pm

A small optimization is to avoid the broadcast pipe (.|>), which usually allocates more than calling split directly:

ll = split(readline(f), "\t")
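
For reference, here is a sketch of the whole function with that change applied. The name read_file_split is only for illustration, and I have swapped the manual eof loop for eachline and dropped the unused counter, so treat it as a starting point rather than a drop-in replacement:

```julia
# Sketch only: `read_file_split` is a hypothetical name, not from the original post.
function read_file_split(file::AbstractString, chrom::Integer)
    vld_dict = Dict("SNP" => String[], "A1" => String[], "A2" => String[])
    for line in eachline(file)
        ll = split(line, '\t')              # direct call instead of `.|>`
        if parse(Int, ll[1]) == chrom       # column 1 holds the chromosome
            push!(vld_dict["SNP"], ll[2])   # SubString is converted to String on push!
            push!(vld_dict["A1"], ll[5])
            push!(vld_dict["A2"], ll[6])
        end
    end
    return vld_dict
end
```

Each line still allocates one String plus the vector returned by split, so this mainly tidies the loop rather than removing the per-line cost.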

lln · February 18, 2022, 10:19pm

Thanks! That does improve things a bit: 0.001276 seconds (6.52 k allocations: 480.641 KiB).

johnmyleswhite · February 19, 2022, 12:27am

Are you using a CSV parser like CSV.jl? Even if you don’t use it, its source code can help you understand how to write a fast parser.

One challenge is that Julia makes it harder than it needs to be to reuse a fixed byte buffer instead of building lots of new strings. Avoiding those allocations is one of the things optimized parsers handle.
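
To make the byte-buffer idea concrete, here is a rough sketch that reads the file once into a Vector{UInt8}, scans for tabs and newlines, and only builds a String for the fields that are kept. The names read_file_bytes and field_range are made up for this example. It assumes '\n' line endings and at least six fields per line, and it matches the chromosome by comparing bytes against string(chrom) rather than parsing, so "022" would not match 22:

```julia
# Sketch only: `read_file_bytes` and `field_range` are hypothetical names.
# Assumes '\n' line endings and at least 6 tab-separated fields per line.

# Byte range of the n-th tab-separated field within buf[lo:hi].
function field_range(buf::Vector{UInt8}, lo::Int, hi::Int, n::Int)
    start = lo
    for _ in 1:n-1                      # skip the first n-1 fields
        tab = findnext(==(UInt8('\t')), buf, start)
        (tab === nothing || tab > hi) && error("line has fewer than $n fields")
        start = tab + 1
    end
    stop = findnext(==(UInt8('\t')), buf, start)
    stop = (stop === nothing || stop > hi) ? hi : stop - 1
    return start:stop
end

function read_file_bytes(file::AbstractString, chrom::Integer)
    buf = read(file)                    # whole file as one byte vector
    target = codeunits(string(chrom))   # bytes to match in column 1
    snp, a1, a2 = String[], String[], String[]
    lo = 1
    while lo <= length(buf)
        nl = findnext(==(UInt8('\n')), buf, lo)
        hi = nl === nothing ? length(buf) : nl - 1
        if hi >= lo                     # skip empty lines
            if view(buf, field_range(buf, lo, hi, 1)) == target
                # Strings are created only for fields of matching lines.
                push!(snp, String(view(buf, field_range(buf, lo, hi, 2))))
                push!(a1,  String(view(buf, field_range(buf, lo, hi, 5))))
                push!(a2,  String(view(buf, field_range(buf, lo, hi, 6))))
            end
        end
        lo = hi + 2                     # first byte after the newline
    end
    return Dict("SNP" => snp, "A1" => a1, "A2" => a2)
end
```

The trade-off is that the whole file sits in memory at once, and field_range rescans each line from its start, so this is meant to show the shape of the approach rather than a tuned parser.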

lmiq · February 19, 2022, 12:34am

Wouldn’t a split!(buff, string, char) function be a nice addition to Base?

(and a readline!(buff, f) as well).
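
Something along these lines is what I have in mind for split!. To be clear, this function does not exist in Base; the signature and behavior below are just my guess at how it could look, refilling a caller-owned vector of SubStrings instead of returning a new one:

```julia
# Sketch only: `split!` with this signature is hypothetical, not part of Base.
function split!(buff::Vector{SubString{String}}, str::String, delim::Char)
    empty!(buff)                        # drop old fields, reuse the same vector
    start = firstindex(str)
    while true
        stop = findnext(delim, str, start)
        if stop === nothing             # no more delimiters: last field
            push!(buff, SubString(str, start, lastindex(str)))
            return buff
        end
        push!(buff, SubString(str, start, prevind(str, stop)))
        start = nextind(str, stop)      # continue after the delimiter
    end
end

# Usage: allocate the buffer once, then refill it for every line.
buff = SubString{String}[]
split!(buff, "22\trs1\t0\t100\tA\tG", '\t')
```

The fields are views into the line, so the line string itself still has to be allocated by readline; that is why a readline! would be needed as well.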

johnmyleswhite · February 19, 2022, 11:58am

Yes, but I would want something with a larger scope: something that lets me use all of the core string functions while operating on a byte buffer that I control and mutate. And I would want it in Base, so people don’t forget to add methods to it in the future.

It would be like std::string_view in C++, which was standardized in C++17.
