Hi, I have the following code for parsing a tab-delimited file:
```julia
function read_file(file::String, chrom::Int64)::Dict
    vld_dict::Dict{String, Vector{String}} = Dict("SNP" => String[], "A1" => String[], "A2" => String[])
    nsnps::Int64 = 0
    open(file) do f
        while !eof(f)
            ll = split(readline(f), '\t')
            # Keep only rows on the requested chromosome (column 1).
            if parse(Int64, ll[1]) == chrom
                nsnps += 1
                push!(vld_dict["SNP"], ll[2])
                push!(vld_dict["A1"], ll[5])
                push!(vld_dict["A2"], ll[6])
            end
        end
    end
    return vld_dict
end
```
Calling it with `@time read_file("test.bim", 22)` gives `0.001816 seconds (7.52 k allocations: 496.266 KiB)` on the second run. Is there a way to optimize this so it allocates less memory? The file `test.bim` has only 1000 lines with chrom = 22.
Are you using a CSV parser like CSV.jl? Even if you don't end up using it, its code can help you understand how to write a fast parser.
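For example, here is a minimal sketch with CSV.jl (assuming the file has no header row, so CSV.jl auto-names the columns `Column1`…`Column6`, and that the chromosome column is inferred as an integer):

```julia
using CSV

function read_file_csv(file::String, chrom::Int)
    vld_dict = Dict("SNP" => String[], "A1" => String[], "A2" => String[])
    # CSV.File does the buffering and type inference for us.
    for row in CSV.File(file; delim = '\t', header = false)
        row.Column1 == chrom || continue
        push!(vld_dict["SNP"], String(row.Column2))
        push!(vld_dict["A1"], String(row.Column5))
        push!(vld_dict["A2"], String(row.Column6))
    end
    return vld_dict
end
```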
One challenge is that Julia makes it a bit harder than it should be to avoid building lots of new strings instead of reusing a fixed byte buffer. That's one of the things optimized parsers handle for you.
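Staying in Base, here is a sketch of one cheap improvement (assuming the chromosome column has no leading whitespace): reject non-matching lines with a textual prefix check before splitting at all, and copy only the fields you actually keep:

```julia
function read_file2(file::String, chrom::Int)
    vld_dict = Dict("SNP" => String[], "A1" => String[], "A2" => String[])
    prefix = string(chrom, '\t')          # e.g. "22\t"
    for line in eachline(file)
        # Skip non-matching lines without splitting or parsing them.
        startswith(line, prefix) || continue
        fields = split(line, '\t')        # SubString views into `line`, not copies
        # Materialize only the three fields we keep, so `line` can be freed.
        push!(vld_dict["SNP"], String(fields[2]))
        push!(vld_dict["A1"], String(fields[5]))
        push!(vld_dict["A2"], String(fields[6]))
    end
    return vld_dict
end
```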
Yes, but I would want something larger in scope: something that lets me use all of the core string functions, but operating on a byte buffer that I control and mutate. And it should live in Base, so that people don't forget to add methods for it in the future.
It would be like `std::string_view` for C++, which sat unratified for years before finally landing in C++17.
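To illustrate, the closest things Base offers today are read-only: `SubString` (what `split` returns) is a zero-copy view that the core string functions already accept, and `codeunits` gives a zero-copy byte view; what's missing is the mutable, reusable buffer underneath.

```julia
line = "22\trs12345\t0\t16050075\tA\tG"

sv = SubString(line, 4, 10)      # zero-copy view over `line`, à la string_view
@assert sv == "rs12345"
@assert startswith(sv, "rs")     # core string functions accept SubString

bytes = codeunits(line)          # zero-copy, read-only byte view
@assert bytes[3] == UInt8('\t')  # but I cannot mutate or reuse this buffer
```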