I’ve been working on a project where I need to read specific rows and columns from a data file. To find the most efficient approach, I benchmarked CSV.jl and DelimitedFiles.jl in Julia against NumPy (via PyCall) in Python. The results were somewhat surprising, and I’m hoping to get some insight from the community.
In my case, NumPy and DelimitedFiles.jl perform similarly, with execution times around 300 microseconds, while CSV.jl takes about 3 milliseconds, roughly ten times longer.
Here’s an example of the code I used for each library:
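For reference, this is the shared setup I’m assuming for the snippets below (data_file is a placeholder; the real file has on the order of 1144 comment/header lines followed by whitespace-delimited numeric columns):

using CSV, DelimitedFiles, BenchmarkTools, PyCall

# Placeholder path; substitute the actual data file.
data_file = "data.txt"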
@benchmark a, b = CSV.File($data_file; skipto=1145, limit=40, comment="#",
                           header=false, ignorerepeated=true, delim=' ') |>
                  data -> (data.Column1, data.Column2)
@benchmark a, b = readdlm($data_file, skipstart=1145)[1:40, 1:2] |> x -> (x[:, 1], x[:, 2])
np = pyimport("numpy")

function python_code(path_to_file::String)
    # With unpack=true, genfromtxt returns the two columns as rows, and
    # PyCall converts the result to a 2x40 Julia Matrix. Destructure its
    # rows explicitly: a bare `z, p = matrix` would grab the first two
    # *elements* of the matrix, not the two columns.
    z_p, pdz_p = eachrow(np.genfromtxt(path_to_file, unpack=true,
                                       skip_header=1144, max_rows=40))
    return z_p, pdz_p
end

@benchmark a, b = python_code($data_file)
As you can see, I tried to keep the same structure across the three versions. All of the code lives inside functions (I haven’t shown the wrappers for the two Julia cases; they look roughly like the sketch below).
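For completeness, here is roughly what those wrappers look like (csv_code and dlm_code are just the names I’m using here):

function csv_code(path_to_file::String)
    data = CSV.File(path_to_file; skipto=1145, limit=40, comment="#",
                    header=false, ignorerepeated=true, delim=' ')
    # With header=false, CSV.jl names the columns Column1, Column2, ...
    return data.Column1, data.Column2
end

function dlm_code(path_to_file::String)
    # readdlm parses everything after skipstart, then we slice out the block.
    x = readdlm(path_to_file, skipstart=1145)[1:40, 1:2]
    return x[:, 1], x[:, 2]
end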
The final function must read the data at the rows and columns I need and return two vectors.
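As a sanity check (my addition, using the wrapper names above), all three versions should return the same two 40-element vectors before the timings are comparable:

a_csv, b_csv = csv_code(data_file)
a_dlm, b_dlm = dlm_code(data_file)
a_np, b_np = python_code(data_file)
@assert a_csv ≈ a_dlm ≈ a_np && b_csv ≈ b_dlm ≈ b_np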
How is it possible that CSV.jl is so slow? Did I miss something?
ALSO, the allocations reported by @benchmark:

- NumPy (via PyCall): 6 allocations, 3.55 KiB
- DelimitedFiles.jl: 617 allocations, 100 KiB
- CSV.jl: 78841 allocations, 1.28 MiB
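Given those allocation counts, my guess (and it is only a guess) is that CSV.jl pays per-call overhead for type detection and multithreading setup. One variant I still plan to try pins the column types and disables threading; types and ntasks are documented CSV.File keyword arguments (ntasks in recent CSV.jl versions):

@benchmark a, b = CSV.File($data_file; skipto=1145, limit=40, comment="#",
                           header=false, ignorerepeated=true, delim=' ',
                           types=Float64, ntasks=1) |>
                  data -> (data.Column1, data.Column2)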