CSV.jl vs DelimitedFiles vs NumPy

I’ve been working on a project where I need to read specific rows and columns from a data file. To determine the most efficient approach, I conducted benchmarks using CSV.jl, DelimitedFiles.jl, and NumPy in Python. The results were somewhat surprising, and I’m hoping to gain some insights from the community.

In my case, I noticed that NumPy and DelimitedFiles.jl perform similarly, with execution times around 300 microseconds. However, CSV.jl shows a significantly longer execution time, about 3 milliseconds, roughly ten times slower.

Here’s an example of the code I used for each library:

using CSV, DelimitedFiles, BenchmarkTools, PyCall

@benchmark a, b = CSV.File(data_file; skipto=1145, limit=40, comment="#", header=false, ignorerepeated=true, delim=' ') |> data -> (data.Column1, data.Column2)

@benchmark a, b = readdlm(data_file, skipstart=1145)[1:40, 1:2] |> x -> (x[:, 1], x[:, 2])

np = pyimport("numpy")
function python_code(path_to_file::String)
    # NumPy via PyCall, with the equivalent skip/limit settings
    z_p, pdz_p = np.genfromtxt(path_to_file, unpack=true, skip_header=1144, max_rows=40)
    return z_p, pdz_p
end

@benchmark a, b = python_code(data_file)

As you can see, I tried to keep the same structure across the three versions.
All the code is inside functions (I have not shown the wrappers for the Julia cases).
The final function must read the data at the rows I need and return two vectors.
How is it possible that CSV.jl is so slow?
Did I miss something?

ALSO, the allocations:
NumPy: 6 allocations, 3.55 KiB
DelimitedFiles: 617 allocations, 92 KiB
CSV: 78841 allocations, 1.28 MiB


I’m sure others will chime in with more useful answers, but this might be a bit of a sledgehammer-to-crack-a-nut situation: CSV.jl is great for multi-threaded reading of huge CSV files, and not necessarily targeted at ingesting tiny amounts of data.
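For context, a minimal sketch of the workload CSV.jl is tuned for, assuming a large uniform table in a hypothetical file big.csv:

using CSV, DataFrames

# CSV.jl can split the parsing of a large file across multiple tasks;
# the `ntasks` keyword caps how many are used.
df = CSV.read("big.csv", DataFrame; ntasks = Threads.nthreads())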


Could you provide the data, or at least the characteristics of data_file?
It seems to me that CSV.File is much (much) faster than readdlm.

I cannot provide the file,
but here is its structure:

#... data 1 percentile .....
0 1 2 3 4 5 6
(I don't need them, 1 row multiple columns)

# ... data 1...
    0.1000  1.825E-029
    0.3000  6.247E-016
    0.5000  3.227E-007
    0.7000  4.726E-008
    0.9000  3.678E-008
... (data that I need, multiple rows, 2 col)
...
#... data 1 percentile .....
0 1 2 3 4 5 6
(I don't need them, 1 row multiple columns)

#.... data 2 ....
(I don't need them)

... and so on

This is a test done on a numerical matrix (10^5 × 5):

julia> open("dlm.txt", "w") do io
           writedlm(io, rand(10^5,5))
       end

julia>  data_file =   raw"dlm.txt"
"dlm.txt"

julia> @benchmark a, b = CSV.File(data_file; header=false, skipto=1145, limit=40, comment="#", ignorerepeated=true, delim='\t') |> DataFrame |> x -> (x.Column1, x.Column2)
BenchmarkTools.Trial: 485 samples with 1 evaluation.
 Range (min … max):   8.370 ms … 21.919 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.767 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.314 ms ±  1.607 ms  ┊ GC (mean ± σ):  1.49% ± 4.32%

       ▇  ▁█▁
  ▄▂▂▅███████▃▄▄▄▅▆█▆▅▅▄▃▂▂▃▃▂▃▂▃▃▂▁▁▂▁▃▃▂▂▁▂▂▂▂▁▁▂▁▁▁▂▁▁▁▁▁▂ ▃
  8.37 ms         Histogram: frequency by time        16.8 ms <

 Memory estimate: 3.74 MiB, allocs estimate: 239396.

julia> @benchmark a, b = readdlm(data_file, '\t', skipstart=1145)[1:40, 1:2] |> x -> (x[:, 1], x[:, 2])
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  403.675 ms … 423.900 ms  ┊ GC (min … max): 0.80% … 0.62%
 Time  (median):     415.013 ms               ┊ GC (median):    0.63%
 Time  (mean ± σ):   413.840 ms ±   6.812 ms  ┊ GC (mean ± σ):  0.65% ± 0.16%

  ▁  ▁  ▁           ▁▁ ▁            ▁▁           █   ▁▁       ▁
  █▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁██▁█▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁█▁▁▁██▁▁▁▁▁▁▁█ ▁
  404 ms           Histogram: frequency by time          424 ms <

 Memory estimate: 54.46 MiB, allocs estimate: 1481306.

Could it be the structure of the file? It is not uniform and quite complex.

Could be. But we can only take your word for it unless you give us a more detailed description of the structure of your data:
how many lines?
how many columns?
what data type is in each column?

data.toml (29.1 KB)
Here is a copy of a data file that I no longer need.
It is a text file of 1200 lines that has the same structure I described.

julia> @benchmark a,b = CSV.File(data_file; skipto=1145,limit=40, comment="#", header=false,ignorerepeated=true, delim=' ') |> data -> (data.Column1, data.Column2)
BenchmarkTools.Trial: 1687 samples with 1 evaluation.
 Range (min … max):  2.755 ms …   6.096 ms  ┊ GC (min … max): 0.00% … 41.54%
 Time  (median):     2.888 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.960 ms ± 273.262 μs  ┊ GC (mean ± σ):  1.26% ±  4.94%

   ▄▇▇█▆▄▄▃▅▄▃▂ ▁                                             ▁
  ███████████████▇▇▇▆▅▁▄▄▅▁▄▁▁▁▁▁▁▁▁▁▄▅▁▁▁▁▁▁▁▁▁▁▁▁▄▅▅▅▅▅▆▆▇▅ █
  2.76 ms      Histogram: log(frequency) by time      4.42 ms <

 Memory estimate: 1.28 MiB, allocs estimate: 78841.

julia> @benchmark a, b = readdlm(data_file, skipstart = 1145)[1:40, 1:2] |> x -> (x[:, 1], x[:, 2])
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  238.100 μs …  3.206 ms  ┊ GC (min … max): 0.00% … 86.76%
 Time  (median):     249.100 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   258.061 μs ± 93.133 μs  ┊ GC (mean ± σ):  1.46% ±  3.66%

   ▂▂▆█▇▄▅▆▅▄▄▅▄▄▃▂▁▁                                          ▂
  █████████████████████▆▆▇▆▆▇▆▇▇▇▇▇▆▇▇▆▆▆▆▆▅▄▄▅▅▅▅▄▅▄▄▄▅▄▅▃▅▄▄ █
  238 μs        Histogram: log(frequency) by time       340 μs <

 Memory estimate: 92.02 KiB, allocs estimate: 615.

This is a weird file, with many extraneous columns. I think CSV.jl is probably spending a lot of time trying to figure out which columns exist and which do not. CSV.jl is best suited for complicated and long CSV files that are nonetheless tabular, in that they have the same number of columns in every row.

I think if performance really matters in this instance you should do some pre-processing so the files are more standardized.
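A minimal sketch of that pre-processing idea, assuming the block of interest starts at line 1145 and is 40 lines long (the helper standardize and the file clean.csv are hypothetical): extract the two-column block once into a clean file, which CSV.jl can then parse without guessing the layout.

using CSV

# One-off pre-processing: copy only the wanted two-column block
# into a clean file with a uniform layout.
function standardize(src::String, dst::String; skip = 1144, keep = 40)
    open(dst, "w") do out
        for l in Iterators.take(Iterators.drop(eachline(src), skip), keep)
            println(out, strip(l))
        end
    end
end

standardize(data_file, "clean.csv")
tbl = CSV.File("clean.csv"; header = false, delim = ' ', ignorerepeated = true)
a, b = tbl.Column1, tbl.Column2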


Could you measure this on your file?

function rtoml(weirdfile)
    tab = String[]
    open(weirdfile, "r") do lui
        for l in readlines(lui)
            # the two-column data lines happen to be exactly 22 characters long
            length(l) == 22 && push!(tab, l)
        end
    end
    tab
end

rtoml("data.toml")

function parsetoml(weirdfile)
    tab = Float64[]   # concretely typed; an untyped `[]` would give a Vector{Any}
    open(weirdfile, "r") do lui
        for l in readlines(lui)
            if length(l) == 22 && !startswith(l, "#")
                append!(tab, parse.(Float64, split(lstrip(l))))
            end
        end
    end
    reshape(tab, 2, :)'   # values were appended as z, pdz, z, pdz, …
end

Hi,
so I ran the functions on my PC with the file, and here are the results:

function rtoml(weirdfile)
    tab = String[]
    open(weirdfile, "r") do lui
        for l in readlines(lui)
            length(l) == 22 && push!(tab, l)
        end
    end
    tab
end
julia> @benchmark rtoml(data_file)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   96.900 μs …   4.115 ms  ┊ GC (min … max): 0.00% … 92.75%
 Time  (median):     100.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   109.177 μs ± 120.146 μs  ┊ GC (mean ± σ):  4.79% ±  4.23%

  ▃▅█▇▆▅▅▃▂▂▁▂▂▂▁▁                                              ▂
  ████████████████▇▇▇▆▇▅▆▇▇▇▇▆▆▆▄▇▆▆▄▆▅▅▄▁▅▄▆▁▆▄▄▅▅▄▅▃▄▄▅▄▁▄▄▄▅ █
  96.9 μs       Histogram: log(frequency) by time        173 μs <

 Memory estimate: 103.06 KiB, allocs estimate: 2432.

function parsetoml_fast(weirdfile)
    lines = readlines(weirdfile)
    tab = Float64[]

    for l in lines
        if length(l) == 22 && !startswith(l, "#")
            rn = parse.(Float64, split(strip(l)))
            append!(tab, rn)
        end
    end

    reshape(tab, 2, length(tab) ÷ 2)'
end


 @benchmark parsetoml_fast(data_file)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  409.700 μs …   3.920 ms  ┊ GC (min … max): 0.00% … 86.92%
 Time  (median):     426.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   452.244 μs ± 229.126 μs  ┊ GC (mean ± σ):  4.16% ±  7.08%

   ▁▁██▁
  ▄██████▇▅▄▄▃▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▂▂▂ ▃
  410 μs           Histogram: frequency by time          632 μs <

 Memory estimate: 323.73 KiB, allocs estimate: 4235.

However, only parsetoml_fast returns a matrix of the elements in the form I want, and even so it is still faster than CSV.jl.
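A small follow-up sketch (variable names are illustrative) for turning that matrix into the two vectors the original goal calls for:

M = parsetoml_fast(data_file)
a, b = M[:, 1], M[:, 2]   # the z and pdz columns as two separate vectors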

Just for fun, I wrote my own version of the function. Keep in mind that I am not an expert, but I find the results quite interesting:

function read_matrix_from_file(file_path, start_line, num_lines)
    tab = Array{Float64, 2}(undef, num_lines, 2)
    open(file_path, "r") do file
        for _ in 1:(start_line - 1)
            readline(file)  # skip lines until the start line
        end

        for i in 1:num_lines
            line = readline(file)
            tab[i, :] = parse.(Float64, split(strip(line)))
        end
    end

    # tab is already num_lines × 2; a trailing reshape(tab, 2, num_lines)'
    # would scramble the column-major values
    tab
end

@benchmark read_matrix_from_file(data_file, 1145, 40)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  58.300 μs …  2.433 ms  ┊ GC (min … max): 0.00% … 93.31%
 Time  (median):     60.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   65.575 μs ± 82.224 μs  ┊ GC (mean ± σ):  4.65% ±  3.62%

  ▅▇█▇▆▆▆▄▃▂▁▁      ▁▁▁▂▂▂▂▁                                  ▂
  ██████████████▇█████████████▇▇▆▆▆▆▆▅▄▅▄▄▄▃▃▃▁▆▆▆▆▆▆▆▅▄▆▄▁▅▅ █
  58.3 μs      Histogram: log(frequency) by time      96.3 μs <

 Memory estimate: 68.75 KiB, allocs estimate: 1322.

Even though my function makes more allocations than DelimitedFiles.jl (1322 vs 615), it is still faster.
How?

An alternative is to make almost everything lazy:

function lazilytoml(weirdfile,l2take=40,l2skip=1140)
    tab = Array{Float64, 2}(undef, l2take, 2)
    open(weirdfile, "r") do lui
        for (i,l) in enumerate(Iterators.take(Iterators.drop(eachline(lui),l2skip),l2take))
            tab[i,:].=parse.(Float64,split(lstrip(l)))
        end
    end
    tab
end

It does not work: the array broadcasting gives a shape mismatch.

I’m not sure if you are referring to the lazy version, or which file you applied it to.

julia> function lazilytoml(weirdfile,l2take=40,l2skip=1140)
           tab = Array{Float64, 2}(undef, l2take, 2)
           open(weirdfile, "r") do lui
               for (i,l) in enumerate(Iterators.take(Iterators.drop(eachline(lui),l2skip),l2take))
                   tab[i,:].=parse.(Float64,split(lstrip(l)))
               end
           end
           tab
       end
lazilytoml (generic function with 3 methods)

julia> using CSV, DataFrames, BenchmarkTools, DelimitedFiles, InlineStrings

julia> lazilytoml("data.toml")
40×2 Matrix{Float64}:
 0.1  9.238e-27
 0.3  1.954e-6
 0.5  6.797e-5
 0.7  0.0008787
 0.9  0.001167
 1.1  0.00387
 1.3  0.01073
 1.5  0.02365
 1.7  0.03189
 1.9  0.04394
 2.1  0.04676
 2.3  0.05888
 2.5  0.06186
 2.7  0.07058
 2.9  0.06109
 3.1  0.05216
 3.3  0.05766
 3.5  0.04022
 3.7  0.04809
 3.9  0.03633
 4.1  0.03223
 4.3  0.03391
 4.5  0.03049
 4.7  0.03354
 4.9  0.02944
 5.1  0.02683
 5.3  0.02231
 5.5  0.02401
 5.7  0.01801
 5.9  0.01859
 6.1  0.01313
 6.3  0.01303
 6.5  0.01024
 6.7  0.009666
 6.9  0.007282
 7.1  0.009005
 7.3  0.006705
 7.5  0.003783
 7.7  0.006358
 7.9  0.001623

Yes, that was my mistake, sorry. I was in a hurry and misspelled the file name. Here are the results of the benchmark:

@benchmark lazilytoml(data_file)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  78.300 μs …   3.976 ms  ┊ GC (min … max): 0.00% … 95.33%
 Time  (median):     80.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.568 μs ± 127.850 μs  ┊ GC (mean ± σ):  5.93% ±  4.03%

  ▇█▇▆▅▄▃▂▁  ▂▁▁▁▁▁                                            ▂
  ██████████████████▇▇▆▆▄▆▆▅▄▄▅▆█▇▆▅▆▅▆▅▇▅▅▄▅▇█▇▆▅▄▄▁▄▄▄▅▁▄▆▇▇ █
  78.3 μs       Histogram: log(frequency) by time       137 μs <

 Memory estimate: 83.27 KiB, allocs estimate: 2423.