CSV.jl vs DelimitedFiles vs NumPy

I’ve been working on a project where I need to read specific rows and columns from a data file. To determine the most efficient approach, I conducted benchmarks using CSV.jl, DelimitedFiles.jl, and NumPy in Python. The results were somewhat surprising, and I’m hoping to gain some insights from the community.

In my case, I noticed that NumPy and DelimitedFiles.jl perform similarly, with execution times around 300 microseconds. However, CSV.jl shows a significantly longer execution time, about 3 milliseconds, roughly ten times slower.

Here’s an example of the code I used for each library:

using CSV, DelimitedFiles, BenchmarkTools, PyCall

@benchmark a, b = CSV.File(data_file; skipto=1145, limit=40, comment="#", header=false, ignorerepeated=true, delim=' ') |> data -> (data.Column1, data.Column2)

@benchmark a, b = readdlm(data_file, skipstart=1145)[1:40, 1:2] |> x -> (x[:, 1], x[:, 2])

np = pyimport("numpy")
function python_code(path_to_file::String)
    # NumPy via PyCall, with the equivalent skip/limit settings
    z_p, pdz_p = np.genfromtxt(path_to_file, unpack=true, skip_header=1144, max_rows=40)
    return z_p, pdz_p
end

@benchmark a, b = python_code(data_file)

As you can see, I tried to keep the same structure across the three versions.
All the code is inside functions (I have not shown the wrappers for the Julia cases).
The final function must read the data at the rows I need and return two vectors.
How is it possible that CSV.jl is so slow?
Did I miss something?

ALSO, the allocations:
NumPy: 6 allocations, 3.55 KiB
DelimitedFiles: 617 allocations, 92 KiB
CSV: 78841 allocations, 1.28 MiB


I’m sure others will chime in with more useful answers, but this might be a bit of a sledgehammer-to-crack-a-nut situation: CSV.jl is great for multi-threaded reading of huge CSV files, and not necessarily targeted at ingesting tiny amounts of data.
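For context, a minimal sketch of the workload CSV.jl is tuned for, assuming a large uniform table in a hypothetical file big.csv:

using CSV, DataFrames

# CSV.jl can split the parsing of a large file across multiple tasks;
# the `ntasks` keyword caps how many are used.
df = CSV.read("big.csv", DataFrame; ntasks = Threads.nthreads())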


Could you provide the data, or at least the characteristics of data_file?
It seems to me that CSV.File is much (much) faster than readdlm.

I cannot provide the file,
but here is its structure:

#... data 1 percentile .....
0 1 2 3 4 5 6
(I don't need them, 1 row multiple columns)

# ... data 1...
    0.1000  1.825E-029
    0.3000  6.247E-016
    0.5000  3.227E-007
    0.7000  4.726E-008
    0.9000  3.678E-008
... (data that I need, multiple rows, 2 col)
...
#... data 1 percentile .....
0 1 2 3 4 5 6
(I don't need them, 1 row multiple columns)

#.... data 2 ....
(I don't need them)

... and so on

This is a test done on a numerical matrix (10^5 × 5):

julia> open("dlm.txt", "w") do io
           writedlm(io, rand(10^5,5))
       end

julia>  data_file =   raw"dlm.txt"
"dlm.txt"

julia> @benchmark a, b = CSV.File(data_file; header=false, skipto=1145, limit=40, comment="#", ignorerepeated=true, delim='\t') |> DataFrame |> x -> (x.Column1, x.Column2)
BenchmarkTools.Trial: 485 samples with 1 evaluation.
 Range (min … max):   8.370 ms … 21.919 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.767 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.314 ms ±  1.607 ms  ┊ GC (mean ± σ):  1.49% ± 4.32%

       ▇  ▁█▁
  ▄▂▂▅███████▃▄▄▄▅▆█▆▅▅▄▃▂▂▃▃▂▃▂▃▃▂▁▁▂▁▃▃▂▂▁▂▂▂▂▁▁▂▁▁▁▂▁▁▁▁▁▂ ▃
  8.37 ms         Histogram: frequency by time        16.8 ms <

 Memory estimate: 3.74 MiB, allocs estimate: 239396.

julia> @benchmark a, b = readdlm(data_file, '\t', skipstart=1145)[1:40, 1:2] |> x -> (x[:, 1], x[:, 2])
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  403.675 ms … 423.900 ms  ┊ GC (min … max): 0.80% … 0.62%
 Time  (median):     415.013 ms               ┊ GC (median):    0.63%
 Time  (mean ± σ):   413.840 ms ±   6.812 ms  ┊ GC (mean ± σ):  0.65% ± 0.16%

  ▁  ▁  ▁           ▁▁ ▁            ▁▁           █   ▁▁       ▁
  █▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁██▁█▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁█▁▁▁██▁▁▁▁▁▁▁█ ▁
  404 ms           Histogram: frequency by time          424 ms <

 Memory estimate: 54.46 MiB, allocs estimate: 1481306.

Could it be the structure of the file? It is not uniform and quite complex.

Could be. But we can only take your word for it unless you give us a more detailed description of the structure of your data:
how many lines?
how many columns?
what data type is in each column?

data.toml (29.1 KB)
Here is a copy of a data file that I no longer need.
It is a text file of 1200 lines that has the same structure I described.

julia> @benchmark a,b = CSV.File(data_file; skipto=1145,limit=40, comment="#", header=false,ignorerepeated=true, delim=' ') |> data -> (data.Column1, data.Column2)
BenchmarkTools.Trial: 1687 samples with 1 evaluation.
 Range (min … max):  2.755 ms …   6.096 ms  ┊ GC (min … max): 0.00% … 41.54%
 Time  (median):     2.888 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.960 ms ± 273.262 μs  ┊ GC (mean ± σ):  1.26% ±  4.94%

   ▄▇▇█▆▄▄▃▅▄▃▂ ▁                                             ▁
  ███████████████▇▇▇▆▅▁▄▄▅▁▄▁▁▁▁▁▁▁▁▁▄▅▁▁▁▁▁▁▁▁▁▁▁▁▄▅▅▅▅▅▆▆▇▅ █
  2.76 ms      Histogram: log(frequency) by time      4.42 ms <

 Memory estimate: 1.28 MiB, allocs estimate: 78841.

julia> @benchmark a, b = readdlm(data_file, skipstart = 1145)[1:40, 1:2] |> x -> (x[:, 1], x[:, 2])
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  238.100 μs …  3.206 ms  ┊ GC (min … max): 0.00% … 86.76%
 Time  (median):     249.100 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   258.061 μs ± 93.133 μs  ┊ GC (mean ± σ):  1.46% ±  3.66%

   ▂▂▆█▇▄▅▆▅▄▄▅▄▄▃▂▁▁                                          ▂
  █████████████████████▆▆▇▆▆▇▆▇▇▇▇▇▆▇▇▆▆▆▆▆▅▄▄▅▅▅▅▄▅▄▄▄▅▄▅▃▅▄▄ █
  238 μs        Histogram: log(frequency) by time       340 μs <

 Memory estimate: 92.02 KiB, allocs estimate: 615.

This is a weird file, with many extraneous columns. I think CSV.jl is probably spending a lot of time trying to figure out which columns exist and which do not. CSV.jl is best suited for complicated and long CSV files that are nonetheless tabular, in that they have the same number of columns in every row.

I think if performance really matters in this instance you should do some pre-processing so the files are more standardized.
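A minimal sketch of that pre-processing idea, assuming the block of interest starts at line 1145 and is 40 lines long (the helper standardize and the file clean.csv are hypothetical): extract the two-column block once into a clean file, which CSV.jl can then parse without guessing the layout.

using CSV

# One-off pre-processing: copy only the wanted two-column block
# into a clean file with a uniform layout.
function standardize(src::String, dst::String; skip = 1144, keep = 40)
    open(dst, "w") do out
        for l in Iterators.take(Iterators.drop(eachline(src), skip), keep)
            println(out, strip(l))
        end
    end
end

standardize(data_file, "clean.csv")
tbl = CSV.File("clean.csv"; header = false, delim = ' ', ignorerepeated = true)
a, b = tbl.Column1, tbl.Column2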


Could you measure this on your file?

function rtoml(weirdfile)
    tab = String[]
    open(weirdfile, "r") do lui
        for l in readlines(lui)
            # the two-column data lines happen to be exactly 22 characters long
            length(l) == 22 && push!(tab, l)
        end
    end
    tab
end

rtoml("data.toml")

function parsetoml(weirdfile)
    tab = Float64[]   # concretely typed; an untyped `[]` would give a Vector{Any}
    open(weirdfile, "r") do lui
        for l in readlines(lui)
            if length(l) == 22 && !startswith(l, "#")
                append!(tab, parse.(Float64, split(lstrip(l))))
            end
        end
    end
    reshape(tab, 2, :)'   # values were appended as z, pdz, z, pdz, …
end

Hi,
so I ran the functions on my PC with the file, and here are the results:

function rtoml(weirdfile)
    tab = String[]
    open(weirdfile, "r") do lui
        for l in readlines(lui)
            length(l) == 22 && push!(tab, l)
        end
    end
    tab
end
julia> @benchmark rtoml(data_file)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   96.900 μs …   4.115 ms  ┊ GC (min … max): 0.00% … 92.75%
 Time  (median):     100.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   109.177 μs ± 120.146 μs  ┊ GC (mean ± σ):  4.79% ±  4.23%

  ▃▅█▇▆▅▅▃▂▂▁▂▂▂▁▁                                              ▂
  ████████████████▇▇▇▆▇▅▆▇▇▇▇▆▆▆▄▇▆▆▄▆▅▅▄▁▅▄▆▁▆▄▄▅▅▄▅▃▄▄▅▄▁▄▄▄▅ █
  96.9 μs       Histogram: log(frequency) by time        173 μs <

 Memory estimate: 103.06 KiB, allocs estimate: 2432.

function parsetoml_fast(weirdfile)
    lines = readlines(weirdfile)
    tab = Float64[]

    for l in lines
        if length(l) == 22 && !startswith(l, "#")
            rn = parse.(Float64, split(strip(l)))
            append!(tab, rn)
        end
    end

    reshape(tab, 2, length(tab) ÷ 2)'
end


 @benchmark parsetoml_fast(data_file)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  409.700 μs …   3.920 ms  ┊ GC (min … max): 0.00% … 86.92%
 Time  (median):     426.200 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   452.244 μs ± 229.126 μs  ┊ GC (mean ± σ):  4.16% ±  7.08%

   ▁▁██▁
  ▄██████▇▅▄▄▃▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▂▂▂ ▃
  410 μs           Histogram: frequency by time          632 μs <

 Memory estimate: 323.73 KiB, allocs estimate: 4235.

However, only parsetoml_fast returns a matrix of the elements in the form I want, and even so it is still faster than CSV.jl.
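A small follow-up sketch (variable names are illustrative) for turning that matrix into the two vectors the original goal calls for:

M = parsetoml_fast(data_file)
a, b = M[:, 1], M[:, 2]   # the z and pdz columns as two separate vectors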

Just for fun, I wrote my own version of the function. Keep in mind that I am not an expert, but I find the results quite interesting:

function read_matrix_from_file(file_path, start_line, num_lines)
    tab = Array{Float64, 2}(undef, num_lines, 2)
    open(file_path, "r") do file
        for _ in 1:(start_line - 1)
            readline(file)  # skip lines until the start line
        end

        for i in 1:num_lines
            line = readline(file)
            tab[i, :] = parse.(Float64, split(strip(line)))
        end
    end

    # tab is already num_lines × 2; a trailing reshape(tab, 2, num_lines)'
    # would scramble the column-major values
    tab
end

@benchmark read_matrix_from_file(data_file, 1145, 40)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  58.300 μs …  2.433 ms  ┊ GC (min … max): 0.00% … 93.31%
 Time  (median):     60.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   65.575 μs ± 82.224 μs  ┊ GC (mean ± σ):  4.65% ±  3.62%

  ▅▇█▇▆▆▆▄▃▂▁▁      ▁▁▁▂▂▂▂▁                                  ▂
  ██████████████▇█████████████▇▇▆▆▆▆▆▅▄▅▄▄▄▃▃▃▁▆▆▆▆▆▆▆▅▄▆▄▁▅▅ █
  58.3 μs      Histogram: log(frequency) by time      96.3 μs <

 Memory estimate: 68.75 KiB, allocs estimate: 1322.

Even though my function makes more allocations than DelimitedFiles.jl (1322 vs 615), it is still faster.
How?

An alternative is to make almost everything lazy:

function lazilytoml(weirdfile,l2take=40,l2skip=1140)
    tab = Array{Float64, 2}(undef, l2take, 2)
    open(weirdfile, "r") do lui
        for (i,l) in enumerate(Iterators.take(Iterators.drop(eachline(lui),l2skip),l2take))
            tab[i,:].=parse.(Float64,split(lstrip(l)))
        end
    end
    tab
end

It does not work: the array broadcasting gives a shape mismatch.

I’m not sure if you are referring to the lazy version, or which file you applied it to.

julia> function lazilytoml(weirdfile,l2take=40,l2skip=1140)
           tab = Array{Float64, 2}(undef, l2take, 2)
           open(weirdfile, "r") do lui
               for (i,l) in enumerate(Iterators.take(Iterators.drop(eachline(lui),l2skip),l2take))
                   tab[i,:].=parse.(Float64,split(lstrip(l)))
               end
           end
           tab
       end
lazilytoml (generic function with 3 methods)

julia> using CSV, DataFrames, BenchmarkTools, DelimitedFiles, InlineStrings

julia> lazilytoml("data.toml")
40×2 Matrix{Float64}:
 0.1  9.238e-27
 0.3  1.954e-6
 0.5  6.797e-5
 0.7  0.0008787
 0.9  0.001167
 1.1  0.00387
 1.3  0.01073
 1.5  0.02365
 1.7  0.03189
 1.9  0.04394
 2.1  0.04676
 2.3  0.05888
 2.5  0.06186
 2.7  0.07058
 2.9  0.06109
 3.1  0.05216
 3.3  0.05766
 3.5  0.04022
 3.7  0.04809
 3.9  0.03633
 4.1  0.03223
 4.3  0.03391
 4.5  0.03049
 4.7  0.03354
 4.9  0.02944
 5.1  0.02683
 5.3  0.02231
 5.5  0.02401
 5.7  0.01801
 5.9  0.01859
 6.1  0.01313
 6.3  0.01303
 6.5  0.01024
 6.7  0.009666
 6.9  0.007282
 7.1  0.009005
 7.3  0.006705
 7.5  0.003783
 7.7  0.006358
 7.9  0.001623

Yes, that was my mistake, sorry. I was in a hurry and misspelled the file name. Here are the results of the benchmark:

@benchmark lazilytoml(data_file)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  78.300 μs …   3.976 ms  ┊ GC (min … max): 0.00% … 95.33%
 Time  (median):     80.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   88.568 μs ± 127.850 μs  ┊ GC (mean ± σ):  5.93% ±  4.03%

  ▇█▇▆▅▄▃▂▁  ▂▁▁▁▁▁                                            ▂
  ██████████████████▇▇▆▆▄▆▆▅▄▄▅▆█▇▆▅▆▅▆▅▇▅▅▄▅▇█▇▆▅▄▄▁▄▄▄▅▁▄▆▇▇ █
  78.3 μs       Histogram: log(frequency) by time       137 μs <

 Memory estimate: 83.27 KiB, allocs estimate: 2423.