Performance Report: Effect of Reading CSV file on Merging two DataFrames

nilshg · November 16, 2023, 10:30am

Rafael means that you should write @btime test_julia_merge($diam1, $diam2) to interpolate the global variables diam1 and diam2 into the benchmark. It doesn’t actually make a difference here.

What’s happening here is that CSV.read by default performs some optimizations when reading in the file, in particular pooling the string columns. You can tell this from the column types, first for the table loaded with CSVFiles:

julia> typeof.(eachcol(diamonds1))
7-element Vector{DataType}:
 Vector{Int64} (alias for Array{Int64, 1})
 Vector{Float64} (alias for Array{Float64, 1})
 Vector{String} (alias for Array{String, 1})
 Vector{String} (alias for Array{String, 1})
 Vector{String} (alias for Array{String, 1})
 Vector{Float64} (alias for Array{Float64, 1})
 Vector{Float64} (alias for Array{Float64, 1})

versus

julia> typeof.(eachcol(diam1))
7-element Vector{DataType}:
 Vector{Int64} (alias for Array{Int64, 1})
 Vector{Float64} (alias for Array{Float64, 1})
 PooledVector{String15, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String15, UInt32, 1, Array{UInt32, 1}})
 PooledVector{String1, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String1, UInt32, 1, Array{UInt32, 1}})
 PooledVector{String7, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String7, UInt32, 1, Array{UInt32, 1}})
 Vector{Float64} (alias for Array{Float64, 1})
 Vector{Float64} (alias for Array{Float64, 1})
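For readers who have not met pooled columns before, here is a minimal sketch of what a `PooledArray` stores. The sample values are made up for illustration, and it assumes the PooledArrays package is installed; it is not taken from the diamonds data.

```julia
using PooledArrays

# A plain vector stores one string per row.
plain = ["Ideal", "Premium", "Ideal", "Good", "Premium", "Ideal"]

# A pooled vector stores each distinct value once (the pool)
# plus one small integer reference per row (the refs).
pooled = PooledArray(plain)

pooled.pool   # the distinct values
pooled.refs   # one integer code per row, indexing into the pool

# Both representations compare equal element by element.
@assert pooled == plain
@assert length(pooled.pool) == 3
@assert length(pooled.refs) == length(plain)
```

Pooling mainly saves memory when a column has few distinct values; whether it also speeds up a given operation depends on how that operation handles pooled columns.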

On my machine, for the DataFrames produced by CSVFiles I see:

julia> @btime test_julia_merge($diamonds1, $diamonds2);
  2.555 ms (349 allocations: 5.74 MiB)

and indeed with the default options for CSV.read:

julia> @btime test_julia_merge($diam1, $diam2);
  6.472 ms (377 allocations: 5.13 MiB)

but when disabling pooling:

julia> diam1_nopool = CSV.read("diamonds1.csv", DataFrame; pool = false);

julia> @btime test_julia_merge($diam1_nopool, $diam2);
  2.541 ms (349 allocations: 5.95 MiB)

So it might be that your problem is too small to benefit from pooling. Still, there’s enough of a chance that this is a missed optimization in DataFrames that I’ll ping @bkamins to see whether it is expected.
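If you want to check the effect of pooling without re-reading the file, one option is to convert the pooled columns of an existing DataFrame back to plain vectors. The sketch below is only an illustration: `unpool` is a hypothetical helper, not a DataFrames function, and the small table stands in for `diam1`.

```julia
using DataFrames, PooledArrays

# Hypothetical helper (not part of DataFrames): copy a table and replace
# every PooledArray column with an ordinary Vector.
function unpool(df::AbstractDataFrame)
    out = copy(df)
    for name in names(out)
        col = out[!, name]
        if col isa PooledArray
            out[!, name] = collect(col)  # collect materializes a plain Array
        end
    end
    return out
end

# Small stand-in table; with the real data this would be diam1 from CSV.read.
df = DataFrame(ID = 1:4, cut = PooledArray(["Ideal", "Good", "Ideal", "Good"]))
df_plain = unpool(df)

typeof(df.cut)        # a PooledVector
typeof(df_plain.cut)  # Vector{String}
```

Passing `unpool(diam1)` to `test_julia_merge` in place of `diam1` should then exercise the same code path as the `pool = false` read above.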

EDIT: For anyone who wants to play with this, here is a complete MWE that downloads the data, creates the two separate data sets, and reads them back in:

Full MWE

julia> using BenchmarkTools, CSV, CSVFiles, DataFrames, Downloads

julia> full_table = CSV.read(Downloads.download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"), DataFrame);

julia> full_table.ID = 1:nrow(full_table);

julia> CSV.write("diamonds1.csv", select(full_table, :ID, :carat, :cut, :color, :clarity, :depth, :table));

julia> CSV.write("diamonds2.csv", select(full_table, :ID, :price, :x, :y, :z));

julia> diamonds1 = load("diamonds1.csv", spacedelim=false, header_exists=true) |> DataFrame;

julia> diamonds2 = load("diamonds2.csv", spacedelim=false, header_exists=true) |> DataFrame;

julia> diam1 = CSV.read("diamonds1.csv", DataFrame);

julia> diam2 = CSV.read("diamonds2.csv", DataFrame);

julia> function test_julia_merge(df1, df2)
           merged_df = outerjoin(df1, df2, on=:ID)
           return merged_df
       end;

julia> @btime test_julia_merge($diamonds1, diamonds2);
  2.702 ms (349 allocations: 5.74 MiB)

julia> @btime test_julia_merge($diamonds1, $diamonds2);
  2.555 ms (349 allocations: 5.74 MiB)

julia> @btime test_julia_merge($diam1, $diam2);
  6.472 ms (377 allocations: 5.13 MiB)
