Performance Report: Effect of Reading CSV Files on Merging Two DataFrames

Hi

I was trying to merge two data frames, and I realised that how we read the CSV files has a bearing on the performance of the merge itself.

using DataFrames, CSV, FileIO, CSVFiles, PyCall, BenchmarkTools
diamonds1=load("diamonds1.csv", spacedelim=false, header_exists=true) |> DataFrame;
diamonds2=load("diamonds2.csv", spacedelim=false, header_exists=true) |> DataFrame;
first(diamonds2,6)

6×5 DataFrame
Row	ID	price	x	y	z
Int64	Int64	Float64	Float64	Float64
1	1	326	3.95	3.98	2.43
2	2	326	3.89	3.84	2.31
3	3	327	4.05	4.07	2.31
4	4	334	4.2	4.23	2.63
5	5	335	4.34	4.35	2.75
6	6	336	3.94	3.96	2.48

Now I read the same datasets using CSV.read:

diam1 = CSV.read("diamonds1.csv", DataFrame)
diam2 = CSV.read("diamonds2.csv", DataFrame);

Now I write a function to merge the two data frames:

function test_julia_merge(df1,df2)
    merged_df = outerjoin(df1, df2, on=:ID)
    return merged_df
end
@benchmark test_julia_merge(diamonds1,diamonds2) 
BenchmarkTools.Trial: 2336 samples with 1 evaluation.
 Range (min … max):  1.784 ms …  10.273 ms  ┊ GC (min … max): 0.00% … 70.07%
 Time  (median):     1.969 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.138 ms ± 495.263 μs  ┊ GC (mean ± σ):  7.97% ± 13.29%

   ▂▆▆█▇▇▅▃▃▂▁▁                            ▁  ▂               ▁
  ██████████████▆▆▄▆▄▁▁▁▁▁▄▁▇▄█▆▆▇▅▅▆▆▁▇▇█▇████▇██▇█▇▇▅▇▆▆▇▆▆ █
  1.78 ms      Histogram: log(frequency) by time      3.74 ms <

 Memory estimate: 5.74 MiB, allocs estimate: 381.
@benchmark test_julia_merge(diam1,diam2) 

BenchmarkTools.Trial: 960 samples with 1 evaluation.
 Range (min … max):  4.632 ms …  18.102 ms  ┊ GC (min … max): 0.00% … 46.31%
 Time  (median):     4.976 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.208 ms ± 847.561 μs  ┊ GC (mean ± σ):  2.19% ±  5.25%

     ▂▃█▄▁
  ▃▄▆█████▅▄▄▄▄▄▄▄▄▃▃▃▂▃▂▂▂▂▂▃▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▂▂▁▂▂▁▂▂ ▃
  4.63 ms         Histogram: frequency by time        7.85 ms <

 Memory estimate: 5.13 MiB, allocs estimate: 409.

Clearly, CSV.read is affecting the performance of the merge significantly.

Then I compared the performance of the merge against Python, writing the code using PyCall in the same notebook.

# Measuring time for merging two data frames in Python:
py"""
import pandas as pd

df1 = pd.read_csv("diamonds1.csv")
df2 = pd.read_csv("diamonds2.csv")
    

# Merge DataFrames on the 'ID' column
def merge_df():
    return pd.merge(df1, df2, on='ID', how='outer')

"""
merge_py_df = py"merge_df"
res_py = @benchmark $merge_py_df()
BenchmarkTools.Trial: 1220 samples with 1 evaluation.
 Range (min … max):  3.751 ms …  28.986 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.995 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.093 ms ± 783.180 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▆█▇▇▄▆▄▁
  ▃▇█████████▇▅▅▅▆▄▄▄▄▂▃▃▃▃▃▃▂▂▂▂▃▂▂▁▁▂▂▂▁▁▁▁▁▁▂▂▂▁▁▂▁▁▁▁▂▂▂▂ ▃
  3.75 ms         Histogram: frequency by time        5.63 ms <

 Memory estimate: 128 bytes, allocs estimate: 2.

Clearly, Julia’s performance is poorer than Python’s when we are using CSV.read.

Any thoughts on this would be a great help.

Best regards,
Sourish

I don’t think anyone will be able to give any useful advice without the data sets you are using (is this the usual diamonds data set used in Kaggle competitions?).


Yes - it is available in the ggplot2 package in R and in RDatasets.jl as well.


Can you show how this is split into the two data sets you have? The link I posted is just one CSV file holding the complete data set.

What are the results when the variables are interpolated?

@benchmark test_julia_merge($diamonds1, $diamonds2)
@benchmark test_julia_merge($diam1, $diam2)

I split the dataset into the following two tables.

first(diam1,6)
6×7 DataFrame
Row	ID	carat	cut	color	clarity	depth	table
Int64	Float64	String15	String1	String7	Float64	Float64
1	1	0.23	Ideal	E	SI2	61.5	55.0
2	2	0.21	Premium	E	SI1	59.8	61.0
3	3	0.23	Good	E	VS1	56.9	65.0
4	4	0.29	Premium	I	VS2	62.4	58.0
5	5	0.31	Good	J	SI2	63.3	58.0
6	6	0.24	Very Good	J	VVS2	62.8	57.0

and

first(diam2,6)

6×5 DataFrame
Row	ID	price	x	y	z
Int64	Int64	Float64	Float64	Float64
1	1	326	3.95	3.98	2.43
2	2	326	3.89	3.84	2.31
3	3	327	4.05	4.07	2.31
4	4	334	4.2	4.23	2.63
5	5	335	4.34	4.35	2.75
6	6	336	3.94	3.96	2.48

Sorry - I am not sure what your point is!

According to the BenchmarkTools.jl documentation, interpolation is recommended to avoid the problems of benchmarking with globals.
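
(For anyone unfamiliar with the pattern, here is a minimal illustration, not specific to this thread: a benchmarked expression that references a non-constant global pays a type-inference and dispatch cost on every call, which $-interpolation avoids.)

using BenchmarkTools

x = rand(1000)   # a non-constant global

@btime sum(x)    # x is looked up as an untyped global inside the benchmark
@btime sum($x)   # $x splices the value in, so it behaves like a local variable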

Rafael means that you should write @btime test_julia_merge($diam1, $diam2) to interpolate the global variables diam1 and diam2 into the benchmark. It doesn’t actually make a difference here.

What’s happening here is that CSV.read by default performs some optimizations when reading in the file - you can tell this from the column types:

julia> typeof.(eachcol(diamonds1))
7-element Vector{DataType}:
 Vector{Int64} (alias for Array{Int64, 1})
 Vector{Float64} (alias for Array{Float64, 1})
 Vector{String} (alias for Array{String, 1})
 Vector{String} (alias for Array{String, 1})
 Vector{String} (alias for Array{String, 1})
 Vector{Float64} (alias for Array{Float64, 1})
 Vector{Float64} (alias for Array{Float64, 1})

versus

julia> typeof.(eachcol(diam1))
7-element Vector{DataType}:
 Vector{Int64} (alias for Array{Int64, 1})
 Vector{Float64} (alias for Array{Float64, 1})
 PooledVector{String15, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String15, UInt32, 1, Array{UInt32, 1}})
 PooledVector{String1, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String1, UInt32, 1, Array{UInt32, 1}})
 PooledVector{String7, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String7, UInt32, 1, Array{UInt32, 1}})
 Vector{Float64} (alias for Array{Float64, 1})
 Vector{Float64} (alias for Array{Float64, 1})

On my machine I see for the DataFrames produced by CSVFiles:

julia> @btime test_julia_merge($diamonds1, $diamonds2);
  2.555 ms (349 allocations: 5.74 MiB)

and indeed with the default options for CSV.read:

julia> @btime test_julia_merge($diam1, $diam2);
  6.472 ms (377 allocations: 5.13 MiB)

but when disabling pooling:

julia> diam1_nopool = CSV.read("diamonds1.csv", DataFrame; pool = false);

julia> @btime test_julia_merge($diam1_nopool, $diam2);
  2.541 ms (349 allocations: 5.95 MiB)

So it might be that your problem is too small to benefit from pooling. There’s enough of a chance here though that this is a missed optimization in DataFrames that I’ll ping @bkamins to see if this is expected.

EDIT: I’ll add a complete MWE below, which downloads the data, creates the two separate data sets, and reads them back in, in case people want to play with this:

Full MWE

julia> using BenchmarkTools, CSV, CSVFiles, DataFrames, Downloads

julia> full_table = CSV.read(Downloads.download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"), DataFrame);

julia> full_table.ID = 1:nrow(full_table);

julia> CSV.write("diamonds1.csv", select(full_table, :ID, :carat, :cut, :color, :clarity, :depth, :table));

julia> CSV.write("diamonds2.csv", select(full_table, :ID, :price, :x, :y, :z));

julia> diamonds1 = load("diamonds1.csv", spacedelim=false, header_exists=true) |> DataFrame;

julia> diamonds2 = load("diamonds2.csv", spacedelim=false, header_exists=true) |> DataFrame;

julia> diam1 = CSV.read("diamonds1.csv", DataFrame);

julia> diam2 = CSV.read("diamonds2.csv", DataFrame);

julia> function test_julia_merge(df1, df2)
           merged_df = outerjoin(df1, df2, on=:ID)
           return merged_df
       end;

julia> @btime test_julia_merge($diamonds1, diamonds2);
  2.702 ms (349 allocations: 5.74 MiB)

julia> @btime test_julia_merge($diamonds1, $diamonds2);
  2.555 ms (349 allocations: 5.74 MiB)

julia> @btime test_julia_merge($diam1, $diam2);
  6.472 ms (377 allocations: 5.13 MiB)


@nilshg Thanks. This solves the problem.

Best regards,
Sourish

There’s some discussion in this epic issue:

but understandably it seems to mostly focus on the case in which the pooled vectors are actually the vectors being joined on. I’m actually a little surprised that pooling has such a big effect, but maybe this is related to the fact that the join turns all column eltypes from T into Union{T, Missing}, which may be more costly for pooled vectors?
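
(To illustrate the eltype promotion, here is a minimal sketch, not from the original posts; the toy column names are made up:)

using DataFrames, PooledArrays

left  = DataFrame(ID = 1:3, cut = PooledArray(["Ideal", "Good", "Ideal"]))
right = DataFrame(ID = 2:4)

eltype(left.cut)                     # String
joined = outerjoin(left, right, on = :ID)
eltype(joined.cut)                   # Union{Missing, String}: unmatched rows get missing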

This also seems to be an issue with outerjoin specifically. With leftjoin (which seems more appropriate in this example, although I appreciate it’s probably not your real application), I get:

julia> @btime test_julia_merge2($diamonds1, $diamonds2);
  2.017 ms (332 allocations: 5.58 MiB)

julia> @btime test_julia_merge2($diam1, $diam2);
  1.722 ms (356 allocations: 4.96 MiB)
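
(test_julia_merge2 is not defined anywhere in the thread; presumably it is just the leftjoin variant of the function above, along these lines:)

function test_julia_merge2(df1, df2)
    merged_df = leftjoin(df1, df2, on=:ID)   # leftjoin instead of outerjoin
    return merged_df
end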

Okay - I see, that makes sense.

Yes - I think pooling is the issue here. You can disable pooling in CSV.read if you do not need it.
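
(A minimal sketch of the two main forms of the pool keyword; the threshold form is per the CSV.jl docs, and the file name is taken from the example above:)

# disable pooling for all columns
diam1_nopool = CSV.read("diamonds1.csv", DataFrame; pool = false)

# or pool only columns with relatively few distinct values, e.g. pool a
# column only if fewer than 10% of its values are unique
diam1_lowpool = CSV.read("diamonds1.csv", DataFrame; pool = 0.1)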

I’ve shown above that it is indeed the issue; the question is whether this is expected and, if not, whether something can be done about it. Seems like a bit of a footgun that the default optimizations in a standard data science workflow lead to a 3x regression in outerjoin.

pool kwarg documentation · Issue #1090 · JuliaData/CSV.jl · GitHub is related, so can you please comment there.

The main issue is that for e.g. String1, pooling is most likely not good, as it will use more memory and potentially be slower. For wide StringX types pooling will save memory.
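
(A rough way to see the memory trade-off, a sketch not from the thread, using Base.summarysize:)

using InlineStrings, PooledArrays

n = 100_000
colors = rand([String1("E"), String1("I"), String1("J")], n)

Base.summarysize(colors)               # Vector{String1}: sizeof(String1) bytes per element, stored inline
Base.summarysize(PooledArray(colors))  # a 4-byte UInt32 ref per element, plus the (small) pool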


also related: Always keep missing in pool by bkamins · Pull Request #65 · JuliaData/PooledArrays.jl · GitHub


So is it fair to say, in summary, that the performance hit is expected; that people who want to do outerjoins (or other operations which have to change element types) should not use PooledArrays; but that, given PooledArrays are beneficial to most users in most situations, we are happy to have them as the default in CSV.File?

Roughly yes. But since the case of outerjoin is common, in Always keep missing in pool by bkamins · Pull Request #65 · JuliaData/PooledArrays.jl · GitHub I proposed to always keep missing as a technical option in the pool, so that “allowmissing”-style operations would not lead to a recalculation of the pool.
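
(A sketch of the kind of operation in question, not from the thread: widening a pooled column so that it can hold missing.)

using DataFrames, PooledArrays

df = DataFrame(cut = PooledArray(["Ideal", "Good", "Ideal"]))

allowmissing!(df, :cut)   # widens the column eltype to Union{Missing, String};
eltype(df.cut)            # the PR above is about making this widening cheap by
                          # always reserving a slot for missing in the pool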
