Performance Report: Effect of Reading CSV Files on Merging Two DataFrames

Hi

I was trying to merge two data frames, and I realised that how we read the CSV files has a bearing on the performance of the merge itself.

using DataFrames, CSV, FileIO, CSVFiles, PyCall, BenchmarkTools
diamonds1=load("diamonds1.csv", spacedelim=false, header_exists=true) |> DataFrame;
diamonds2=load("diamonds2.csv", spacedelim=false, header_exists=true) |> DataFrame;
first(diamonds2,6)

6×5 DataFrame
Row	ID	price	x	y	z
Int64	Int64	Float64	Float64	Float64
1	1	326	3.95	3.98	2.43
2	2	326	3.89	3.84	2.31
3	3	327	4.05	4.07	2.31
4	4	334	4.2	4.23	2.63
5	5	335	4.34	4.35	2.75
6	6	336	3.94	3.96	2.48

Now I read the same datasets using CSV.read:

diam1 = CSV.read("diamonds1.csv", DataFrame)
diam2 = CSV.read("diamonds2.csv", DataFrame);

Now I write a function to merge the two data frames:

function test_julia_merge(df1,df2)
    merged_df = outerjoin(df1, df2, on=:ID)
    return merged_df
end
@benchmark test_julia_merge(diamonds1,diamonds2) 
BenchmarkTools.Trial: 2336 samples with 1 evaluation.
 Range (min … max):  1.784 ms …  10.273 ms  ┊ GC (min … max): 0.00% … 70.07%
 Time  (median):     1.969 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.138 ms ± 495.263 μs  ┊ GC (mean ± σ):  7.97% ± 13.29%

   ▂▆▆█▇▇▅▃▃▂▁▁                            ▁  ▂               ▁
  ██████████████▆▆▄▆▄▁▁▁▁▁▄▁▇▄█▆▆▇▅▅▆▆▁▇▇█▇████▇██▇█▇▇▅▇▆▆▇▆▆ █
  1.78 ms      Histogram: log(frequency) by time      3.74 ms <

 Memory estimate: 5.74 MiB, allocs estimate: 381.
@benchmark test_julia_merge(diam1,diam2) 

BenchmarkTools.Trial: 960 samples with 1 evaluation.
 Range (min … max):  4.632 ms …  18.102 ms  ┊ GC (min … max): 0.00% … 46.31%
 Time  (median):     4.976 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.208 ms ± 847.561 μs  ┊ GC (mean ± σ):  2.19% ±  5.25%

     ▂▃█▄▁
  ▃▄▆█████▅▄▄▄▄▄▄▄▄▃▃▃▂▃▂▂▂▂▂▃▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▂▂▁▂▂▁▂▂ ▃
  4.63 ms         Histogram: frequency by time        7.85 ms <

 Memory estimate: 5.13 MiB, allocs estimate: 409.

Clearly, CSV.read is affecting the performance of the merge significantly.

Then I compared the performance of the merge against Python, writing the code using PyCall in the same notebook.

# Measuring time for merging two data frames in Python:
py"""
import pandas as pd

df1 = pd.read_csv("diamonds1.csv")
df2 = pd.read_csv("diamonds2.csv")
    

# Merge DataFrames on the 'ID' column
def merge_df():
    return pd.merge(df1, df2, on='ID', how='outer')

"""
merge_py_df = py"merge_df"
res_py = @benchmark $merge_py_df()
BenchmarkTools.Trial: 1220 samples with 1 evaluation.
 Range (min … max):  3.751 ms …  28.986 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.995 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.093 ms ± 783.180 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▆█▇▇▄▆▄▁
  ▃▇█████████▇▅▅▅▆▄▄▄▄▂▃▃▃▃▃▃▂▂▂▂▃▂▂▁▁▂▂▂▁▁▁▁▁▁▂▂▂▁▁▂▁▁▁▁▂▂▂▂ ▃
  3.75 ms         Histogram: frequency by time        5.63 ms <

 Memory estimate: 128 bytes, allocs estimate: 2.

Clearly, Julia’s performance is poorer than Python’s when we are using CSV.read.

Any thoughts on this would be a great help.

Best regards,
Sourish

I don’t think anyone will be able to give any useful advice without the data sets you are using (is this the usual diamonds data set used in Kaggle competitions?).


Yes - it is available in the ggplot2 package in R and in RDatasets.jl as well.


Can you show how this is split into the two data sets you have? The link I posted is just one CSV file holding the complete data set.

What are the results when the variables are interpolated?

@benchmark test_julia_merge($diamonds1, $diamonds2)
@benchmark test_julia_merge($diam1, $diam2)

I split the dataset into the following two tables.

first(diam1,6)
6×7 DataFrame
Row	ID	carat	cut	color	clarity	depth	table
Int64	Float64	String15	String1	String7	Float64	Float64
1	1	0.23	Ideal	E	SI2	61.5	55.0
2	2	0.21	Premium	E	SI1	59.8	61.0
3	3	0.23	Good	E	VS1	56.9	65.0
4	4	0.29	Premium	I	VS2	62.4	58.0
5	5	0.31	Good	J	SI2	63.3	58.0
6	6	0.24	Very Good	J	VVS2	62.8	57.0

and

first(diam2,6)

6×5 DataFrame
Row	ID	price	x	y	z
Int64	Int64	Float64	Float64	Float64
1	1	326	3.95	3.98	2.43
2	2	326	3.89	3.84	2.31
3	3	327	4.05	4.07	2.31
4	4	334	4.2	4.23	2.63
5	5	335	4.34	4.35	2.75
6	6	336	3.94	3.96	2.48

Sorry - I am not sure what your point is!

According to the BenchmarkTools.jl documentation, interpolation is recommended to avoid the problems of benchmarking with globals.
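
(For anyone unfamiliar with the pattern, here is a minimal illustration, not specific to this thread: a benchmarked expression that references a non-constant global pays a type-inference and dispatch cost on every call, which $-interpolation avoids.)

using BenchmarkTools

x = rand(1000)   # a non-constant global

@btime sum(x)    # x is looked up as an untyped global inside the benchmark
@btime sum($x)   # $x splices the value in, so it behaves like a local variable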

Rafael means that you should write @btime test_julia_merge($diam1, $diam2) to interpolate the global variables diam1 and diam2 into the benchmark. It doesn’t actually make a difference here.

What’s happening here is that CSV.read by default performs some optimizations when reading in the file - you can tell this from the column types:

julia> typeof.(eachcol(diamonds1))
7-element Vector{DataType}:
 Vector{Int64} (alias for Array{Int64, 1})
 Vector{Float64} (alias for Array{Float64, 1})
 Vector{String} (alias for Array{String, 1})
 Vector{String} (alias for Array{String, 1})
 Vector{String} (alias for Array{String, 1})
 Vector{Float64} (alias for Array{Float64, 1})
 Vector{Float64} (alias for Array{Float64, 1})

versus

julia> typeof.(eachcol(diam1))
7-element Vector{DataType}:
 Vector{Int64} (alias for Array{Int64, 1})
 Vector{Float64} (alias for Array{Float64, 1})
 PooledVector{String15, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String15, UInt32, 1, Array{UInt32, 1}})
 PooledVector{String1, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String1, UInt32, 1, Array{UInt32, 1}})
 PooledVector{String7, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String7, UInt32, 1, Array{UInt32, 1}})
 Vector{Float64} (alias for Array{Float64, 1})
 Vector{Float64} (alias for Array{Float64, 1})

On my machine I see for the DataFrames produced by CSVFiles:

julia> @btime test_julia_merge($diamonds1, $diamonds2);
  2.555 ms (349 allocations: 5.74 MiB)

and indeed with the default options for CSV.read:

julia> @btime test_julia_merge($diam1, $diam2);
  6.472 ms (377 allocations: 5.13 MiB)

but when disabling pooling:

julia> diam1_nopool = CSV.read("diamonds1.csv", DataFrame; pool = false);

julia> @btime test_julia_merge($diam1_nopool, $diam2);
  2.541 ms (349 allocations: 5.95 MiB)

So it might be that your problem is too small to benefit from pooling. There’s enough of a chance here though that this is a missed optimization in DataFrames that I’ll ping @bkamins to see if this is expected.

EDIT: I’ll add a complete MWE below, which downloads the data, creates the two separate data sets, and reads them back in, in case people want to play with this:

Full MWE

julia> using BenchmarkTools, CSV, CSVFiles, DataFrames, Downloads

julia> full_table = CSV.read(Downloads.download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"), DataFrame);

julia> full_table.ID = 1:nrow(full_table);

julia> CSV.write("diamonds1.csv", select(full_table, :ID, :carat, :cut, :color, :clarity, :depth, :table));

julia> CSV.write("diamonds2.csv", select(full_table, :ID, :price, :x, :y, :z));

julia> diamonds1 = load("diamonds1.csv", spacedelim=false, header_exists=true) |> DataFrame;

julia> diamonds2 = load("diamonds2.csv", spacedelim=false, header_exists=true) |> DataFrame;

julia> diam1 = CSV.read("diamonds1.csv", DataFrame);

julia> diam2 = CSV.read("diamonds2.csv", DataFrame);

julia> function test_julia_merge(df1, df2)
           merged_df = outerjoin(df1, df2, on=:ID)
           return merged_df
       end;

julia> @btime test_julia_merge($diamonds1, diamonds2);
  2.702 ms (349 allocations: 5.74 MiB)

julia> @btime test_julia_merge($diamonds1, $diamonds2);
  2.555 ms (349 allocations: 5.74 MiB)

julia> @btime test_julia_merge($diam1, $diam2);
  6.472 ms (377 allocations: 5.13 MiB)


@nilshg Thanks. This solves the problem.

Best regards,
Sourish

There’s some discussion in this epic issue:

but understandably it seems to mostly focus on the case in which the pooled vectors are actually the vectors being joined on. I’m actually a little surprised that pooling has such a big effect, but maybe this is related to the fact that the join turns all column eltypes from T into Union{T, Missing}, which may be more costly for pooled vectors?
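
(To illustrate the eltype promotion, here is a minimal sketch, not from the original posts; the toy column names are made up:)

using DataFrames, PooledArrays

left  = DataFrame(ID = 1:3, cut = PooledArray(["Ideal", "Good", "Ideal"]))
right = DataFrame(ID = 2:4)

eltype(left.cut)                     # String
joined = outerjoin(left, right, on = :ID)
eltype(joined.cut)                   # Union{Missing, String}: unmatched rows get missing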

This also seems to be an issue with outerjoin specifically. With leftjoin (which seems more appropriate in this example, although I appreciate it’s probably not your real application), I get:

julia> @btime test_julia_merge2($diamonds1, $diamonds2);
  2.017 ms (332 allocations: 5.58 MiB)

julia> @btime test_julia_merge2($diam1, $diam2);
  1.722 ms (356 allocations: 4.96 MiB)
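
(test_julia_merge2 is not defined anywhere in the thread; presumably it is just the leftjoin variant of the function above, along these lines:)

function test_julia_merge2(df1, df2)
    merged_df = leftjoin(df1, df2, on=:ID)   # leftjoin instead of outerjoin
    return merged_df
end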

Okay - I see, that makes sense.

Yes - I think pooling is the issue here. You can disable pooling in CSV.read if you do not need it.
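
(A minimal sketch of the two main forms of the pool keyword; the threshold form is per the CSV.jl docs, and the file name is taken from the example above:)

# disable pooling for all columns
diam1_nopool = CSV.read("diamonds1.csv", DataFrame; pool = false)

# or pool only columns with relatively few distinct values, e.g. pool a
# column only if fewer than 10% of its values are unique
diam1_lowpool = CSV.read("diamonds1.csv", DataFrame; pool = 0.1)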

I’ve shown above that it is indeed the issue; the question is whether this is expected and, if not, whether something can be done about it. Seems like a bit of a footgun that the default optimizations in a standard data science workflow lead to a 3x regression in outerjoin.

pool kwarg documentation · Issue #1090 · JuliaData/CSV.jl · GitHub is related, so can you please comment there.

The main issue is that for e.g. String1, pooling is most likely not good, as it will use more memory and potentially be slower. For wide StringX types pooling will save memory.
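
(A rough way to see the memory trade-off, a sketch not from the thread, using Base.summarysize:)

using InlineStrings, PooledArrays

n = 100_000
colors = rand([String1("E"), String1("I"), String1("J")], n)

Base.summarysize(colors)               # Vector{String1}: sizeof(String1) bytes per element, stored inline
Base.summarysize(PooledArray(colors))  # a 4-byte UInt32 ref per element, plus the (small) pool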


also related: Always keep missing in pool by bkamins · Pull Request #65 · JuliaData/PooledArrays.jl · GitHub


So is it fair to say, in summary, that the performance hit is expected; that people who want to do outerjoins (or other operations which have to change element types) should not use PooledArrays; but that, given PooledArrays are beneficial to most users in most situations, we are happy to have them as the default in CSV.File?

Roughly yes. But since the case of outerjoin is common, in Always keep missing in pool by bkamins · Pull Request #65 · JuliaData/PooledArrays.jl · GitHub I proposed to always keep missing as a technical option in the pool, so that “allowmissing”-style operations would not lead to a recalculation of the pool.
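
(A sketch of the kind of operation in question, not from the thread: widening a pooled column so that it can hold missing.)

using DataFrames, PooledArrays

df = DataFrame(cut = PooledArray(["Ideal", "Good", "Ideal"]))

allowmissing!(df, :cut)   # widens the column eltype to Union{Missing, String};
eltype(df.cut)            # the PR above is about making this widening cheap by
                          # always reserving a slot for missing in the pool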
