Hi,
I was trying to merge two data frames, and I realised that how we read the CSV files has a bearing on the performance of the merge function.
using DataFrames, CSV, FileIO, CSVFiles, PyCall, BenchmarkTools
diamonds1=load("diamonds1.csv", spacedelim=false, header_exists=true) |> DataFrame;
diamonds2=load("diamonds2.csv", spacedelim=false, header_exists=true) |> DataFrame;
first(diamonds2,6)
6×5 DataFrame
 Row │ ID     price  x        y        z
     │ Int64  Int64  Float64  Float64  Float64
─────┼─────────────────────────────────────────
   1 │     1    326     3.95     3.98     2.43
   2 │     2    326     3.89     3.84     2.31
   3 │     3    327     4.05     4.07     2.31
   4 │     4    334     4.2      4.23     2.63
   5 │     5    335     4.34     4.35     2.75
   6 │     6    336     3.94     3.96     2.48
Now I read the same datasets using CSV.read:
diam1 = CSV.read("diamonds1.csv", DataFrame)
diam2 = CSV.read("diamonds2.csv", DataFrame);
Now I write a function to merge two data frames
function test_julia_merge(df1, df2)
    merged_df = outerjoin(df1, df2, on=:ID)
    return merged_df
end
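To make clear what the outer join on ID returns, here is a minimal sketch on two tiny data frames. The names `left` and `right` and all values in them are made up for illustration and are not taken from diamonds1.csv or diamonds2.csv.

```julia
using DataFrames

# Illustrative stand-ins for the real files (hypothetical values).
left  = DataFrame(ID = [1, 2, 3], price = [326, 326, 327])
right = DataFrame(ID = [2, 3, 4], x = [3.89, 4.05, 4.20])

# Same call as in test_julia_merge: keep every ID found in either table.
merged = outerjoin(left, right, on = :ID)

# IDs present in only one table get `missing` in the other table's columns.
println(merged)
```

In this sketch ID 1 has no match in `right` and ID 4 has no match in `left`, so the result has four rows, with one missing value in `x` and one in `price`.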
@benchmark test_julia_merge(diamonds1,diamonds2)
BenchmarkTools.Trial: 2336 samples with 1 evaluation.
 Range (min … max):  1.784 ms … 10.273 ms  ┊ GC (min … max): 0.00% … 70.07%
 Time  (median):     1.969 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.138 ms ± 495.263 μs ┊ GC (mean ± σ):  7.97% ± 13.29%
 1.78 ms  Histogram: log(frequency) by time  3.74 ms <
 Memory estimate: 5.74 MiB, allocs estimate: 381.
@benchmark test_julia_merge(diam1,diam2)
BenchmarkTools.Trial: 960 samples with 1 evaluation.
 Range (min … max):  4.632 ms … 18.102 ms  ┊ GC (min … max): 0.00% … 46.31%
 Time  (median):     4.976 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.208 ms ± 847.561 μs ┊ GC (mean ± σ):  2.19% ± 5.25%
 4.63 ms  Histogram: frequency by time  7.85 ms <
 Memory estimate: 5.13 MiB, allocs estimate: 409.
Clearly, CSV.read is affecting the performance of the merge significantly: the median time is 4.976 ms, versus 1.969 ms for the data frames read with load.
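Since both pairs of data frames hold the same data, one thing that might be worth checking (this is only a guess on my part, the benchmarks do not establish it) is whether the two readers produce different column types. Below is a minimal sketch of such a check. `column_types` is a hypothetical helper, not part of DataFrames, and the two small data frames are made-up stand-ins so the snippet runs without the CSV files.

```julia
using DataFrames

# Hypothetical helper: list each column's element type and storage type,
# so that two data frames with the same contents can be compared.
function column_types(df::AbstractDataFrame)
    return DataFrame(
        column       = names(df),
        element_type = [eltype(col) for col in eachcol(df)],
        storage_type = [typeof(col) for col in eachcol(df)],
    )
end

# Made-up stand-ins: same values, but one ID column allows missing values.
a = DataFrame(ID = [1, 2, 3], price = [326, 326, 327])
b = DataFrame(ID = Union{Missing, Int}[1, 2, 3], price = [326, 326, 327])

types_a = column_types(a)
types_b = column_types(b)

println(types_a)
println(types_b)
```

Running `column_types(diamonds1)` and `column_types(diam1)` side by side would show whether the key column ID is stored differently in the two cases.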
Then I compared the performance of the merge against Python. I wrote the code using PyCall in the same notebook.
# Measuring time for merging two data frames in Python:
py"""
import pandas as pd
df1 = pd.read_csv("diamonds1.csv")
df2 = pd.read_csv("diamonds2.csv")
# Merge DataFrames on the 'ID' column
def merge_df():
    return pd.merge(df1, df2, on='ID', how='outer')
"""
merge_py_df = py"merge_df"
res_py = @benchmark $merge_py_df()
BenchmarkTools.Trial: 1220 samples with 1 evaluation.
 Range (min … max):  3.751 ms … 28.986 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.995 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.093 ms ± 783.180 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
 3.75 ms  Histogram: frequency by time  5.63 ms <
 Memory estimate: 128 bytes, allocs estimate: 2.
Clearly, Julia's performance is poorer than Python's when we are using CSV.read (median 4.976 ms versus 3.995 ms).
Any thoughts on this would be a great help.
Best regards,
Sourish