Rafael means that you should write @btime test_julia_merge($diam1, $diam2) to interpolate the global variables diam1 and diam2 into the benchmark. It doesn't actually make a difference here.
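For reference, the only change is the $ prefix on the arguments; a minimal sketch with a made-up variable x:

julia> x = rand(1000);

julia> @btime sum(x);   # x is looked up as a non-constant global during the benchmark

julia> @btime sum($x);  # x is interpolated, so the benchmark sees a concrete value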
What's happening here is that CSV.read by default performs some optimizations when reading in the file, in particular pooling repetitive string columns; you can tell this from the column types:
julia> typeof.(eachcol(diamonds1))
7-element Vector{DataType}:
Vector{Int64} (alias for Array{Int64, 1})
Vector{Float64} (alias for Array{Float64, 1})
Vector{String} (alias for Array{String, 1})
Vector{String} (alias for Array{String, 1})
Vector{String} (alias for Array{String, 1})
Vector{Float64} (alias for Array{Float64, 1})
Vector{Float64} (alias for Array{Float64, 1})
versus
julia> typeof.(eachcol(diam1))
7-element Vector{DataType}:
Vector{Int64} (alias for Array{Int64, 1})
Vector{Float64} (alias for Array{Float64, 1})
PooledVector{String15, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String15, UInt32, 1, Array{UInt32, 1}})
PooledVector{String1, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String1, UInt32, 1, Array{UInt32, 1}})
PooledVector{String7, UInt32, Vector{UInt32}} (alias for PooledArrays.PooledArray{String7, UInt32, 1, Array{UInt32, 1}})
Vector{Float64} (alias for Array{Float64, 1})
Vector{Float64} (alias for Array{Float64, 1})
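The string columns in the second case are pooled: each distinct string is stored once and the column holds compact integer references into that pool. A minimal sketch of the idea, assuming PooledArrays is available (it is a dependency of CSV.jl; the variable name is made up):

julia> using PooledArrays

julia> pv = PooledArray(["Ideal", "Premium", "Ideal"]);  # two unique strings plus integer references

julia> eltype(pv)
String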
On my machine, for the DataFrames produced by CSVFiles, I see:
julia> @btime test_julia_merge($diamonds1, $diamonds2);
2.555 ms (349 allocations: 5.74 MiB)
and indeed, with the default options for CSV.read:
julia> @btime test_julia_merge($diam1, $diam2);
6.472 ms (377 allocations: 5.13 MiB)
but when disabling pooling:
julia> diam1_nopool = CSV.read("diamonds1.csv", DataFrame; pool = false);
julia> @btime test_julia_merge($diam1_nopool, $diam2);
2.541 ms (349 allocations: 5.95 MiB)
So it might be that your problem is too small to benefit from pooling. There's enough of a chance, though, that this is a missed optimization in DataFrames that I'll ping @bkamins to see whether it's expected.
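If you want to check on your own data whether the pooled columns are what is costing you, one way (just a sketch; diam1_unpooled is an illustrative name) is to collect every column into a plain Vector before joining:

julia> diam1_unpooled = mapcols(collect, diam1);  # collect turns each PooledVector into a plain Vector

julia> @btime test_julia_merge($diam1_unpooled, $diam2);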
EDIT: Below is a complete MWE which downloads the data, creates the two separate data sets, and reads them back in, in case people want to play with this:
Full MWE
julia> using BenchmarkTools, CSV, CSVFiles, DataFrames, Downloads
julia> full_table = CSV.read(Downloads.download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"), DataFrame);
julia> full_table.ID = 1:nrow(full_table);
julia> CSV.write("diamonds1.csv", select(full_table, :ID, :carat, :cut, :color, :clarity, :depth, :table));
julia> CSV.write("diamonds2.csv", select(full_table, :ID, :price, :x, :y, :z));
julia> diamonds1 = load("diamonds1.csv", spacedelim=false, header_exists=true) |> DataFrame;
julia> diamonds2 = load("diamonds2.csv", spacedelim=false, header_exists=true) |> DataFrame;
julia> diam1 = CSV.read("diamonds1.csv", DataFrame);
julia> diam2 = CSV.read("diamonds2.csv", DataFrame);
julia> function test_julia_merge(df1, df2)
           merged_df = outerjoin(df1, df2, on=:ID)
           return merged_df
       end;
julia> @btime test_julia_merge($diamonds1, diamonds2);
2.702 ms (349 allocations: 5.74 MiB)
julia> @btime test_julia_merge($diamonds1, $diamonds2);
2.555 ms (349 allocations: 5.74 MiB)
julia> @btime test_julia_merge($diam1, $diam2);
6.472 ms (377 allocations: 5.13 MiB)