This benchmark shows that DataFrames is much slower and more verbose than data.table:
https://h2oai.github.io/db-benchmark/
I was wondering if DataFrames could reach data.table’s performance in future.
From what I understand, a ton (of potential). First of all, it isn’t always that far behind in those benchmarks. Second of all, last I remember the DataFrames.jl squad were thinking about how to best optimize grouping operations, including when it might be appropriate to parallelize automatically. For example, if you are just taking the sum or mean of a column within a group, as most of those benchmarks do, that can obviously just be done in parallel.
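To illustrate why such grouped reductions parallelize so naturally, here is a minimal sketch of a threaded per-group sum. This is purely illustrative, not how DataFrames.jl actually implements grouping; grouped_sum and its arguments are made-up names.

using Base.Threads

# Sum v within each group, where groups[i] is the group index (1:ngroups) of row i.
# Illustrative sketch only, not the DataFrames.jl implementation.
function grouped_sum(v::AbstractVector, groups::Vector{Int}, ngroups::Int)
    nchunks = nthreads()
    # one partial-result buffer per chunk, so no two tasks write to the same slot
    partials = [zeros(ngroups) for _ in 1:nchunks]
    chunks = collect(Iterators.partition(eachindex(v), cld(length(v), nchunks)))
    @threads for c in 1:length(chunks)
        for i in chunks[c]
            partials[c][groups[i]] += v[i]
        end
    end
    reduce(+, partials)  # combine the per-chunk partial sums
end

Each chunk accumulates into its own buffer, so the row loop needs no locks; the only serial work left is the final reduction over a handful of small vectors.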
Given what happened with CSV.jl (it was no good on benchmarks → now it is the fastest CSV reader out there), I think it is only a matter of time before DataFrames.jl steps out ahead. But the team isn’t solely focused on performance; usability is important too, so lots of thought and planning takes place before blindly implementing new things that might speed it up.
This benchmark site is quite good imho, but it only covers two data frame functionalities: group by and join.
For the former, DataFrames.jl is already quite good (usually top 3), but data.table (which has received much more optimization work) is still a bit ahead. Join performance is something the DataFrames.jl developers are actively working on.
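For reference, the grouped reduction that the benchmark measures looks roughly like this in DataFrames.jl (the column names and sizes here are made up):

using DataFrames

df = DataFrame(id = rand(1:100, 10^6), v = rand(10^6))

# group the rows by :id and sum :v within each group
combine(groupby(df, :id), :v => sum => :v_sum)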
Other important data frame functionalities not tested by this benchmark are column transformations, e.g.

df.c = df.a .* df.b

These are pure Julia array operations in DataFrames.jl, and they are very fast (at least on par with NumPy; see NumpyJuliaPerformance.jl on GitHub).

Note that faster joins have already started to arrive on master: https://github.com/JuliaData/DataFrames.jl/pull/2612
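For context, the kind of operation that PR speeds up looks like this (illustrative tables):

using DataFrames

left  = DataFrame(id = 1:5, x = rand(5))
right = DataFrame(id = 3:7, y = rand(5))

# keep only the rows whose :id appears in both tables
innerjoin(left, right, on = :id)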
Why are Julia arrays fast but DataFrames not so fast?
I’m not sure what you mean by this, but DataFrames are just collections of arrays. Everything implemented in DataFrames is implemented on base Julia arrays. All the operations in DataFrames, like
transform(df, [:x, :y] => ((x, y) -> x .+ y .- cor(x, y)) => :z)
will be just as fast as doing
df.x .+ df.y .- cor(df.x, df.y)
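If you want to check that claim yourself, here is a sketch using BenchmarkTools.jl (the data is arbitrary):

using DataFrames, Statistics, BenchmarkTools

df = DataFrame(x = rand(10^6), y = rand(10^6))

# the same computation via the DataFrames mini-language and on the raw columns
@btime transform($df, [:x, :y] => ((x, y) -> x .+ y .- cor(x, y)) => :z)
@btime $df.x .+ $df.y .- cor($df.x, $df.y)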
If you want speed, you need to use TypedTables, where the compiler has knowledge of the type of each column when iterating over rows.
Note that if you have a lot of columns, forcing the type information of each column through the compiler will make things slower, because the compiler becomes the limiting factor.
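For illustration, a minimal TypedTables.jl version of a row-wise loop might look like this (made-up columns):

using TypedTables

# a Table carries the concrete type of every column in its own type,
# so iterating over its rows is fully type-stable
t = Table(x = rand(1000), y = rand(1000))

function rowsum(t)
    s = 0.0
    for row in t  # each row is a fully typed NamedTuple
        s += row.x + row.y
    end
    s
end

rowsum(t)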
This isn’t strictly correct. If you want to have functions like
function f(df)
    s = 0.0
    for i in 1:nrow(df)
        s += df.x[i] + df.y[i]
    end
    s
end
then yes, this would benefit from TypedTables, since the compiler doesn’t know what types df.x and df.y are.
But if you add a function barrier (passing the columns themselves to a function, so the compiler sees their concrete types), you will recover the performance.
function add(x, y)
    s = 0.0
    for i in eachindex(x)
        s += x[i] + y[i]
    end
    s  # return the accumulated sum
end

add(df.x, df.y)
DataFramesMeta does this automatically with @with. The following will be just as fast as the above:
@with df begin
    s = 0.0
    for i in 1:length(:x)
        s += :x[i] + :y[i]
    end
    s
end
So yes, TypedTables is faster in one sense, but it’s very easy to work with DataFrames so that you recover all the performance losses.
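If you want to see this for yourself, here is a sketch using BenchmarkTools.jl that times the three variants (it assumes the f and add definitions given earlier in the thread):

using DataFrames, DataFramesMeta, BenchmarkTools

df = DataFrame(x = rand(10^6), y = rand(10^6))

# wrap the @with version in a function so it can be benchmarked like the others
g(df) = @with df begin
    s = 0.0
    for i in 1:length(:x)
        s += :x[i] + :y[i]
    end
    s
end

@btime f($df)             # type-unstable loop over untyped columns
@btime add($df.x, $df.y)  # function barrier: the loop sees concrete array types
@btime g($df)             # @with inserts the same barrier automatically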