Who does "better" than DataFrames?

rocco_sprmnt21 · April 4, 2023, 8:26pm

Maybe I’m missing something, but at first glance I would say that this solution relies on the dataframe already being sorted by group.
The test should be run on a “shuffled” dataframe to get a more meaningful answer.

using Random
s=shuffle(repeat(1:10^6, inner=4))
df=DataFrame(;s,t,r)
@assert combine(groupby(df, :s),:r=>maximum).r_maximum == vec(maximum(reshape(r, 4, :), dims=1))
ERROR: AssertionError: (combine(groupby(df, :s), :r => maximum)).r_maximum == vec(maximum(reshape(r, 4, :), dims = 1))
Stacktrace:
 [1] top-level scope
   @ c:\Users\sprmn\.julia\environments\v1.8.3\dataframes33.jl:414

Edit:
I have now reread the premise of the post, which does specify the scope in which the proposed solution is valid.

In any case, the proposal, adapted to the general situation, holds up well.

julia> @btime begin
       sort!(df,:s)
       maximum(reshape(df.r, 4, :), dims=1)
       end
  30.534 ms (22 allocations: 114.44 MiB)
1×1000000 Matrix{Float64}:
 0.742695  0.952913  0.884315  …  0.771818  0.929275  0.943746

julia> @btime combine(groupby(df, :s),:r=>maximum)
  32.264 ms (348 allocations: 55.33 MiB)
1000000×2 DataFrame

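A variant that makes better use of this particular structure (every group has exactly 4 rows, so after sorting by :s and :r the maximum is the last row of each column):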

julia> @btime begin
       sort!(df,[:s,:r])
       reshape(df.r, 4, :)[4,:]
       end
  25.522 ms (45 allocations: 114.44 MiB)
1000000-element Vector{Float64}:
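
One caveat about the timings above: `sort!` mutates `df`, so after the first evaluation `@btime` is likely re-sorting an already sorted table, which may flatter the sort-based variants. A minimal sketch of how the comparison could be rerun on a fresh shuffled copy each time, assuming BenchmarkTools' `setup` and `evals` options; the helper `sorted_max!` and the smaller data size are illustrative and not from the thread:

```julia
using BenchmarkTools, DataFrames, Random

# Smaller stand-in data; column names follow the thread, sizes are illustrative.
n = 10^5
s = shuffle(repeat(1:n, inner=4))
r = rand(4n)
df = DataFrame(; s, r)

# Hypothetical helper wrapping the sort-then-reshape approach.
sorted_max!(d) = (sort!(d, :s); vec(maximum(reshape(d.r, 4, :), dims=1)))

# `setup` gives every sample a fresh, still-shuffled copy of `df`;
# `evals=1` ensures a copy is never sorted twice within one sample.
@btime sorted_max!(d) setup=(d = copy($df)) evals=1
```

The time spent in `copy` is part of `setup` and is not counted, so the figure reflects only the sort and the reduction.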

ericphanson · April 4, 2023, 11:40pm

My point, regarding

“It’s a shame these optimizations are only available for a single type of a single package.”

is that I don’t think it’s a big deal whether it only works for one type, since you can always wrap your types in another to access the functionality (and that doesn’t need to cost anything in performance). I don’t really agree re: “flat tables” vs base types; a DataFrame is just a collection of vectors and doesn’t imply the contents must be “flat”. Agreed that StructTypes provides very similar functionality.

aplavin · April 5, 2023, 7:59am

Well, to me it seems pretty weird if, for performance, instead of simply calling unique(s), users need to install DataFrames and then do:

using DataFrames
df = DataFrame(;s)
combine(groupby(df, :s), :s => first).s_first

(example from slightly above in this thread).

Yeah, it can store any data type in columns, but common dataframe operations only work at the “top level”, on the level of columns as a whole. Sure, you can store the dataset as-is in one of the columns, but then we are hardly speaking about convenience anymore. And this only works if your data is a vector, not an n-dim array. Additionally, many functions in Julia take real collections, and dataframes don’t follow the collection interface.

Anyway, I’m not sure what exactly your main point is. I just said it would be great if these performance optimizations were available for many other Julia container types. Sure, it’s a lot of work to identify common, generally useful building blocks and use them to optimize even Base functions. So this may never even be done…

JackDevine · April 6, 2023, 12:55am

Yes, that does make it slightly more general, but it still isn’t entirely general since it fails if one of the groups doesn’t have 4 elements. So you have to be careful when using that strategy and you would probably want to add assertions/checks to your code to make sure that problematic data doesn’t slip through.
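
As a rough sketch of what such checks might look like, here is a Base-only version of the sort-then-reshape approach that refuses to run unless every key occurs exactly `k` times. The function name `fixed_group_maximum` is hypothetical (it is not part of DataFrames.jl or Base), and it takes plain vectors rather than a dataframe to keep the example self-contained:

```julia
# Hypothetical helper: per-group maximum via reshape, guarded by checks
# that every key in `s` occurs exactly `k` times.
function fixed_group_maximum(s::AbstractVector, r::AbstractVector, k::Integer)
    length(s) == length(r) || throw(ArgumentError("s and r must have equal length"))
    length(s) % k == 0 || throw(ArgumentError("length is not a multiple of k = $k"))

    p = sortperm(s)              # bring equal keys together
    ks = reshape(s[p], k, :)     # one column per candidate group

    # Each column must hold a single key ...
    all(col -> all(==(first(col)), col), eachcol(ks)) ||
        throw(ArgumentError("some group does not have exactly $k elements"))
    # ... and no key may span two columns (e.g. a group of size 2k).
    allunique(view(ks, 1, :)) ||
        throw(ArgumentError("some group has more than $k elements"))

    return vec(maximum(reshape(r[p], k, :), dims=1))
end

s = [2, 1, 2, 1]
r = [0.1, 0.7, 0.4, 0.3]
fixed_group_maximum(s, r, 2)   # maxima for keys 1 and 2, in sorted key order
```

The checks add a pass over the sorted keys, so they are not free; whether that cost is acceptable depends on how much the data source can be trusted.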

I agree and I think that the optimizations should be added to Base. Although first, it would be nice to try lots of different scenarios (random data, string data, data where checking equality is expensive) to see if the performance improvement is uniform.

Also, kudos to the DataFrames.jl developers for beating Base for this particular task!
