Help with performance tuning this dataframe aggregation

laborg · September 23, 2018, 1:17pm

Hi,

I think you are making this a little too complicated. In particular, the inner loops over the columns m and n aren’t necessary.

Using Query.jl, the following solution is much faster:

using DataFrames, Query, Statistics

function looptest7(data, loopsize)
	# one summary row per (i, j) configuration
	summary = DataFrame(cfg = Int64[], count = Int64[], ave = Float64[])
	for i in 1:loopsize
		data.m .= fld.(data.x, i)	# bin x once per outer iteration
		for j in 1:loopsize
			cfg = i * j
			data.n .= fld.(data.y, j)

			# minimum z of every (m, n) group that holds exactly cfg rows
			x = @from r in data begin
				@group r.z by {r.m, r.n} into g
				@where length(g) == cfg
				@select minimum(g)
				@collect
			end

			push!(summary, (cfg, length(x), mean(x)))
		end
	end
	return summary
end

julia> @btime looptest7(data,4)
  42.480 ms (184858 allocations: 13.03 MiB)

looptest6, which I haven’t shared, was similar to looptest5 but used only groupby, without Query.jl. looptest7 is twice as fast because groupby on DataFrames isn’t type-stable (see: Type of groupby(df,id) elements are Any).
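
To illustrate what type stability buys here, this is a minimal sketch of the same per-configuration step written against plain vectors and a concretely typed Dict. It is not looptest6 and I have not benchmarked it. The name group_minima is made up, and the element types (integer x and y, Float64 z) are assumptions about the data.

```julia
# Hypothetical helper, not from the thread: group z by (fld(x, i), fld(y, j)),
# keep only the groups with exactly i * j rows, and return each group's minimum z.
# Assumes integer x and y and Float64 z.
function group_minima(x::Vector{Int}, y::Vector{Int}, z::Vector{Float64}, i::Int, j::Int)
    # key => (row count, running minimum); key and value types are concrete,
    # so the compiler can infer every operation in the loop
    groups = Dict{Tuple{Int,Int},Tuple{Int,Float64}}()
    for k in eachindex(x, y, z)
        key = (fld(x[k], i), fld(y[k], j))
        n, m = get(groups, key, (0, Inf))
        groups[key] = (n + 1, min(m, z[k]))
    end
    return [m for (n, m) in values(groups) if n == i * j]
end
```

For one (i, j) pair, length(group_minima(...)) and mean(group_minima(...)) would correspond to the count and ave columns of the summary. Dict iteration order is unspecified, so sort the result before comparing it against anything.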
