Parallelize groupby combine

jar1 · December 15, 2022, 8:03pm

For transform(df, cols => ByRow(f)) I can parallelize by replacing ByRow with a parallel map. Is it possible to parallelize combine(groupby(df, cols), ...)?
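
For readers following along, the ByRow replacement mentioned above could look roughly like the sketch below. The helper `tmap` is hypothetical (it is not part of DataFrames.jl, and a package-provided parallel map would serve the same role), `f` is a placeholder row function, and the sketch assumes `f` is thread-safe and that Julia was started with more than one thread.

```julia
using DataFrames

# Hypothetical helper (not a DataFrames.jl function): a chunked, threaded `map`.
# The input is split into at most `ntasks` chunks, one task is spawned per chunk,
# and the per-chunk results are concatenated in the original order.
function tmap(f, xs; ntasks = Threads.nthreads())
    isempty(xs) && return map(f, xs)
    chunksize = cld(length(xs), ntasks)
    tasks = [Threads.@spawn map(f, chunk) for chunk in Iterators.partition(xs, chunksize)]
    return reduce(vcat, fetch.(tasks))
end

f(x) = x^2 + 1   # placeholder row function; assumed to be thread-safe

df = DataFrame(x = 1:10)

# Row-wise, single-threaded version.
serial = transform(df, :x => ByRow(f) => :y)

# The same transformation, with the whole column handed to the threaded map.
threaded = transform(df, :x => (x -> tmap(f, x)) => :y)
```

Both calls should give the same data frame; the threaded version only pays off when `f` is expensive enough to outweigh the cost of spawning tasks.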

bkamins · December 15, 2022, 8:04pm

See the Functions page of the DataFrames.jl documentation.

bkamins · December 15, 2022, 8:05pm

Your question covers several things that could be parallelized:

groupby: this is already parallelized.
Multiple operations passed in the ... part of combine: these are run in parallel by default (you can disable this if you want).
A single operation that is a custom function producing one row per group: this is also parallelized by default (you can disable this if you want).
A single operation that is a custom function producing many rows per group: parallelizing this is currently not supported, because composing the output of such an operation is hard to parallelize.
A single operation that is an optimized standard function (like mean or sum): this is currently not supported. We might add it, but have not done so yet, because the custom aggregation implementations we have now are fast on a single thread, and it was hard to find a good threshold above which multithreading gives a benefit (for tables with fewer than 1,000,000 rows, the cost of spawning tasks was certainly bigger than the benefit).
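
To illustrate the "enabled by default, can be disabled" cases, here is a minimal sketch. It assumes a DataFrames.jl version in which combine accepts the `threads` keyword argument (1.4 or later) and a Julia session started with several threads (for example `julia -t 4`). The column names and the functions `spread` and `sumsq` are made up for the example.

```julia
using DataFrames, Statistics

df = DataFrame(g = repeat(1:4, inner = 3), x = collect(1.0:12.0))
gdf = groupby(df, :g)

# Custom functions that return one value (one row) per group.
spread(v) = maximum(v) - minimum(v)
sumsq(v) = sum(abs2, v)

# Several operations in one call: with the default setting these may be
# spread over the available threads.
res_default = combine(gdf, :x => spread => :spread, :x => sumsq => :sumsq)

# The same call with multithreading switched off.
res_serial = combine(gdf, :x => spread => :spread, :x => sumsq => :sumsq; threads = false)

# An optimized standard aggregation such as mean takes the single-threaded
# fast path described above.
res_mean = combine(gdf, :x => mean => :mean)
```

The threaded and serial calls should return identical results; `threads = false` is mainly useful when the custom function is not thread-safe or when the outer code is already parallelized.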
