Parallelize groupby combine

For transform(df, cols => ByRow(f)), I can parallelize by replacing ByRow with a parallel map. Is it possible to parallelize combine(groupby(df, cols), ...)?

see Functions · DataFrames.jl


Several things in your question could be parallelized:

  • groupby: this is already parallelized.
  • Running the multiple operations passed in ... in parallel: this is already done by default (you can disable it if you want).
  • A single operation that is a custom function producing one row per group: this is also parallelized by default (you can disable it if you want).
  • A single operation that is a custom function producing many rows per group: parallelizing this is currently not supported, because assembling the output of such an operation is hard to parallelize.
  • A single operation that is a standard, optimized function (like mean or sum): this is currently not supported. We might add it, but we have not done so yet, because the custom single-threaded implementations of these aggregations that we have now are already fast, and it was hard to find a good threshold above which multithreading gave a benefit (for tables with fewer than 1,000,000 rows, the cost of spawning tasks was certainly bigger than the benefit).
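To make the "done by default, but you can disable it" cases concrete, here is a minimal sketch. It assumes DataFrames.jl 1.4 or later, where combine accepts a threads keyword argument, and a Julia session started with more than one thread (for example julia -t 4); with a single thread everything runs serially anyway. The data and the helper functions spread and sumsq are made up for illustration.

```julia
using DataFrames

# Toy data: 4 groups with 3 rows each (illustrative only).
df = DataFrame(g = repeat(1:4, inner = 3), x = 1:12)
gdf = groupby(df, :g)

# Hypothetical custom functions, each returning one value per group.
spread(v) = maximum(v) - minimum(v)
sumsq(v) = sum(abs2, v)

# Default behavior: the two operations, and the groups within each
# operation, may be processed in separate tasks if threads are available.
res_default = combine(gdf, :x => spread => :spread, :x => sumsq => :sumsq)

# Opting out (assumes the `threads` keyword of DataFrames.jl >= 1.4),
# e.g. when a custom function is not thread-safe.
res_serial = combine(gdf, :x => spread => :spread, :x => sumsq => :sumsq;
                     threads = false)
```

Both calls should return the same table; the keyword only controls whether tasks may be spawned, so threads = false is mainly useful when a custom function mutates shared state.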