Threading support for DataFrames transforms

If the transformation is thread-safe, running it in parallel would be safe, and the design is meant to ensure this can be done the way you want in the future.

For GroupedDataFrame we already use multi-threading when a sequence of operations is passed.
The reason is that operations on GroupedDataFrame are usually more expensive, so parallelizing them was a higher priority.

combine under the hood does not differ from transform or select. It is parallelized for GroupedDataFrame but sequential for AbstractDataFrame.
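As a small illustration of the same call on both kinds of objects (the data and column names here are made up, and whether the grouped call actually uses several threads depends on how Julia was started and on the DataFrames.jl version):

```julia
using DataFrames

# Illustrative data; the column names are assumptions for this sketch.
df = DataFrame(g = [1, 1, 2, 2], x = 1:4)

# AbstractDataFrame: the two operations are processed one after another.
whole = combine(df, :x => sum => :total, :x => maximum => :largest)

# GroupedDataFrame: same operation syntax, but here the work
# may be split across threads.
bygroup = combine(groupby(df, :g), :x => sum => :total, :x => maximum => :largest)
```

The results are the same either way; only the scheduling differs.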


Conclusions:

  1. The operations you do are in general very cheap. Therefore the results are dominated by the pre- and post-processing steps that we do (like setting up the proper structure of the DataFrame object).
  2. The assumption that some operation can be safely run in multi-threading mode is fragile. If you pass a function that is not thread-safe, then we should not run the code in parallel.
  3. When designing multi-threading we need to take composability with other packages (like Dagger.jl) into account.
  4. What is currently supported in terms of multi-threading in DataFrames.jl is described in the "Functions" section of the DataFrames.jl documentation.
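To make point 2 concrete, here is a minimal sketch of a function that is not thread-safe next to a thread-safe variant. The function names and the call counter are hypothetical and only serve the illustration; nothing here is DataFrames.jl API.

```julia
using Base.Threads: @spawn, Atomic, atomic_add!

# Not thread-safe: the counter update is an unsynchronized read-modify-write,
# so concurrent calls can lose increments.
calls = Ref(0)
function unsafe_double(x)
    calls[] += 1
    return 2x
end

# Thread-safe variant: the counter is updated atomically.
safe_calls = Atomic{Int}(0)
function safe_double(x)
    atomic_add!(safe_calls, 1)
    return 2x
end

# Only the thread-safe function is run in parallel tasks here.
tasks = [@spawn safe_double(i) for i in 1:1000]
results = fetch.(tasks)
```

A library cannot tell these two functions apart from the outside, which is why turning on parallelism for arbitrary user functions is risky.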

In the short term this means that if you want multi-threading in all cases (and you know it can be safely done), I recommend using @spawn or similar manually (I know it is not a 100% satisfying solution).
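A minimal sketch of the manual approach, assuming the per-column computations are independent and thread-safe (the data and the new column names are invented for the example):

```julia
using DataFrames
using Base.Threads: @spawn

df = DataFrame(a = 1:4, b = 5:8)

# Each task computes one new column independently of the other.
t1 = @spawn cumsum(df.a)
t2 = @spawn df.b .^ 2

# Collect the results and attach them to a copy of the original table.
result = copy(df)
result.a_cumsum = fetch(t1)
result.b_sq = fetch(t2)
```

For columns this small the task overhead outweighs any gain; the pattern only pays off when each computation is expensive.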

In the long run everything you ask for will land in DataFrames.jl, but it will not be soon (i.e. not in the 1.4 release). However, I hope we can agree on the API in the 1.4 release. The relevant discussion is in the PR "Add a keyword argument to disable multithreading" by nalimilan (Pull Request #3030, JuliaData/DataFrames.jl on GitHub).

In particular, before moving forward, the crucial issue is that one mentally needs to separate two things:

  • whether the operation you want to run must be run sequentially or is allowed to be run in parallel;
  • how much parallelism DataFrames.jl is allowed to use (i.e. even if the operation is thread-safe, you might not want to turn on multi-threading because you are running other operations in parallel that you do not want disturbed).
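To show why these are two separate knobs, here is a rough sketch of a helper that takes both. The function `run_ops` and its keyword arguments `parallel_ok` and `max_tasks` are hypothetical and are not the API proposed in the PR; they only illustrate the distinction.

```julia
using Base.Threads: @spawn

# Hypothetical helper, not part of DataFrames.jl.
# `parallel_ok`: the caller's promise that every function in `ops` is thread-safe.
# `max_tasks`:   how many tasks the caller allows this call to run at once.
function run_ops(ops; parallel_ok::Bool = false, max_tasks::Int = 1)
    if !parallel_ok || max_tasks <= 1
        return [op() for op in ops]          # sequential fallback
    end
    results = Vector{Any}(undef, length(ops))
    # Run the operations in batches of at most `max_tasks` concurrent tasks.
    for batch in Iterators.partition(eachindex(ops), max_tasks)
        tasks = [@spawn(ops[i]()) for i in batch]
        results[batch] = fetch.(tasks)
    end
    return results
end

ops = [() -> sum(1:10), () -> prod(1:5), () -> maximum([3, 1, 2])]
seq = run_ops(ops)                                      # safe default
par = run_ops(ops; parallel_ok = true, max_tasks = 2)   # opt in, with a cap
```

The first keyword is about correctness, the second about resource usage, and a caller may want to set one without the other.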

These are quite complex decisions (with far-reaching consequences for the whole ecosystem), so we do not want to rush into a choice and regret it later (actually, we should have had this discussion before enabling multi-threading for GroupedDataFrame). You are welcome to share your thoughts and expectations in that PR so that we can end up with a solution that is useful and usable.

For the time being, as I have commented, users wanting multi-threading in all cases should implement it manually (which is not that hard, but of course it is not an ideal solution).
