[ANN] Folds.jl: threaded, distributed, and GPU-based high-level data-parallel interface for Julia

tkf · February 5, 2021, 6:34pm

Folds.jl provides a unified interface for various parallelized and sequential folds (sum, maximum, map, unique, etc.) based on Transducers.jl API. If you know ThreadsX.jl,^[1] it provides a similar interface but with arbitrary executor that specifies how to execute a certain fold. For example, it is easy to switch to use Distributed.jl-based executor instead of the default thread-based executor:

julia> using Folds

julia> Folds.sum(1:10)
55

julia> Folds.sum(1:10, ThreadedEx())  # equivalent to above
55

julia> Folds.sum(1:10, DistributedEx())
55

Although the list of functions (see the documentation) might look small, Folds.jl is very expressive because it understands Julia’s iterator comprehension:

julia> Folds.sum(y for x in 1:10 if isodd(x) for y in 1:x^2)
4917

Transducers.jl provides an alternative interface for composing iterator transformations that Folds.jl understands:

julia> using Transducers

julia> 1:10 |> Filter(isodd) |> MapCat(x -> 1:x^2) |> Folds.sum
4917

High-level interface defined on top of a composable ecosystem (wishlist)

As I just posted in [ANN] FoldsThreads.jl: A zoo of pluggable thread-based data-parallel execution mechanisms, you can pass a specific threaded executor that fits with your scenario to Folds.jl functions.

Extending this idea, I think it’d be interesting to take the separation of algorithms, executors, and data structures to the extreme and make it work on distributed and GPU settings for writing various kinds of loops. A big picture view would look like something like this:

,-----------------------------------------------------------------------.
|                    [ High-level user code ]                           |
|-----------------------------------------------------------------------|
|                        Folds, FLoops, ...                             |
|=======================================================================|
|  [ Algorithms ]  |     [ Executors ]       |  [ Data structures ]     |
|------------------|-------------------------|--------------------------|
|  Transducers,    | FoldsCore*, FoldsCUDA†, | Array, CuArray, DArray,  |
|  OnlineStats,    | FoldsThreads,           | Tables, FGenerators,     |
|  DataTools, ...  | FoldsDagger†, ...       | Dict, Set, ...           |
|=======================================================================|
|                 FoldsBase*, SplittablesBase, ...                      |
`-----------------------------------------------------------------------'

* Currently, Transducers.jl plays the role of FoldsCore.jl and FoldsBase.jl
† FoldsCUDA.jl is usable but sill work-in-progress. FoldsDagger.jl is very work-in-progress

That is to say, we have a high-level interfaces like FLoops.jl and Folds.jl that cover “90%” of user-defined loops while we can pick up specific lower level interfaces and compose fold algorithms, executors, and data structures whenever it’s required. Data structures, including “composite” data structures like zip, product, gropuby etc., specify the input elements to the loop. Executors schedules and execute the loop provide various execution strategies for the loop that are optimized for different need of computations. Finally fold algorithms provide composable tools for constructing the loop body (e.g., Transducers, OnlineStats, …).

Defining composable interface would be very challenging especially on GPU and distributed settings. SplittablesBase.jl is a good starting point for thread-based parallelism but distributed- and GPU-computing have unique challenges. I think something based on the idea of list homomorphisms (especially distributable homomorphism; e.g., Gorlatch & Lengauer, 2000) is the way to go. It’d be interesting to play with this idea and implement distributed data structures on top of Dagger.jl.

It would be great if we can have a rich ecosystem of custom data structures in distributed- and GPU-computing. It would be even possible to have an ecosystem of custom fold algorithms and executors if we can come up with the right set of API. As we all know, we already have a rich ecosystem of data structures in the sequential world, thanks to the unreasonable effectiveness of multiple dispatch and well thought-out interfaces (especially for custom arrays). I think Julia is a rare language that can pull it off in the parallel world.

ThreadsX.jl stays there, since it also includes functions that cannot be implemented as a fold (e.g., sorting) and has no pluggable execution mechanism (…yet?). ↩︎

datnamer · February 5, 2021, 8:03pm

This is a fantastic vision! Where do you see KernelAbstractions fitting in here?

It does a type of loop thing with multiple executors.

dmolina · February 5, 2021, 8:06pm

Nice work, @tkf. It seems very nice and useful .

Do you think that when it is more madure it could be used as a replacement of ThreadsX.jl with a more general behavior? I am following these packages with great interest, and I have to recognise sometimes I am a little lost.

By the way, in your examples I see several times expressions like ex = has_cuda_gpu() ? CUDAEx() : ThreadedEx() or similar. It suggest to create a auxiliar functions to return the Executor following these common conditions, to make it easier to user. For instance, it could be CUDAOrThreadedEx() or something like that. It avoid repetitive code, and it could be also easier to read (CUDA when it is possible, and Threads in other case).

Thanks again, these packages have a lot of potential!

tkf · February 5, 2021, 10:25pm

Yeah, implementing an executor based on KernelAbstractions totally makes sense since I don’t want to write similar code for different GPU vendors.

I think the code quality is pretty comparable since Folds.jl is started as a copy-and-paste of ThreadsX (and then generalized). Non-folds parts like sorting is much harder to generalize (in the sense that I know it’s possible but I don’t have a good intuition about the performance impact).

There is a dispatch/promotion pipeline for choosing an appropriate executor based on input collection. So, e.g., Folds.sum(f, ::CuArray) and @floop for x in xs::CuArray should use CUDA automatically (in principle…).

The reason why I have executor is that, sometimes I want to choose the execution mechanism independent of the input collection (e.g., Folds.sum(f, 1:n, CUDAEx())).

Yeah, I think I get the usecase. Maybe even some kind of combinator Or(CUDAEx(), ThreadedEx()) to specify the fallbacks. But I haven’t designed the promotion mechanism carefully yet and implementing something like it would need more thoughts.

affans · February 5, 2021, 10:36pm

Is there an ELI5 for what a fold is? I tried reading over the documentation of Transducers but seems almost the same functionality that’s in Base. Is there anything particular about sum in Transducers vs sum in Base?

tkf · February 5, 2021, 10:47pm

Base sum is single-threaded while Folds.sum is parallel by default. Also, you can do complex pre-processing with transducers, when you need. If you want to dig into the mechanism of transducers in details, I did a step-by-step “derivation” of foldl and transducers in my JuliaCon talk: JuliaCon 2019 | Transducers: data-oriented abstraction for sequential and parallel algorithms - YouTube

rveltz · February 5, 2021, 11:05pm

They should hire you at Julia Computing…

tkf · February 5, 2021, 11:06pm

I’m at Julia Lab now

rveltz · February 5, 2021, 11:19pm

Congrats!! It must be recent because Julia Lab: Language, Composability, and Scientific Machine Learning (SciML) …

tkf · February 5, 2021, 11:30pm

Thanks! Maybe the webpage not updated for a while. There’s People · Julia Lab at MIT CSAIL · GitHub

datnamer · February 7, 2021, 12:33am

I mean KA as a frontend. It provides loopy syntax and semantics as well.

tkf · February 7, 2021, 12:48am

I don’t think that’s the right “layering” of the abstraction since KA is very array-focused and GPU-inspired. I’d like to support all, or at least a large class of, parallelizable collections and I think it’d be out of scope for KA (for good reasons). OTOH, since many high-performance collections would use arrays internally, using KA as a lower-level layer makes sense.

Having said that, it should be pretty straightforward to extract out the executors in FoldsThreads.jl in a way reusable as CPU backends for KA.