Folds.jl provides a unified interface for various parallelized and sequential folds (sum, maximum, map, unique, etc.) based on the Transducers.jl API. If you know ThreadsX.jl,[1] Folds.jl provides a similar interface but with an arbitrary executor that specifies how a given fold is executed. For example, it is easy to switch from the default thread-based executor to a Distributed.jl-based one:
julia> using Folds
julia> Folds.sum(1:10)
55
julia> Folds.sum(1:10, ThreadedEx()) # equivalent to above
55
julia> Folds.sum(1:10, DistributedEx())
55
Although the list of functions (see the documentation) might look small, Folds.jl is very expressive because it understands Julia’s iterator comprehensions:
julia> Folds.sum(y for x in 1:10 if isodd(x) for y in 1:x^2)
4917
Transducers.jl provides an alternative interface for composing iterator transformations that Folds.jl understands:
julia> using Transducers
julia> 1:10 |> Filter(isodd) |> MapCat(x -> 1:x^2) |> Folds.sum
4917
High-level interface defined on top of a composable ecosystem (wishlist)
As I just posted in [ANN] FoldsThreads.jl: A zoo of pluggable thread-based data-parallel execution mechanisms, you can pass a specific threaded executor that fits with your scenario to Folds.jl functions.
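As a small usage sketch (assuming FLoops.jl and FoldsThreads.jl are installed; WorkStealingEx is one of the executors FoldsThreads.jl provides), the same loop body can run under different executors just by changing the first argument to @floop:

```julia
using FLoops, FoldsThreads

# Sum of squares of odd numbers in 1:10; only the executor argument
# decides how the iterations are scheduled.
@floop WorkStealingEx() for x in 1:10
    if isodd(x)
        @reduce s += x^2
    end
end
@assert s == 165  # 1 + 9 + 25 + 49 + 81
```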
Extending this idea, I think it’d be interesting to take the separation of algorithms, executors, and data structures to the extreme and make it work in distributed and GPU settings for writing various kinds of loops. A big-picture view would look something like this:
,-----------------------------------------------------------------------.
| [ High-level user code ] |
|-----------------------------------------------------------------------|
| Folds, FLoops, ... |
|=======================================================================|
| [ Algorithms ] | [ Executors ] | [ Data structures ] |
|------------------|-------------------------|--------------------------|
| Transducers, | FoldsCore*, FoldsCUDA†, | Array, CuArray, DArray, |
| OnlineStats, | FoldsThreads, | Tables, FGenerators, |
| DataTools, ... | FoldsDagger†, ... | Dict, Set, ... |
|=======================================================================|
| FoldsBase*, SplittablesBase, ... |
`-----------------------------------------------------------------------'
* Currently, Transducers.jl plays the role of FoldsCore.jl and FoldsBase.jl.
† FoldsCUDA.jl is usable but still work-in-progress. FoldsDagger.jl is very much work-in-progress.
That is to say, we have high-level interfaces like FLoops.jl and Folds.jl that cover “90%” of user-defined loops, while we can pick specific lower-level interfaces and compose fold algorithms, executors, and data structures whenever it’s required. Data structures, including “composite” data structures like zip, product, groupby, etc., specify the input elements to the loop. Executors schedule and execute the loop, providing various execution strategies optimized for different computational needs. Finally, fold algorithms provide composable tools for constructing the loop body (e.g., Transducers, OnlineStats, …).
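For example, the three axes can be picked independently. Here is a minimal sketch using packages already mentioned above (SequentialEx can be swapped for ThreadedEx or DistributedEx without touching the rest):

```julia
using Folds, Transducers

# Data structure: a lazy "composite" input (a zip of two ranges).
data = zip(1:5, 11:15)

# Algorithm: a transducer describing the per-element work.
xf = Map(((a, b),) -> a * b)

# Executor: how the fold runs; only this argument changes across
# sequential, threaded, and distributed execution.
result = Folds.sum(data |> xf, SequentialEx())
@assert result == 205  # 1*11 + 2*12 + 3*13 + 4*14 + 5*15
```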
Defining a composable interface would be very challenging, especially in GPU and distributed settings. SplittablesBase.jl is a good starting point for thread-based parallelism, but distributed and GPU computing have unique challenges. I think something based on the idea of list homomorphisms (especially distributable homomorphisms; e.g., Gorlatch & Lengauer, 2000) is the way to go. It’d be interesting to play with this idea and implement distributed data structures on top of Dagger.jl.
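To make the thread-based starting point concrete, here is a rough sketch (MyChunk is a hypothetical wrapper, not a real package type) of hooking a custom container into SplittablesBase.jl: implementing SplittablesBase.halve, plus ordinary iteration, is essentially what a divide-and-conquer fold needs:

```julia
using SplittablesBase

# A hypothetical container: a plain wrapper around a Vector.
struct MyChunk{T}
    data::Vector{T}
end

# Ordinary sequential iteration delegates to the wrapped vector.
Base.iterate(c::MyChunk, state...) = iterate(c.data, state...)
Base.length(c::MyChunk) = length(c.data)

# `halve` tells parallel folds how to split the work in two.
function SplittablesBase.halve(c::MyChunk)
    mid = length(c.data) ÷ 2
    return (MyChunk(c.data[1:mid]), MyChunk(c.data[mid+1:end]))
end

left, right = SplittablesBase.halve(MyChunk(collect(1:10)))
@assert sum(left.data) + sum(right.data) == 55
```

The open question raised above is what the analogous minimal interface looks like when "splitting" must respect distributed data placement or GPU memory layout.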
It would be great if we could have a rich ecosystem of custom data structures for distributed and GPU computing. It might even be possible to have an ecosystem of custom fold algorithms and executors if we can come up with the right set of APIs. As we all know, we already have a rich ecosystem of data structures in the sequential world, thanks to the unreasonable effectiveness of multiple dispatch and well-thought-out interfaces (especially for custom arrays). I think Julia is a rare language that can pull this off in the parallel world.
[1] ThreadsX.jl is still around, since it also includes functions that cannot be implemented as a fold (e.g., sorting) and has no pluggable execution mechanism (…yet?).