[pre-ANN] DataFlowTasks.jl: a package for inferring task dependencies based on data annotations

Dear Julia community,

Together with @ffevotte, @LaurentPlagne, and Vincent Sivadon, we have put together a package (DataFlowTasks) for automatically inferring implicit Task dependencies from explicitly annotated data dependencies.

The following example, taken from the documentation, illustrates the API:

    using DataFlowTasks
    A = Vector{Float64}(undef, 4)
    result = let
        @dspawn fill!(@W(A), 0)           # task 1: accesses everything
        @dspawn @RW(view(A, 1:2)) .+= 2   # task 2: modifies the first half
        @dspawn @RW(view(A, 3:4)) .+= 3   # task 3: modifies the second half
        @dspawn @R(A)                     # task 4: get the result
    end
    fetch(result)

In the example above, the last task (created with @dspawn) will wait for the second and third, which in turn will wait for the first one, because of the detected data conflicts. The second and third tasks, however, can be executed in parallel.
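
For comparison, here is a rough sketch (not from the package, just plain Base tasks) of what spelling out those same dependencies by hand with Threads.@spawn and wait could look like:

    using Base.Threads: @spawn

    A = Vector{Float64}(undef, 4)
    t1 = @spawn fill!(A, 0)                      # task 1
    t2 = @spawn (wait(t1); view(A, 1:2) .+= 2)   # task 2: must wait for task 1
    t3 = @spawn (wait(t1); view(A, 3:4) .+= 3)   # task 3: must wait for task 1
    t4 = @spawn (wait(t2); wait(t3); A)          # task 4: must wait for tasks 2 and 3
    fetch(t4)

With @dspawn, those wait calls (the edges of the task graph) are inferred from the @R/@W/@RW annotations instead of being written manually.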

More interesting examples, where we believe the constant reuse of data (together with a nontrivial access pattern) makes this data-flow approach more natural than, say, a functional approach, can be found in the examples section of the docs, where we showcase e.g. a pure-Julia threaded Cholesky factorization on par with MKL (for large matrices)!

We are considering registering the package if others may find it useful, and would be happy to take into account feedback from the community.

This looks awesome! The use of macros is really clean. Would it be fair to say that this is a highly fine-grained version of the Base.@atomic macro in Julia 1.8+?

Well, I kind of understand how one could see it that way, but we rather think of it as an extension of Threads.@spawn, which gives you a way of specifying constraints on the order in which tasks are scheduled.

Consider the tiled Cholesky factorization example presented in the documentation. The core of its parallel implementation with DataFlowTasks looks like this (look at the full example to get more context if needed):

    # T[i,j] is a tile in the matrix to be factorized
    T = [view(A, tilerange(i, ts), tilerange(j, ts)) for i in 1:n, j in 1:n]

    for i in 1:n
        # Diagonal Cholesky serial factorization
        @dspawn cholesky!(@RW(T[i,i]))

        # Left tiles update
        U = UpperTriangular(T[i,i])
        for j in i+1:n
            @dspawn ldiv!(@R(U)', @RW(T[i,j]))
        end

        # Submatrix update
        for j in i+1:n
            for k in j:n
                @dspawn mul!(@RW(T[j,k]), @R(T[i,j])', @R(T[i,k]), -1, 1)
            end
        end
    end

If we remove the @dspawn macros, we get a perfectly fine sequential implementation. If we replace @dspawn with Threads.@spawn, we get a parallel version full of data races. Atomic operations could prevent the data races, but they would not ensure that tasks are scheduled in the right order.
In our view, DataFlowTasks should be considered as a (hopefully nice) way of specifying which task operates on which data (which is simple here), so that a minimal set of constraints on the scheduling order of tasks can be inferred automatically (which would be a bit tedious to do by hand in this case, as illustrated by the complexity of the DAG of task-task dependencies).

[Figure: DAG of inferred task-task dependencies]

But having this minimal set of task-task dependency constraints allows the tasks to be scheduled in a hopefully near-optimal way, which in turn enables good speedup. To expand on the performance of the parallel tiled Cholesky factorization, here is a comparison with MKL, taken from the “companion project” TiledFactorization.jl (where high sequential performance is attained thanks to LoopVectorization and TriangularSolve, and parallel performance comes from DataFlowTasks):

[Figure: performance comparison with MKL]

Great talk from @ffevotte and @maltezfaria!

Thanks for the link (and for reviving this thread)! Since the pre-release announcement was made last year, DataFlowTasks underwent some major changes, and we believe the package is now “ready for public consumption”.

If you have some complex, mostly in-place algorithm that you are trying to parallelize using Threads, DataFlowTasks may be able to help you: take a look at the docs for some not-so-trivial examples.

Community feedback and/or criticism would be greatly appreciated! We feel we somehow missed the target audience during the release announcement :slightly_frowning_face:

This is really cool! :heart:

Funny enough, we (myself and @Rabab53) have actually just added a new “datadeps” system to Dagger.jl which works in basically the same way (except it’s not aware of views, but we plan to add that later), and which extends the idea to distributed computing as well as multithreading (and GPUs in the near future). It’s probably not nearly as efficient as DataFlowTasks (the static scheduler it implements is naive and uses round-robin scheduling), but we have plans to make it smarter.
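
To give a rough flavor (the API is brand-new, so take the exact spelling with a grain of salt), a datadeps region looks something like this, with access modes declared via the In/Out/InOut argument wrappers:

    using Dagger

    A = rand(1000)
    B = zeros(1000)
    Dagger.spawn_datadeps() do
        Dagger.@spawn copyto!(Dagger.Out(B), Dagger.In(A))  # writes B, reads A
        Dagger.@spawn sort!(Dagger.InOut(B))                # must wait for the copy above
    end

The region waits for all of its tasks to complete, with the task-task ordering inferred from the declared access modes.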

Anyway, I think this system of declaring data dependencies for automatic parallelism is quite powerful for accelerating all kinds of codes, and so I hope we could collaborate and share ideas to make this style of programming more accessible and more widely used by the ecosystem :smiley:

Thanks for reaching out @jpsamaroo!

I think Dagger.jl is the way to go when it comes to hybrid parallel computing, and it is nice to see a datadeps mechanism becoming a part of it. It would be great if we had something like e.g. StarPU one day in Julia!

On the DataFlowTasks.jl side, we kept things really simple, and focused exclusively on overloading the high-level (native) task-based mechanics. At some point, we did have internal schedulers of various sorts, but as the package evolved we decided to simplify things as much as possible :sweat_smile:

On a separate note, at some point with @ffevotte we thought the JuliaParallel organization would be a good home for DataFlowTasks.jl, but we never dared to ask. Any thoughts on this?

Finally, I would be very interested in discussing some possible collaboration on this style of data-driven parallel programming!

After a long wait, Dagger’s datadeps system now has aliasing support! What this means: Dagger now automatically detects how task arguments alias with each other (through views, and also array wrappers like UpperTriangular and friends), and will use this information to optimize scheduling and memory transfers to exploit non-overlapping data access. The system is extensible via dispatch, allowing custom aliasing rules to be defined for a given argument type.
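
As a rough sketch of what this enables (addhalf! is just a hypothetical helper for illustration, and the exact spelling may differ slightly), tasks touching disjoint views of the same parent array can now run in parallel:

    using Dagger

    addhalf!(v, x) = (v .+= x)   # hypothetical helper, not part of Dagger

    A = Vector{Float64}(undef, 4)
    Dagger.spawn_datadeps() do
        Dagger.@spawn fill!(Dagger.Out(A), 0)                  # writes all of A
        Dagger.@spawn addhalf!(Dagger.InOut(view(A, 1:2)), 2)  # first half only
        Dagger.@spawn addhalf!(Dagger.InOut(view(A, 3:4)), 3)  # second half: no overlap with the task above
    end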

Additionally, GPU support is coming along: I have CUDA and ROCm single-GPU support working locally, and will be testing multi-GPU support soon as well. I’m also planning to bring up Metal and oneAPI support shortly, to cover all the major GPU vendors. In the end, using GPUs automatically with datadeps shouldn’t require any changes to the user’s datadeps-powered algorithm; as long as each function call has an equivalent dispatch when handed CuArrays, ROCArrays, etc., then GPUs can be automatically utilized by datadeps, if enabled by the user/library.
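
In practice that mostly means writing kernels generically; a minimal sketch of the idea (scale! is just an illustrative name, not part of Dagger):

    # The same generic method works for Array, CuArray, ROCArray, etc.,
    # because broadcasting dispatches to the array's own backend.
    scale!(A, s) = (A .*= s)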

Aside: I will be giving a Dagger workshop at JuliaCon 2024 that will have a section on programming with datadeps, so make sure to sign up for a workshop ticket if you’d like to learn how to best use this powerful new datadeps system!

Great news! I look forward to testing the new datadeps capabilities of Dagger.
