Efficient parallelization of in-place transforms

I wanted to revisit the discussion Pmap with in place functions that I started two years ago. Generalizing a bit, suppose I have an in-place transform f!(x) and an array of data, xvals, to which I want to apply the transform:

f!.(xvals)

I'm open to either multithreading or multiprocessing: is there a straightforward way to modify this to take advantage of a shared-memory environment? Ideally, the same piece of code would still execute properly in a purely serial setting.
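
For concreteness, here is a minimal (purely hypothetical) stand-in for f! and xvals, just to make the setup above concrete:

f!(x) = (x .= sqrt.(x); nothing)      # mutates its argument in place
xvals = [rand(100) for _ in 1:1_000]  # independent pieces of data
f!.(xvals)                            # the serial, broadcasted application from above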

FoldsDagger.jl contains a proof-of-concept implementation of this. However, we are still searching for a good Dagger API for supporting this type of computation: Correctly implement in-place mutation · Issue #7 · JuliaFolds/FoldsDagger.jl · GitHub

Once this is implemented correctly, it should be possible to use Folds.map! to express this type of computation. Alternatively, you can combine FLoops.jl and Referenceables.jl (the latter makes it easier to support multiple outputs):

@floop executor for x in referenceable(xs)
    x[] += 1   # x is a reference to one slot of xs; x[] reads and writes that slot
end

This kind of code already runs on a single thread, on multiple threads, and on GPUs. To make it work in distributed environments, we need a good distributed collections library.
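
To make that concrete, here is a self-contained sketch of the shared-memory case (my own wiring of the loop above, assuming FLoops re-exports the SequentialEx/ThreadedEx executors from the Transducers.jl family):

using FLoops                        # @floop plus SequentialEx / ThreadedEx executors
using Referenceables: referenceable

xs = collect(1.0:100.0)

# Pick the executor in one place; the loop body stays identical across backends.
executor = Threads.nthreads() > 1 ? ThreadedEx() : SequentialEx()

@floop executor for x in referenceable(xs)
    x[] += 1                        # mutate the slot of xs that x refers to
end

@assert xs == collect(2.0:101.0)    # every element was incremented exactly once

For the GPU case, if I remember correctly, the same loop body works with a GPU executor such as FoldsCUDA.jl's CUDAEx() over a CuArray.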