Tutorial: Efficient and safe approaches to mutation in data parallelism

Hi, I just wrote another tutorial on data-parallel programming in Julia: Efficient and safe approaches to mutation in data parallelism. It is a sequel to A quick introduction to data parallelism in Julia.

As discussed in A quick introduction to data parallelism, data-parallel style lets us write fast, portable, and generic parallel programs. One of the main focuses was to unlearn the “sequential idiom” that accumulates the result into mutable state. However, mutable state is sometimes preferred for efficiency. After all, a fast parallel program is typically a composition of fast sequential programs. Furthermore, managing mutable state is sometimes unavoidable for interoperability with libraries that prefer or require mutation-based APIs. However, sharing mutable state is almost always a bad idea. Doing so naively likely results in data races and hence programs with undefined behavior. Although low-level concurrency APIs such as locks and atomics can be used for writing (typically) inefficient but technically correct programs, a better approach is to use single-owner local mutable state. In particular, we will see that unlearning the sequential idiom was worth the effort, since it points us to what we call ownership-passing style, which can be used to construct mutation-based parallel reduction from mutation-free (“purely functional”) reduction as an optimization.

This tutorial provides an overview of mutable-object handling in data-parallel Julia programs. It also discusses the effect and analysis of false sharing, a major performance pitfall when using in-place operations in a parallel program.
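To give a flavor of the false-sharing pitfall before diving into the tutorial, here is a minimal, hypothetical sketch (not the tutorial's own benchmark): each task owns exactly one slot of an accumulator array, so the code is race-free, but neighboring `Int` slots share a 64-byte cache line, so concurrent writes from different cores keep stealing the line from each other. Padding each slot out to its own cache line removes the contention without changing the result.

```julia
using Base.Threads

# Race-free, but prone to false sharing: adjacent slots of `acc`
# live on the same cache line.
function chunk_sums_packed(xs, nchunks)
    acc = zeros(Int, nchunks)
    @threads for c in 1:nchunks
        for i in c:nchunks:length(xs)
            @inbounds acc[c] += xs[i]   # frequent writes to neighboring slots
        end
    end
    return sum(acc)
end

# Same computation, but each slot is padded to a full cache line
# (8 × 8-byte Ints = 64 bytes), so the writes no longer contend.
function chunk_sums_padded(xs, nchunks)
    acc = zeros(Int, 8 * nchunks)
    @threads for c in 1:nchunks
        for i in c:nchunks:length(xs)
            @inbounds acc[8 * (c - 1) + 1] += xs[i]
        end
    end
    return sum(acc)
end
```

Both functions compute the same sum; only the memory layout (and hence the cache behavior under multiple threads) differs.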

Table of contents:

  1. Example: multiplying and adding matrices
    1. Advanced: fusing multiplication and addition in base cases
  2. Categorizing mutation use-cases
  3. Filling outputs
    1. Pitfalls with filling pre-allocated outputs
  4. In-place reductions
    1. Flexible reduction with FLoops.@reduce
      1. @reduce(acc = op(init, input)) example
      2. Ownership-passing style
      3. @reduce() do example
      4. General form of @reduce() do syntax
      5. Ownership-passing style: second argument
    2. Initializing mutable accumulator using Transducers.OnInit
    3. Combining containers
    4. OnlineStats
    5. Pitfalls with mutable reduction states
  5. Mutable temporary objects (private variables)
  6. Accidental mutations
  7. Advanced/Performance: False sharing
    1. Analyzing false sharing using perf c2c
  8. Advanced: adjoining trick

By the way, I’m by no means an expert on CPU hardware and have very little understanding of how the cache-coherence protocol actually works. In particular, I don’t know how on earth certain false sharing does not happen on Intel and IBM CPUs. So, I would appreciate it if experts could comment on/fact-check the false sharing part.

(Of course, comments on other points are also welcome :) )


Why is a tuple used in all the append! examples? For example, this

using FLoops

@floop for x in 1:10
    if isodd(x)
        @reduce(odds = append!(Int[], (x,)))
    else
        @reduce(evens = append!(Int[], (x,)))
    end
end

produces the same result as

@floop for x in 1:10
    if isodd(x)
        @reduce(odds = append!(Int[], x))
    else
        @reduce(evens = append!(Int[], x))
    end
end

Also

ys = Folds.mapreduce(tuple, withinit(() -> Int[], append!), 1:10; init = Init())

produces the same result as

ys = Folds.reduce(withinit(() -> Int[], append!), 1:10; init = Init())

So why do we need this extra tuple conversion?

That’s a good point that touches a corner of Julia I wish I could ignore :) Basically, the code without the tuple-wrapping works “by accident” (in some sense). It actually has nothing to do with parallelism. Consider these implementations of collect with and without the tuple-wrapping:

julia> function collect1(xs)
           ys = Any[]
           for x in xs
               append!(ys, x)  # no tuple
           end
           return ys
       end;

julia> function collect2(xs)
           ys = Any[]
           for x in xs
               append!(ys, (x,))  # tuple
           end
           return ys
       end;

As you’ve observed, there is no difference if the input is a collection of numbers (each number being a singleton collection of itself):

julia> collect1(1:3)
3-element Vector{Any}:
 1
 2
 3

julia> collect2(1:3)
3-element Vector{Any}:
 1
 2
 3

However, your version won’t work when each element is itself a collection (note: Base.collect(1 => 2) == [1, 2]):

julia> collect1(x => x^2 for x in 1:3)
6-element Vector{Any}:
 1
 1
 2
 4
 3
 9

julia> collect2(x => x^2 for x in 1:3)
3-element Vector{Any}:
 1 => 1
 2 => 4
 3 => 9

In other words, you are observing that

for e in x
    ...
end

and

for e in (x,)
    ...
end

are equivalent if x isa Number.
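This equivalence can be checked directly: in Julia, a number is iterable and yields itself exactly once, so iterating over `x` and over `(x,)` produces the same single element (a small demonstration, not from the tutorial):

```julia
x = 42

# Iterating over a number yields the number itself, once.
ys1 = Int[]
for e in x
    push!(ys1, e)
end

# Iterating over the singleton tuple yields the same single element.
ys2 = Int[]
for e in (x,)
    push!(ys2, e)
end

@assert ys1 == ys2 == [42]

# Hence append! also behaves identically for a bare number:
@assert append!(Int[], x) == append!(Int[], (x,)) == [42]
```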

So, in the examples using Int, you are correct that there is no need to use a tuple. However, since this is a tutorial, it is important to demonstrate the fundamental pattern in parallel computing:

  1. Create a singleton solution (e.g., the singleton tuple)
  2. Combine the solutions (e.g., append!)
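The two steps above can be sketched with plain Base.mapreduce (sequential here, but the same shape is what parallel folds build on; this is my illustration, not code from the tutorial):

```julia
# Step 1: map each input to a singleton solution. The fresh array is
# owned by the reduction, which is what makes step 2 safe.
singleton(x) = [x]

# Step 2: combine solutions. Mutating the left argument via append!
# is fine precisely because every intermediate array was freshly
# allocated inside the reduction (ownership-passing style).
ys = mapreduce(singleton, append!, 1:10)

@assert ys == collect(1:10)
```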

I was hoping that explicitly constructing the singleton solution, like (x,), would help readers familiarize themselves with this pattern.
