Tutorial: Efficient and safe approaches to mutation in data parallelism

Hi, I just wrote another tutorial on data-parallel programming in Julia: Efficient and safe approaches to mutation in data parallelism. It is a sequel to A quick introduction to data parallelism in Julia.

As discussed in A quick introduction to data parallelism, data-parallel style lets us write fast, portable, and generic parallel programs. One of the main focuses was to unlearn the “sequential idiom” that accumulates the result into mutable state. However, mutable state is sometimes preferred for efficiency. After all, a fast parallel program is typically a composition of fast sequential programs. Furthermore, managing mutable state is sometimes unavoidable for interoperability with libraries that prefer or require mutation-based APIs. However, sharing mutable state is almost always a bad idea. Doing so naively likely results in data races and hence programs with undefined behavior. Although low-level concurrency APIs such as locks and atomics can be used for writing (typically) inefficient but technically correct programs, a better approach is to use single-owner local mutable state. In particular, we will see that unlearning the sequential idiom was worth the effort, since it points us to what we call ownership-passing style, which can be used to construct mutation-based parallel reduction from mutation-free (“purely functional”) reduction as an optimization.

This tutorial provides an overview of mutable-object handling in data-parallel Julia programs. It also discusses the effect and analysis of false sharing, a major performance pitfall when using in-place operations in a parallel program.
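To give a flavor of the false-sharing pitfall before diving into the tutorial, here is a minimal, hypothetical sketch (not the tutorial's own benchmark): each task owns exactly one slot of an accumulator array, so the code is race-free, but neighboring `Int` slots share a 64-byte cache line, so concurrent writes from different cores keep stealing the line from each other. Padding each slot out to its own cache line removes the contention without changing the result.

```julia
using Base.Threads

# Race-free, but prone to false sharing: adjacent slots of `acc`
# live on the same cache line.
function chunk_sums_packed(xs, nchunks)
    acc = zeros(Int, nchunks)
    @threads for c in 1:nchunks
        for i in c:nchunks:length(xs)
            @inbounds acc[c] += xs[i]   # frequent writes to neighboring slots
        end
    end
    return sum(acc)
end

# Same computation, but each slot is padded to a full cache line
# (8 × 8-byte Ints = 64 bytes), so the writes no longer contend.
function chunk_sums_padded(xs, nchunks)
    acc = zeros(Int, 8 * nchunks)
    @threads for c in 1:nchunks
        for i in c:nchunks:length(xs)
            @inbounds acc[8 * (c - 1) + 1] += xs[i]
        end
    end
    return sum(acc)
end
```

Both functions compute the same sum; only the memory layout (and hence the cache behavior under multiple threads) differs.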

Table of contents:

  1. Example: multiplying and adding matrices
    1. Advanced: fusing multiplication and addition in base cases
  2. Categorizing mutation use-cases
  3. Filling outputs
    1. Pitfalls with filling pre-allocated outputs
  4. In-place reductions
    1. Flexible reduction with FLoops.@reduce
      1. @reduce(acc = op(init, input)) example
      2. Ownership-passing style
      3. @reduce() do example
      4. General form of @reduce() do syntax
      5. Ownership-passing style: second argument
    2. Initializing mutable accumulator using Transducers.OnInit
    3. Combining containers
    4. OnlineStats
    5. Pitfalls with mutable reduction states
  5. Mutable temporary objects (private variables)
  6. Accidental mutations
  7. Advanced/Performance: False sharing
    1. Analyzing false sharing using perf c2c
  8. Advanced: adjoining trick

By the way, I’m by no means an expert on CPU hardware and have very little understanding of how the cache-coherence protocol actually works. In particular, I don’t know how on earth certain false sharing does not happen on Intel and IBM CPUs. So, I would appreciate it if experts could comment on/fact-check the false sharing part.

(Of course, comments on other points are also welcome :) )


Why is a tuple used in all the append! examples? For example, this

using FLoops

@floop for x in 1:10
    if isodd(x)
        @reduce(odds = append!(Int[], (x,)))
    else
        @reduce(evens = append!(Int[], (x,)))
    end
end

produces the same result as

@floop for x in 1:10
    if isodd(x)
        @reduce(odds = append!(Int[], x))
    else
        @reduce(evens = append!(Int[], x))
    end
end

Also

ys = Folds.mapreduce(tuple, withinit(() -> Int[], append!), 1:10; init = Init())

produces the same result as

ys = Folds.reduce(withinit(() -> Int[], append!), 1:10; init = Init())

So why do we need this extra tuple conversion?

That’s a good point that touches a corner of Julia I wish I could ignore :) Basically, the code without the tuple-wrapping works “by accident” (in some sense). It actually has nothing to do with parallelism. Consider these implementations of collect with and without the tuple-wrapping:

julia> function collect1(xs)
           ys = Any[]
           for x in xs
               append!(ys, x)  # no tuple
           end
           return ys
       end;

julia> function collect2(xs)
           ys = Any[]
           for x in xs
               append!(ys, (x,))  # tuple
           end
           return ys
       end;

As you’ve observed, there is no difference if the input is a collection of numbers (each number being a singleton collection of itself):

julia> collect1(1:3)
3-element Vector{Any}:
 1
 2
 3

julia> collect2(1:3)
3-element Vector{Any}:
 1
 2
 3

However, your version won’t work when each element is itself a collection (note: Base.collect(1 => 2) == [1, 2]):

julia> collect1(x => x^2 for x in 1:3)
6-element Vector{Any}:
 1
 1
 2
 4
 3
 9

julia> collect2(x => x^2 for x in 1:3)
3-element Vector{Any}:
 1 => 1
 2 => 4
 3 => 9

In other words, you are observing that

for e in x
    ...
end

and

for e in (x,)
    ...
end

are equivalent if x isa Number.
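This equivalence can be checked directly: in Julia, a number is iterable and yields itself exactly once, so iterating over `x` and over `(x,)` produces the same single element (a small demonstration, not from the tutorial):

```julia
x = 42

# Iterating over a number yields the number itself, once.
ys1 = Int[]
for e in x
    push!(ys1, e)
end

# Iterating over the singleton tuple yields the same single element.
ys2 = Int[]
for e in (x,)
    push!(ys2, e)
end

@assert ys1 == ys2 == [42]

# Hence append! also behaves identically for a bare number:
@assert append!(Int[], x) == append!(Int[], (x,)) == [42]
```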

So, in the examples using Int, you are correct that there is no need to use a tuple. However, since this is a tutorial, it is important to demonstrate the fundamental pattern in parallel computing:

  1. Create a singleton solution (e.g., the singleton tuple)
  2. Combine the solutions (e.g., append!)
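The two steps above can be sketched with plain Base.mapreduce (sequential here, but the same shape is what parallel folds build on; this is my illustration, not code from the tutorial):

```julia
# Step 1: map each input to a singleton solution. The fresh array is
# owned by the reduction, which is what makes step 2 safe.
singleton(x) = [x]

# Step 2: combine solutions. Mutating the left argument via append!
# is fine precisely because every intermediate array was freshly
# allocated inside the reduction (ownership-passing style).
ys = mapreduce(singleton, append!, 1:10)

@assert ys == collect(1:10)
```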

I was hoping that explicitly constructing the singleton solution, like (x,), would help readers familiarize themselves with this pattern.
