As discussed in a quick introduction to data parallelism, the data-parallel style lets us write fast, portable, and generic parallel programs. One of its main lessons was to unlearn the “sequential idiom” of accumulating the result into mutable state. However, mutable state is sometimes preferable for efficiency; after all, a fast parallel program is typically a composition of fast sequential programs. Furthermore, managing mutable state is sometimes unavoidable for interoperating with libraries that prefer or require mutation-based APIs. Sharing mutable state, though, is almost always a bad idea: doing so naively results in data races and hence programs with undefined behavior. Although low-level concurrency APIs such as locks and atomics can be used to write technically correct (but typically inefficient) programs, a better approach is to use single-owner local mutable state. In particular, we will see that unlearning the sequential idiom was worth the effort, since it points us to what we call ownership-passing style, which can be used to construct mutation-based parallel reductions from mutation-free (“purely functional”) reductions as an optimization.
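To make “ownership-passing style” concrete, here is a minimal sketch (the name tcollect, the basesize cutoff, and the divide-and-conquer structure are my illustration, not a library API). Each base case allocates a fresh accumulator that it alone owns; returning it passes ownership to the combiner, which is then free to mutate it with append!:

using Base.Threads: @spawn  # requires Julia >= 1.3

function tcollect(xs; basesize = 1024)
    if length(xs) <= basesize
        ys = Any[]            # freshly allocated; owned by this task only
        for x in xs
            append!(ys, (x,)) # create a singleton solution, then combine
        end
        return ys             # returning `ys` passes its ownership to the caller
    else
        mid = length(xs) ÷ 2
        left = @spawn tcollect(xs[begin:begin+mid-1]; basesize = basesize)
        right = tcollect(xs[begin+mid:end]; basesize = basesize)
        return append!(fetch(left), right)  # safe: we now own both halves
    end
end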
This tutorial provides an overview of handling mutable objects in data-parallel Julia programs. It also discusses the effect and analysis of false sharing, a major performance pitfall when using in-place operations in a parallel program.
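As a teaser for the false-sharing part, here is the classic demonstration in sketch form (the function names and the 64-byte cache-line size are my assumptions; the line size varies by CPU). Each thread increments only its own counter, yet the “shared” layout is typically much slower, because adjacent counters live on the same cache line:

using Base.Threads: @threads, nthreads

function inc_shared!(counters::Vector{Int}, n)
    @threads for t in 1:length(counters)
        for _ in 1:n
            counters[t] += 1     # neighboring slots share a cache line
        end
    end
end

function inc_padded!(counters::Matrix{Int}, n)
    @threads for t in 1:size(counters, 2)
        for _ in 1:n
            counters[1, t] += 1  # each column spans a whole 64-byte line
        end
    end
end

# Usage sketch: compare the two with @time in a multi-threaded session.
shared = zeros(Int, nthreads())
padded = zeros(Int, 8, nthreads())  # 8 × 8-byte Int = 64 bytes per column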
By the way, I’m by no means an expert on CPU hardware and have very little understanding of how the cache coherence protocol actually works. In particular, I don’t know how on earth certain false sharing does not happen on Intel and IBM CPUs. So, I’d appreciate it if experts could comment on or fact-check the false-sharing part.
(Of course, comments on other points are welcome too.)
That’s a good point, and it touches a corner of Julia I wish I could ignore. Basically, the code without the tuple-wrapping works “by accident” (in some sense). It actually has nothing to do with parallelism. Consider these implementations of collect with and without the tuple-wrapping:
julia> function collect1(xs)
           ys = Any[]
           for x in xs
               append!(ys, x)  # no tuple
           end
           return ys
       end;
julia> function collect2(xs)
           ys = Any[]
           for x in xs
               append!(ys, (x,))  # tuple
           end
           return ys
       end;
As you’ve observed, there is no difference if the input is a collection of numbers (a number iterates as a singleton collection of itself):
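julia> collect1(1:3) == collect2(1:3) == Any[1, 2, 3]
true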
However, your version doesn’t work as intended when each element is itself a collection (note: Base.collect(1 => 2) == [1, 2]):
julia> collect1(x => x^2 for x in 1:3)
6-element Vector{Any}:
 1
 1
 2
 4
 3
 9
julia> collect2(x => x^2 for x in 1:3)
3-element Vector{Any}:
 1 => 1
 2 => 4
 3 => 9
In other words, you are observing that
for e in x
    ...
end

and

for e in (x,)
    ...
end
are equivalent if x isa Number.
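A quick way to see this equivalence in the REPL, using append! itself (it iterates its second argument either way):

julia> append!([], 2) == append!([], (2,))
true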
So, in the examples using Int, you are correct that there is no need to use a tuple. However, since this is a tutorial, it is important to demonstrate the fundamental pattern in parallel computing:
1. Create a singleton solution (e.g., the singleton tuple (x,))
2. Combine the solutions (e.g., append!)
I was hoping that explicitly constructing the singleton solution like (x,) would help readers familiarize themselves with this pattern.
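If it helps, the two steps can also be spelled out with mapreduce (singleton and combine! below are hypothetical names I made up to make the steps explicit, not part of any API). Using append! as the combiner is safe here precisely because init = Any[] creates a fresh accumulator owned by this call, i.e., the ownership-passing style again; a mutation-free variant would combine with vcat instead:

julia> singleton(x) = Any[x];               # step 1: create a singleton solution

julia> combine!(ys, zs) = append!(ys, zs);  # step 2: combine the solutions

julia> mapreduce(singleton, combine!, (x => x^2 for x in 1:3); init = Any[])
3-element Vector{Any}:
 1 => 1
 2 => 4
 3 => 9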