Hello,

I’m doing some parallel loops in my code using multi-threading. As I was reviewing my results, I noticed that I kept having different results in a piece of code, although the input data was always the same. So instead of having a parallel version, I moved to a serial version, which confirmed that multi-threading was somehow producing random results.

Here is a MWV of what i’m doing:

``````Sim = randn(50,12,10,2)
lev = zeros(10)
for ij = 1:10
suma = 0.0
counta = 0.0
for j = 1:50
if Sim[j,i,ij,1] > 0.05 && Sim[j,i,ij,2] > 0.0
suma = suma + Sim[j,i,ij,1]/Sim[j,i,ij,2]
counta = counta + 1
end
end
end
lev[ij] = suma/counta
end
``````

After running Sim once and then running the loop multiple times, one gets a different result every time, for some reason. Not so in the serial version. Does anybody have any explanation for this?

1 Like

After running Sim once and then running the loop multiple times, one gets a different result every time

Cannot reproduce this:

``````julia> Sim = randn(50,12,10,2)
50×12×10×2 Array{Float64,4}:
LOTS OF NUMBERS

julia> lev = zeros(10)
10-element Array{Float64,1}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0

julia> for ij = 1:10
suma = 0.0
counta = 0.0
for j = 1:50
if Sim[j,i,ij,1] > 0.05 && Sim[j,i,ij,2] > 0.0
suma = suma + Sim[j,i,ij,1]/Sim[j,i,ij,2]
counta = counta + 1
end
end
end
lev[ij] = suma/counta
end

julia> lev
10-element Array{Float64,1}:
3.602256205117252
6.076811055555966
4.482345464058883
2.6988466255185792
4.717570071596839
4.040129042647047
5.85015237256851
2.3409702687463794
6.324949754983045
3.2099170858509183

julia> for ij = 1:10
suma = 0.0
counta = 0.0
for j = 1:50
if Sim[j,i,ij,1] > 0.05 && Sim[j,i,ij,2] > 0.0
suma = suma + Sim[j,i,ij,1]/Sim[j,i,ij,2]
counta = counta + 1
end
end
end
lev[ij] = suma/counta
end

julia> lev
10-element Array{Float64,1}:
3.602256205117252
6.076811055555966
4.482345464058883
2.6988466255185792
4.717570071596839
4.040129042647047
5.85015237256851
2.3409702687463794
6.324949754983045
3.2099170858509183
``````

It seems that you are performing a reduction on `suma` then you probably encounter data races.
You may accumulate on a different variable on each thread (or use atomic).

2 Likes

I can show you what it looks like in mine:

``````lev
10-element Array{Float64,1}:
3.7538904363191454
1.4806699022971184
3.380346307030155
1.5777107005713031
⋮
1.3869407970389027
0.9888372404478836
5.494930442090792
1.5369875031486322
``````

second run:

``````lev
10-element Array{Float64,1}:
8.871407349458194
2.305630998574381
2.9425926928342405
2.682961326061693
⋮
1.6146984391226809
1.5934577478658278
2.603503063167422
2.510691292496566
``````

Mind try

``````suma = Threads.Atomic{Float64}(0);
``````

?

I’m not an expert (to say the least), but won’t it be much faster to aggregate into an array or tuple? I seem to remember that atomics are slow.

Edit: I guess tuples, being immutable, aren’t suited.

You are right and it was my first suggestion. I propose atomics because the fix is easier can confirm data race.

BTW this kind of bugs can be really awful because they may happen rarely. Explicit thread programming is really hard and should be avoided by using parallel patterns when possible.

1 Like

Moving the threading to the outermost loop removes the data races, and also runs twice as fast (on my laptop):

``````function bar!(S, lev)
suma = 0.0
counta = 0
for i in axes(S, 2)
for j in axes(S, 1)
if S[j,i,ij,1] > 0.05 && S[j,i,ij,2] > 0
suma += S[j,i,ij,1] / S[j,i,ij,2]
counta += 1
end
end
end
lev[ij] = suma / counta
end
return lev
end
``````

Edit: Sorry! Bad typo in original code!!

The speedup of moving the `@threads` macro call is actually >12x.

6 Likes

Thanks that’s really helpful! That formulation makes much more sense.

I’m sorry, I’m not familiar with the terms you’re using. What is the difference between explicit thread programming and parallel patterns?

My bad: I was rather unclear especially in the Julia context.

Parallel patterns are parallel versions of high order functions (functions of function) like `map`,`reduce`,`scan`,`foldl/r`… You can find good example of these in C++ TBBs or even better in the fantastic (but confidential) Futhark.

• a parallel map operation
`parallel_map(f, c...) -> collection` with f being a pure function (with no side effect and no mutation on `c...`.)
• a parallel reduce operation
`parallel_reduce(op, itr) -> result` with `op` being a pure operator and no mutation on `c...`.)

Using such constructs requires the user to identify what he is doing (a map, a reduction, a scan…) and to ensure that the passed function/operator has no side effects.

• The first advantage is that these parallel higher order functions can be carefully designed and tuned and reused in multiple contexts.

• The second advantage is that it forces the user to think about its computation (I am doing a reduction ?).

• Finally, it could help to evolve toward distributed computing or GPU computing more easily.

IMHO, the biggest problem is to ensure that a function has no side effect in Julia. This is why I open a new thread on this topic.

I think that I have seen somewhere (where ?) that some Julia core dev were considering implicit parallelism (a standard map could be executed in parallel). This would be a crazy idea in other languages because you can never be sure that the passed function is not parallel. Since Julia looks exceptionally good at handling nested parallelism it may me sound less crazy… but the problem of possible function side effects remain.

On the contrary, it looks very easy (too easy) to add a @thread on a parallel loop to see if this brings some speed-up. It often does, and it is also likely that potential data races remain silent for a while because the user didn’t even consider this potential issue. Spawning task can result in very difficult bugs (e.g. random deadlocks).

Because Julia gathers users from different community (math/science and CS) (which is trully wonderful), it will expose more users to these advanced technical difficulties. Hence my concern for strong warning in the documentation and parallel debug tools.

3 Likes

Thanks for the extremely detailed and useful answer. It is indeed extremely easy to just put @threads in front of loops for easy speed gains.

Edit: at the expense of possible (nearly) invisible errors, of course.

I agree with the need of a strong warning. Parallel programming with language extensions, such as OpenMP or CUDA, makes it clear to the programmer that they’re doing something different, and there’s tons of supporting documentation. In Julia, it’s almost too easy to slap on `@threads` to a loop. More examples of good uses of thread parallelization would be useful. Steering programmers towards implicit parallelism is also a good idea. It would be nice to have a good set of concurrent data structures that programmers could use safely, something like Java’s concurrent collections.

5 Likes

Does anybody have a suggestion of a good source to read up on the issue of parallel patterns in Julia, given the post by @LaurentPlagne? The way that this is dealt with the the official Julia documentation is, from a beginner’s point of view, not sufficient both in terms of examples and in terms of the presentation of the tools of multi-threading to a person who wants to begin using that feature (or parallel computing in general). Thanks

2 Likes

Hi Joao, I think that @mbauman lecture on parallel computing in Julia is quite nice (https://juliaacademy.com).

Please forgive me for the following non solicited piece of advice (I spent many years teaching HPC in an industrial context and often had strong difficulties to warn against early use of parallel programming). They are probably irrelevant to your situation.

I would advice to spend some time to learn about code optimization and computer architecture (in general (and in Julia with the performance tips of the documentation)) before tackling parallel computing. One should have a clear knowledge of what are memory bound and compute bound problems. About the memory wall and the roof-line model. Have a basic idea about how many cycles are necessary for +,-,*,/,sqrt,exp,… operations on common numerical types, for access to contiguous and non contiguous piece of memory. Basic knowledge of cache hierarchy. Some knowledge of the cost of (predictable, and unpredictable) code branches.

If these notions are clear , then one should be able to assess the (absolute) performance of a given algorithm implementation and decide if (s)he should spend some more time to optimize it, look for better algorithm or … enter parallel programming.

Concerning your initial data race problem, the wikipedia article is quite complete (I just learned that Go have a tool to detect data races).

What is sometime poorly foreseen is that, although these kind of bugs are simple to explain and to spot on small examples, their non-deterministic nature makes them extremely dangerous at scale. If you consider to add a @thread on a loop calling a function, then you should be able to prove that the function is thread safe. For example, a simple IO operation, or a tiny counter inside the function can be forgotten and you have built a ticking bomb. Or the function is really thread safe and have locks on these IO or counter and you will spend time to understand why the SpeedUp is not here (much better situation).

Parallel computing is really interesting and useful but it is rather hard and shared memory parallelism is in my opinion quite dangerous for developers (and users) and as such should be handled with great care.

I think that @arch.d.robison have made important contribution to simd introduction in Julia and designed the C++ Intel’s Threading building blocks. Maybe he could add something one safe MT programming in Julia

8 Likes

Thanks for the excellent suggestions Laurent. They are extremely useful for a beginner like me.

I just read this thread from @tkf who provides neat packages for Julia parallel patterns.

FYI FLoops.jl is trying to provide a “user-friendly” interface to parallel (and sequential) loops. In particular, it can detect the error as in the OP:

(But I should say that I had to fix two bugs for this example to work. FLoops.jl is a quite new package. It may find your mistakes but it may contain its own bugs .)

A data race-free implementation using FLoops.jl is something like

``````function demo()
Sim = randn(50,12,10,2)
lev = zeros(10)
for ij = 1:10
@floop for i = 1:12
for j = 1:50
if Sim[j,i,ij,1] > 0.05 && Sim[j,i,ij,2] > 0.0
@reduce(
suma = 0.0 + Sim[j,i,ij,1]/Sim[j,i,ij,2],
counta = 0.0 + 1,
)
end
end
end
lev[ij] = suma/counta
end
Sim
end
``````
3 Likes