Multi-threading changing results

Joao_Barata · August 10, 2020, 10:16am

Hello,

I’m doing some parallel loops in my code using multi-threading. As I was reviewing my results, I noticed that I kept having different results in a piece of code, although the input data was always the same. So instead of having a parallel version, I moved to a serial version, which confirmed that multi-threading was somehow producing random results.

Here is a MWV of what i’m doing:

Sim = randn(50,12,10,2)
lev = zeros(10)
for ij = 1:10
    suma = 0.0
    counta = 0.0
    Threads.@threads for i = 1:12
        for j = 1:50
            if Sim[j,i,ij,1] > 0.05 && Sim[j,i,ij,2] > 0.0
                suma = suma + Sim[j,i,ij,1]/Sim[j,i,ij,2]
                counta = counta + 1
            end
        end
    end
    lev[ij] = suma/counta
end

After running Sim once and then running the loop multiple times, one gets a different result every time, for some reason. Not so in the serial version. Does anybody have any explanation for this?

fbanning · August 10, 2020, 10:34am

After running Sim once and then running the loop multiple times, one gets a different result every time

Cannot reproduce this:

julia> Sim = randn(50,12,10,2)
50×12×10×2 Array{Float64,4}:
LOTS OF NUMBERS

julia> lev = zeros(10)
10-element Array{Float64,1}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0

julia> for ij = 1:10
          suma = 0.0
          counta = 0.0
          Threads.@threads for i = 1:12
              for j = 1:50
                  if Sim[j,i,ij,1] > 0.05 && Sim[j,i,ij,2] > 0.0
                      suma = suma + Sim[j,i,ij,1]/Sim[j,i,ij,2]
                                      counta = counta + 1
                  end
              end
          end
          lev[ij] = suma/counta
      end

julia> lev
10-element Array{Float64,1}:
3.602256205117252
6.076811055555966
4.482345464058883
2.6988466255185792
4.717570071596839
4.040129042647047
5.85015237256851
2.3409702687463794
6.324949754983045
3.2099170858509183

julia> for ij = 1:10
          suma = 0.0
          counta = 0.0
          Threads.@threads for i = 1:12
              for j = 1:50
                  if Sim[j,i,ij,1] > 0.05 && Sim[j,i,ij,2] > 0.0
                      suma = suma + Sim[j,i,ij,1]/Sim[j,i,ij,2]
                                      counta = counta + 1
                  end
              end
          end
          lev[ij] = suma/counta
      end

julia> lev
10-element Array{Float64,1}:
3.602256205117252
6.076811055555966
4.482345464058883
2.6988466255185792
4.717570071596839
4.040129042647047
5.85015237256851
2.3409702687463794
6.324949754983045
3.2099170858509183

LaurentPlagne · August 10, 2020, 10:39am

It seems that you are performing a reduction on suma then you probably encounter data races.
You may accumulate on a different variable on each thread (or use atomic).
https://docs.julialang.org/en/v1/manual/multi-threading/#Atomic-Operations-1

Joao_Barata · August 10, 2020, 10:44am

I can show you what it looks like in mine:

lev
10-element Array{Float64,1}:
 3.7538904363191454
 1.4806699022971184
 3.380346307030155
 1.5777107005713031
 ⋮
 1.3869407970389027
 0.9888372404478836
 5.494930442090792
 1.5369875031486322

second run:

lev
10-element Array{Float64,1}:
 8.871407349458194
 2.305630998574381
 2.9425926928342405
 2.682961326061693
 ⋮
 1.6146984391226809
 1.5934577478658278
 2.603503063167422
 2.510691292496566

LaurentPlagne · August 10, 2020, 10:48am

Mind try

suma = Threads.Atomic{Float64}(0); 
counta =Threads.Atomic{Int}(0);

?

DNF · August 10, 2020, 10:53am

I’m not an expert (to say the least), but won’t it be much faster to aggregate into an array ~~or tuple~~? I seem to remember that atomics are slow.

Edit: I guess tuples, being immutable, aren’t suited.

LaurentPlagne · August 10, 2020, 11:06am

You are right and it was my first suggestion. I propose atomics because the fix is easier can confirm data race.

BTW this kind of bugs can be really awful because they may happen rarely. Explicit thread programming is really hard and should be avoided by using parallel patterns when possible.

DNF · August 10, 2020, 11:14am

Moving the threading to the outermost loop removes the data races, and also runs twice as fast (on my laptop):

function bar!(S, lev)
    Threads.@threads for ij in axes(S, 3)
        suma = 0.0
        counta = 0
        for i in axes(S, 2)
            for j in axes(S, 1)
                if S[j,i,ij,1] > 0.05 && S[j,i,ij,2] > 0
                    suma += S[j,i,ij,1] / S[j,i,ij,2]
                    counta += 1
                end
            end
        end
        lev[ij] = suma / counta
    end
    return lev
end

Edit: Sorry! Bad typo in original code!!

The speedup of moving the @threads macro call is actually >12x.

Joao_Barata · August 10, 2020, 12:47pm

Thanks that’s really helpful! That formulation makes much more sense.

Joao_Barata · August 10, 2020, 12:48pm

I’m sorry, I’m not familiar with the terms you’re using. What is the difference between explicit thread programming and parallel patterns?

LaurentPlagne · August 10, 2020, 1:35pm

My bad: I was rather unclear especially in the Julia context.

Parallel patterns are parallel versions of high order functions (functions of function) like map,reduce,scan,foldl/r… You can find good example of these in C++ TBBs or even better in the fantastic (but confidential) Futhark.

a parallel map operation
parallel_map(f, c...) -> collection with f being a pure function (with no side effect and no mutation on c....)
a parallel reduce operation
parallel_reduce(op, itr) -> result with op being a pure operator and no mutation on c....)

Using such constructs requires the user to identify what he is doing (a map, a reduction, a scan…) and to ensure that the passed function/operator has no side effects.

The first advantage is that these parallel higher order functions can be carefully designed and tuned and reused in multiple contexts.
The second advantage is that it forces the user to think about its computation (I am doing a reduction ?).
Finally, it could help to evolve toward distributed computing or GPU computing more easily.

IMHO, the biggest problem is to ensure that a function has no side effect in Julia. This is why I open a new thread on this topic.

I think that I have seen somewhere (where ?) that some Julia core dev were considering implicit parallelism (a standard map could be executed in parallel). This would be a crazy idea in other languages because you can never be sure that the passed function is not parallel. Since Julia looks exceptionally good at handling nested parallelism it may me sound less crazy… but the problem of possible function side effects remain.

Explicit thread programming:
On the contrary, it looks very easy (too easy) to add a @thread on a parallel loop to see if this brings some speed-up. It often does, and it is also likely that potential data races remain silent for a while because the user didn’t even consider this potential issue. Spawning task can result in very difficult bugs (e.g. random deadlocks).

Because Julia gathers users from different community (math/science and CS) (which is trully wonderful), it will expose more users to these advanced technical difficulties. Hence my concern for strong warning in the documentation and parallel debug tools.

Joao_Barata · August 10, 2020, 3:05pm

Thanks for the extremely detailed and useful answer. It is indeed extremely easy to just put @threads in front of loops for easy speed gains.

Edit: at the expense of possible (nearly) invisible errors, of course.

eaubanel · August 10, 2020, 3:19pm

I agree with the need of a strong warning. Parallel programming with language extensions, such as OpenMP or CUDA, makes it clear to the programmer that they’re doing something different, and there’s tons of supporting documentation. In Julia, it’s almost too easy to slap on @threads to a loop. More examples of good uses of thread parallelization would be useful. Steering programmers towards implicit parallelism is also a good idea. It would be nice to have a good set of concurrent data structures that programmers could use safely, something like Java’s concurrent collections.

Joao_Barata · August 12, 2020, 6:54pm

Does anybody have a suggestion of a good source to read up on the issue of parallel patterns in Julia, given the post by @LaurentPlagne? The way that this is dealt with the the official Julia documentation is, from a beginner’s point of view, not sufficient both in terms of examples and in terms of the presentation of the tools of multi-threading to a person who wants to begin using that feature (or parallel computing in general). Thanks

LaurentPlagne · August 13, 2020, 12:52am

Hi Joao, I think that @mbauman lecture on parallel computing in Julia is quite nice (https://juliaacademy.com).

Please forgive me for the following non solicited piece of advice (I spent many years teaching HPC in an industrial context and often had strong difficulties to warn against early use of parallel programming). They are probably irrelevant to your situation.

I would advice to spend some time to learn about code optimization and computer architecture (in general (and in Julia with the performance tips of the documentation)) before tackling parallel computing. One should have a clear knowledge of what are memory bound and compute bound problems. About the memory wall and the roof-line model. Have a basic idea about how many cycles are necessary for +,-,*,/,sqrt,exp,… operations on common numerical types, for access to contiguous and non contiguous piece of memory. Basic knowledge of cache hierarchy. Some knowledge of the cost of (predictable, and unpredictable) code branches.

If these notions are clear , then one should be able to assess the (absolute) performance of a given algorithm implementation and decide if (s)he should spend some more time to optimize it, look for better algorithm or … enter parallel programming.

Concerning your initial data race problem, the wikipedia article is quite complete (I just learned that Go have a tool to detect data races).

What is sometime poorly foreseen is that, although these kind of bugs are simple to explain and to spot on small examples, their non-deterministic nature makes them extremely dangerous at scale. If you consider to add a @thread on a loop calling a function, then you should be able to prove that the function is thread safe. For example, a simple IO operation, or a tiny counter inside the function can be forgotten and you have built a ticking bomb. Or the function is really thread safe and have locks on these IO or counter and you will spend time to understand why the SpeedUp is not here (much better situation).

Parallel computing is really interesting and useful but it is rather hard and shared memory parallelism is in my opinion quite dangerous for developers (and users) and as such should be handled with great care.

I think that @arch.d.robison have made important contribution to simd introduction in Julia and designed the C++ Intel’s Threading building blocks. Maybe he could add something one safe MT programming in Julia

Joao_Barata · August 13, 2020, 3:31pm

Thanks for the excellent suggestions Laurent. They are extremely useful for a beginner like me.

LaurentPlagne · August 20, 2020, 6:38am

I just read this thread from @tkf who provides neat packages for Julia parallel patterns.

tkf · August 21, 2020, 3:54am

FYI FLoops.jl is trying to provide a “user-friendly” interface to parallel (and sequential) loops. In particular, it can detect the error as in the OP:

(But I should say that I had to fix two bugs for this example to work. FLoops.jl is a quite new package. It may find your mistakes but it may contain its own bugs .)

A data race-free implementation using FLoops.jl is something like

function demo()
    Sim = randn(50,12,10,2)
    lev = zeros(10)
    for ij = 1:10
        @floop for i = 1:12
            for j = 1:50
                if Sim[j,i,ij,1] > 0.05 && Sim[j,i,ij,2] > 0.0
                    @reduce(
                        suma = 0.0 + Sim[j,i,ij,1]/Sim[j,i,ij,2],
                        counta = 0.0 + 1,
                    )
                end
            end
        end
        lev[ij] = suma/counta
    end
    Sim
end

LaurentPlagne · August 21, 2020, 6:51am

I did not knew about this data race diagnostic feature: awesome !!!

Topic		Replies	Views
Multi-threading on a 2 CPU system New to Julia multithreading	15	1143	February 2, 2023
Poor performance while multithreading (Julia 1.0) Performance multithreading	28	4036	February 11, 2019
Slowdown when multi-threading is used General Usage	8	2235	January 19, 2017
Data structures for threaded computing Performance	23	2909	October 23, 2019
Parallelizing for loop in the computation of a gradient Performance question	19	2624	February 26, 2018

Multi-threading changing results

Related topics