Help understanding what happens in a `@threads` loop with race conditions

I tried something silly that went wrong, and I’m trying to get a better understanding of exactly why it happens. Consider the following:

using Base.Threads

function mask!(v, M)
    @threads for j ∈ 1:size(M,2)   # threads split the work over columns j
        for i ∈ 1:size(M,1)
            iszero(v[i]) && continue
            v[i] *= M[i,j]   # non-atomic read-modify-write of shared v[i]
        end
    end
    v
end


const n = 10^6

v = ones(Int, n)
M = rand(0:1, n, 10)

mask!(v, M)

Clearly there are race conditions in this loop. My instinct was that it should, nevertheless, always return the same result; my reasoning was the following:

  • For each element in v we are checking whether any of the corresponding elements of M are zero. If they are, we set the element of v to zero; that’s it.
  • Once an element of v is set to zero, there should be no way to change it back.

Of course I should never have even considered doing anything like this in the first place. I’m not posting here to say that I want to do this; I’m posting to try to understand exactly what happens.

My current guess is that the elements of v are getting cached somewhere (perhaps in L1 or L2), so when other threads set them to 0, the update can sometimes be missed.

For a given v and M, the results are identical if run with or without @threads.
If v[i] is not 0, then v[i] = 1 * M[i,j];
but M[i,j] can be 1 for all j, so v[i] stays 1.

Correct me if I missed something in your algorithm.

julia> findall(!iszero,v)
951-element Array{Int64,1}:
   1543
   2694
...
julia> M[1543,:]
10-element Array{Int64,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

Or better, can you elaborate on the error case you are seeing?
(I edited the above, because I over-interpreted and thought you expected all of v to be 0.)

This is definitely not true.

using Random

function test(seed=999, n=n, m=10)
    v = ones(Int, n)
    Random.seed!(seed)   # fixed seed: M is identical across runs
    M = rand(0:1, n, m)
    sum(mask!(v, M))
end

With @threads

julia> test()
3180

julia> test()
3787

julia> test()
1955

julia> test()
3822

julia> test()
2080

Without @threads

julia> test()
954

julia> test()
954

julia> test()
954

julia> test()
954

julia> test()
954

Again, my question is more “what’s happening” than “how do I do this”. If I had to write this again (my original use case was more complicated, but the above is an MWE), I would make sure that each element of v is owned by exactly one thread.

Yes, that’s right, but my point was that if any of the M[i,j] are 0 for a given i, then v[i] should be set to 0. Only in cases where M[i,j] is 1 for all j do I expect v[i] to be non-zero. This works without @threads, but fails with it.

Note that the following code correctly gives the same result as the sequential version of mask!:

function mask2!(v, M)
    @threads for i ∈ 1:length(v)
        all(m > 0 for m ∈ M[i,:]) && continue
        v[i] = 0   # each v[i] is written by exactly one thread, so no race
    end
    v
end

You are right, I was able to reproduce your outcome.

Is it possible that the data for the concurrent threads is split over index j, i.e. over the columns (10), so the threads don’t see the complete row when executing v[i] *= M[i,j]?
Is this the race condition you are seeing?
So, I would say the random results are to be expected. Whatever thread is last for a given v[i] decides whether v[i] is 0 or 1. Nothing is missed in any cache; it’s just that the algorithm in the above example is not thread safe.
If you change the loop order to:

    @threads for i in 1:size(M,1)   # parallelize over rows instead
        for j in 1:size(M,2)        # inner loop over the columns

it is thread safe and the results are always equal to the non-threaded version.
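
Spelled out in full, the reordered version would look something like this (my sketch based on the original mask!; the name mask_rows! is mine, and continue becomes break because once v[i] is zero the rest of its row can be skipped entirely):

function mask_rows!(v, M)
    # Parallelize over rows: each v[i] is read and written by exactly
    # one thread, so there is no race on v.
    @threads for i in 1:size(M,1)
        for j in 1:size(M,2)
            iszero(v[i]) && break   # row already masked to zero
            v[i] *= M[i,j]
        end
    end
    v
end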

v[i] can be set to 0 by thread 1 while thread 2 still sees the 1. Yes, this is because of caching.

Yeah, this is my suspicion as well. It would be great if one of the compiler guys could come along and give more explicit details on how Julia decides to cache those values, if such details are even knowable.

Yes, it would be interesting, but I don’t think compilers decide about those caches (L1, L2, L3).

Here is a good starting point… I just followed some links and skimmed for an overview, but it’s a topic I won’t go into too deeply: too time-consuming and not much to gain (except for the knowledge):
https://stackoverflow.com/questions/40292928/how-to-use-read-write-cpu-caches-l1-l2-l3

(I’m just a curious learner on this topic so please don’t consider what I say as an expert opinion. Having said that…)

I think this is expected. Although Julia does not specify its memory model, I guess it’s safe to assume it’s aiming to provide sequential consistency for data-race-free programs (SC for DRF), like C, C++ and Java do. In that case, as soon as you write a data race, your program’s behavior is undefined and you cannot expect any sane behavior. This is because, unless you use some synchronization protocol (e.g., atomics), the compiler and the hardware (CPU/cache) are free to transform (optimize) your program into something else as long as the difference is not observable in single-threaded execution. For example, the compiler might change the inner loop to load four elements from v at a time, do v[i] *= M[i,j], and then write back all four elements even if they are not modified.
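
To make that last point concrete, here is a hand-written sketch of what such a transformation could look like. This is illustrative Julia, not actual compiler output, and the blocking by four (and the helper name) is purely hypothetical:

function inner_column_blocked!(v, M, j)
    # Assume size(M,1) is a multiple of 4 to keep the sketch simple.
    for i in 1:4:size(M,1)
        # Load four elements of v into locals (think: one SIMD register).
        a, b, c, d = v[i], v[i+1], v[i+2], v[i+3]
        iszero(a) || (a *= M[i,j])
        iszero(b) || (b *= M[i+1,j])
        iszero(c) || (c *= M[i+2,j])
        iszero(d) || (d *= M[i+3,j])
        # Store ALL four back, even the untouched ones. Single-threaded this
        # is unobservable; with threads it can overwrite a 0 that another
        # thread stored in the meantime.
        v[i], v[i+1], v[i+2], v[i+3] = a, b, c, d
    end
    v
end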

FYI, I think this two-part talk is a nice introduction to this topic. In fact, my reply is solely based on what I learnt from it:

To write mask!, I’d just use a threaded map over the first axis of M, even though it makes the memory access of the inner loop non-contiguous. Of course, it’d be nice to store the transposed version of M if other parts of the program are OK with it.
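
For instance (a sketch with my own names, not code from this thread), with Mt = collect(transpose(M)) the inner reduction walks down a column of Mt, which is contiguous in memory:

using Base.Threads

function mask_cols!(v, Mt)
    # Mt is the transpose of M, so column i of Mt is row i of the original.
    @threads for i in 1:length(v)
        v[i] *= all(!iszero, @view Mt[:, i])   # Bool promotes to 0 or 1
    end
    v
end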

Hi,

v[i] *= M[i,j]

is not an atomic operation, so this is expected. For simplicity, imagine you only have two threads, and M is just 1x4: [1 0 0 1]

Since you split the work between threads by columns, thread 1 will take care of the first two columns ([1 0]) and the second thread of the other two ([0 1]).

Assume each thread is working on its first element. Each one sees that v[1] is not zero and computes the value v[1]*M[1,j]. For the first thread this will be 1 and for the second thread it will be 0, but since the operation is not atomic, the final value of v[1] could be 0 or 1, depending on whose store operation lands last.
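
One possible interleaving for v[1] (with M = [1 0 0 1] as above) that loses the zero:

    thread 1 (j = 1): reads v[1] == 1, computes 1*1 == 1
    thread 2 (j = 3): reads v[1] == 1, computes 1*0 == 0
    thread 2:         stores v[1] = 0
    thread 1:         stores v[1] = 1   ← the zero is lost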

Compare these two functions (modified from your code to just work on vectors, and using addition instead). The first one, following your approach, exhibits the same problem. Using mask_t! (with atomic_add!), the problem goes away:

using Base.Threads
using Random

function mask!(v, M)
    @threads for i ∈ 1:length(M)
        v += M[i]   # racy read-modify-write of the shared (boxed) variable v
    end
    v
end

function mask_t!(v, M)
    r = Atomic{eltype(v)}(one(eltype(v)))
    @threads for i ∈ 1:length(M)
        atomic_add!(r, M[i])   # atomic read-modify-write: updates cannot be lost
    end
    r[]
end


function test(seed=999, n=10^6)
    println("Number of threads: ", nthreads())
    v = one(Int)
    Random.seed!(seed)
    M = rand(0:1, n)
    println(sum(mask!(v, M)))
    println(sum(mask_t!(v, M)))
end

julia> test()
Number of threads: 4
126877
500543

julia> test()
Number of threads: 4
126268
500543

julia> test()
Number of threads: 4
126882
500543

Cheers,
AdV

This would be an incorrect assumption, since the “test and set” is NOT a protected operation: multiple threads could do the test at the same time, which means they will all do the set… not sure who the “winner” would be. And to make matters worse… if they are using a cached value in the test, the threads don’t even have to be testing at the same time; they could be “testing” old data.
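
In other words, each thread effectively runs this unprotected check-then-act sequence (a pseudocode sketch, not actual generated code):

tmp = v[i]                # the “test”: may even read a stale value
if !iszero(tmp)
    v[i] = tmp * M[i,j]   # the “set”: can overwrite a 0 stored by another thread
end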

Thanks for the information, all. Sounds like the issue is pretty much exactly what I thought it was.

You might have brought up atomics just as a demonstration. But, for completeness, let me mention that using atomics for this kind of computation is not an efficient approach (although it yields the correct answer), since every thread contends for the same memory location. It’s a good idea to use a threaded reduce whenever possible.
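
For example, a per-task partial-sum reduction avoids both the race and the contention on a single shared Atomic. A minimal sketch (the helper name threaded_sum is mine):

using Base.Threads

function threaded_sum(M)
    n = length(M)
    chunk = cld(n, nthreads())
    # One task per chunk: each task reduces its chunk privately (no shared
    # mutable state), and the partial sums are combined serially at the end.
    tasks = [Threads.@spawn(sum(@view M[lo:min(lo + chunk - 1, n)]))
             for lo in 1:chunk:n]
    return sum(fetch, tasks)
end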

Sure. @ExpandingMan clearly stated in the original post that the idea was just to understand what was going on in his original code. My post didn’t focus on performance at all; it just tried to explain that the algorithm, as written, would not work unless atomic operations were used.