I tried something silly that went wrong, and I’m trying to get a better understanding of exactly why this happens. Consider the following:
using Base.Threads
function mask!(v, M)
    @threads for j ∈ 1:size(M,2)
        for i ∈ 1:size(M,1)
            iszero(v[i]) && continue
            v[i] *= M[i,j]
        end
    end
    v
end
const n = 10^6
v = ones(Int, n)
M = rand(0:1, n, 10)
mask!(v, M)
Clearly there are race conditions in this loop. My instinct was that it should, nevertheless, always return the same results. My reasoning was the following:
For each element in v we are checking whether any of the corresponding elements of M are zero. If any of them is, we set the element of v to zero; that’s it.
Once an element of v is set to zero, there should be no way to change it back.
Of course, I should never have even considered doing anything like this in the first place. I’m not posting here to say that I want to do this; I’m posting to try to understand exactly what happens.
My current guess is that the elements of v are getting cached somewhere (perhaps in the L1 or L2 cache), so when other threads set them to 0 this can sometimes get missed.
For a given v and M, the results are identical if run with or without @threads.
v[i] is not 0 => v[i] = 1 * M[i,j]
but M[i,j] can be 1 for all j, so v[i] stays 1
Correct me if I missed something in your algorithm.
Again, my question is more “what’s happening” not so much “how do I do this”. If I had to write this again (my original use case was more complicated, but the above is an MWE), I would make sure that each element of v is owned by exactly 1 thread.
Yes, that’s right, but my point was that if any of the M[i,j] are 0, then v[i] should be set to 0. Only in cases where M[i,j] is one for all j do I expect v[i] to be non-zero. This works without @threads, but fails with it.
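One way to see the failure (a sketch, not from the original post; v_seq is a hypothetical name) is to compare the threaded result against a sequential run:

v_seq = ones(Int, n)
for j in 1:size(M,2), i in 1:size(M,1)    # same loops, without @threads
    iszero(v_seq[i]) && continue
    v_seq[i] *= M[i,j]
end
count(mask!(ones(Int, n), M) .!= v_seq)   # often non-zero when nthreads() > 1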
Note that the following code correctly gives the same result as the sequential version of mask!
function mask2!(v, M)
    @threads for i ∈ 1:length(v)
        all(m > 0 for m ∈ M[i,:]) && continue
        v[i] = 0
    end
    v
end
You are right, I was able to reproduce your result.
Is it possible that the work for the concurrent threads is split over the index j, i.e. over the 10 columns, so the threads don’t see the complete row for v[i] *= M[i,j]?
Is this the race condition you are seeing?
So, I would say the random results are to be expected. Whatever thread is last for a given v[i] decides whether v[i] is 0 or 1. Nothing is missed in any cache; it’s just that the algorithm in the above example is not thread safe.
If you change the order with:
@threads for i in 1:size(M,1)
    for j in 1:size(M,2)
it is thread safe and the results are always equal to those of the non-threaded version.
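For completeness, a runnable version of that swapped-loop suggestion might look like the following sketch (mask_swapped! is a hypothetical name, not from the thread); each v[i] is now read and written by exactly one thread, so there is no race on v:

function mask_swapped!(v, M)
    @threads for i in 1:size(M,1)
        for j in 1:size(M,2)
            iszero(v[i]) && break   # nothing left to do for this row
            v[i] *= M[i,j]
        end
    end
    v
end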
v[i] can be set to 0 by thread 1 while thread 2 still sees the 1. Yes, this is because of caching.
Yeah, this is my suspicion as well. It would be great if one of the compiler guys could come along and give more explicit details on how Julia decides to cache those values, if such details are even knowable.
(I’m just a curious learner on this topic so please don’t consider what I say as an expert opinion. Having said that…)
I think this is expected. Although Julia does not specify its memory model, I guess it’s safe to assume it’s aiming to provide sequential consistency for data race-free programs (SC for DRF), like C, C++ and Java do. In that case, as soon as you write a data race, your program’s behavior is undefined and you cannot expect any sane behavior. This is because, unless you use some synchronization protocol (e.g., atomics), the compiler and the hardware (CPU/cache) are free to transform (optimize) your program into something else as long as the difference is not observable in single-threaded execution. For example, the compiler might change the inner loop to load four elements from v at a time, do v[i] *= M[i,j], and then write back all four elements even if they were not modified.
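As a hand-written illustration of that kind of transformation (a sketch only; mask_blocked! is a hypothetical name, and real compilers do this at the register/SIMD level rather than with explicit slices), the following is equivalent to mask! for a single thread when the row count is a multiple of 4, yet under threads it can overwrite a 0 that another thread just stored:

function mask_blocked!(v, M)
    @threads for j in 1:size(M,2)
        for i in 1:4:size(M,1)-3
            chunk = v[i:i+3]              # load four elements at once
            for k in 1:4
                iszero(chunk[k]) && continue
                chunk[k] *= M[i+k-1, j]
            end
            v[i:i+3] = chunk              # write all four back, including the
                                          # untouched ones
        end
    end
    v
end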
FYI, I think this two-part talk is a nice introduction to this topic. In fact, my reply is solely based on what I learnt from it:
To write mask!, I’d just use threaded map over the first axis of M, even though it makes the memory access of the inner loop non-contiguous. Of course, it’d be nice to store the transposed version of M if other parts of the program are OK with it.
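Something like the following sketch would do that with a transposed copy, so each task reads a contiguous column (Mt and mask_cols! are hypothetical names, not from the thread):

function mask_cols!(v, Mt)               # Mt[j, i] == M[i, j]
    @threads for i in axes(Mt, 2)
        any(iszero, view(Mt, :, i)) && (v[i] = 0)
    end
    v
end

Mt = permutedims(M)
mask_cols!(ones(Int, n), Mt)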
v[i] *= M[i,j] is not an atomic operation, so this is expected. For simplicity, imagine you only have two threads and M is just 1x4: [1 0 0 1].
Since you split the work in threads by columns, thread 1 will take care of the first two columns [1 0] and the second thread of the other columns [0 1].
Assume each thread is working with their first element. Each one sees that v[1] is not zero and they compute the value v[1]*M[i,j]. For the first thread this will be 1 and for the second thread it will be 0, but since the operation is not atomic, the final value of v[1] could be 0 or 1, depending on which thread performs the store last.
Compare these two functions (modified from your code to just work on vectors, and working with addition instead). The first one, following your approach, exhibits the same problem. Using mask_t! (with atomic_add!), the problem goes away:
using Base.Threads
using Random

function mask!(v, M)
    @threads for i ∈ 1:length(M)
        v += M[i]              # unsynchronized read-modify-write: updates can be lost
    end
    v
end

function mask_t!(v, M)
    r = Atomic{eltype(v)}(one(eltype(v)))
    @threads for i ∈ 1:length(M)
        atomic_add!(r, M[i])   # atomic update: no lost increments
    end
    r[]
end
function test(seed=999, n=10^6)
    println("Number of threads: ", nthreads())
    v = one(Int)
    Random.seed!(seed)
    M = rand(0:1, n)
    println(sum(mask!(v, M)))
    println(sum(mask_t!(v, M)))
end
julia> test()
Number of threads: 4
126877
500543
julia> test()
Number of threads: 4
126268
500543
julia> test()
Number of threads: 4
126882
500543
This would be the incorrect assumption: the “test and set” is NOT a protected operation, so multiple threads could do the test at the same time, which means they will all do the set… not sure who the “winner” would be. And to make matters worse… if they are using a cached value in the test, the threads don’t even have to be testing at the same time; they could be “testing” old data.
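For reference, a protected version of that test-and-set for the 0/1 matrix in the MWE could look like the following sketch (not from the thread; mask_atomic! and av are hypothetical names), with each element of v stored as an Atomic so a clear can never be lost or read stale:

using Base.Threads

function mask_atomic!(av::Vector{Atomic{Int}}, M)
    @threads for j in 1:size(M,2)
        for i in 1:size(M,1)
            # atomic AND with 0 clears the element in one step; there is no
            # separate non-atomic test whose result can go stale
            iszero(M[i,j]) && atomic_and!(av[i], 0)
        end
    end
    av
end

av = [Atomic{Int}(1) for _ in 1:n]
mask_atomic!(av, M)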
You might have brought up atomics just as a demonstration. But, for completeness, let me mention that using atomics for this kind of computation is not an efficient approach (although it yields the correct answer). It’s a good idea to use a threaded reduce whenever possible.
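A threaded reduce for the addition example above might look like this sketch (threaded_sum is a hypothetical name; assumes Threads.@spawn is available), where each task sums its own chunk and only the partial results are combined:

using Base.Threads

function threaded_sum(M)
    chunks = Iterators.partition(eachindex(M), cld(length(M), nthreads()))
    tasks = map(chunks) do idx
        Threads.@spawn sum(view(M, idx))   # each task sums a private chunk
    end
    return sum(fetch.(tasks))              # combine the partial sums
end

threaded_sum(rand(0:1, 10^6))   # ≈ 500_000, like the atomic version, but with
                                # no contention on a single shared counter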
Sure. @ExpandingMan clearly stated in the original post that the idea was just to understand what was going on in his original code. My post didn’t focus at all on performance; it just tried to explain that the algorithm, as it was, would not work unless atomic operations were used.