I might have misunderstood something about the expected behavior of the bitwise AND or the @threads macro, but I have observed something I think is odd. Suppose we have a function that uses a for loop with @threads to perform the bitwise AND of two BitArrays element-wise. That should agree with the result of using a & b, right? The toy example below disagrees with this when using @threads.
function bitwise_and2(a::BitArray, b::BitArray)
    n = length(a)
    res = falses(n)
    Threads.@threads for i = 1:n
        res[i] = a[i] & b[i]
    end
    res
end
function trials(n, n_iter)
    for i = 1:n_iter
        a = bitrand(n)
        b = bitrand(n)
        y1 = bitwise_and2(a, b)
        y2 = a & b
        println(i)
        @assert y2 == y1
    end
end
trials(100, 1000) # throws AssertionError
Removing the @threads macro from the bitwise_and2() function seems to consistently give results that agree with a & b. Have I missed something obvious concerning @threads or the bitwise AND? I would appreciate any guidance on what is causing this behavior.
What is versioninfo()? I can’t reproduce this on macOS in either 0.5 or 0.6-dev.
Here is the versioninfo():
Julia Version 0.5.1
Commit 6445c82 (2017-03-05 13:25 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
I also tried this on a machine running Ubuntu 16.04 with Julia 0.5.0, with the same result.
Like @paulstey, I get AssertionErrors in this example. But I need to set JULIA_NUM_THREADS > 1 before starting Julia, otherwise Threads.nthreads() will be 1 and no errors happen. Interestingly, the errors happen more easily (i.e. in fewer iterations) if n is small. Happens on both 0.5.1 and the latest master (Ubuntu 16.10 here).
My guess is that this is because in a BitArray, different indices share the same storage location:
julia> sizeof(BitArray(256))
32
and maybe Threads doesn’t take this into account. I guess a bug should be filed.
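For illustration (this peeks at BitArray's internal chunks field, which is not a public API), 64 consecutive bits are packed into a single UInt64, so e.g. indices 1 through 64 all map to the same chunk:
julia> a = falses(256);
julia> length(a.chunks)   # 256 bits packed into four UInt64 chunks
4
julia> a[1] = true; a.chunks[1]   # indices 1..64 all live in chunks[1]
0x0000000000000001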
Short answer: this is not a bug. It’s undefined behavior (UB) caused by a race condition, since different threads are storing to the same memory location. Like many other race conditions, it may or may not appear to work depending on the setup.
Long answer, from the semantics POV: this is racy because elements of BitArrays, like any bit objects, are not memory locations. We use a memory model very similar to C11/C++11, so you can find the precise definitions of memory location and race condition here. Note that by that definition, a bit is not a memory location; scalar objects are. So when you are storing to the res BitArray, you are storing to the same memory location from multiple threads, causing a data race and therefore undefined behavior. OTOH, if you are operating on bytes or bigger objects, accessing different elements of such objects in an array is never racy and the result is always well defined (well, assuming you don’t have UB elsewhere). This is true even when the objects accessed by two threads are in the same cache line; that only affects performance, not behavior.
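As a concrete illustration of the byte-or-bigger case, here is a minimal race-free sketch of the same loop (the function name is just for illustration): every Bool in a Vector{Bool} occupies a full byte, so each index is its own memory location and the threaded writes are well defined.
function bitwise_and_bool(a::BitArray, b::BitArray)
    n = length(a)
    res = Vector{Bool}(n)        # Vector{Bool}(undef, n) on Julia 1.0+; each Bool is a whole byte
    Threads.@threads for i = 1:n
        res[i] = a[i] & b[i]     # a and b are only read; writes to res hit distinct bytes
    end
    convert(BitArray, res)       # pack back into a BitArray after the loop
end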
From the hardware POV, this cannot be implemented for data since such hardware instructions don’t exist on any of the hardware we support (or any general-purpose processors I’m aware of, FWIW…). A bit access on all current CPUs is always load+mask or load+mask+store, where the load and store are at least byte-sized. Do note that this doesn’t mean it cannot be implemented for synchronization objects, but such an implementation would be crazily expensive and isn’t suitable for data access.
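Spelled out in Julia over the underlying chunk (the helper name and code are illustrative, not Julia's actual internals), setting a single bit looks roughly like this:
# Roughly what res[i] = true lowers to: a read-modify-write of the whole chunk.
function set_bit!(chunks::Vector{UInt64}, i::Int)
    c, b = divrem(i - 1, 64)                 # which chunk, which bit within it
    old = chunks[c + 1]                      # load the full 64-bit chunk
    chunks[c + 1] = old | (UInt64(1) << b)   # mask and store it back
end
# If two threads run this on different bits of the same chunk, one store can
# overwrite the other's update; that is the data race.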
Also note that multithreaded element-wise operations can be implemented for BitArrays, but they have to operate on the underlying byte storage directly instead of on bits. Not sure if we expose a public interface for that.
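For what it’s worth, here is a sketch of what that could look like (it reaches into the internal chunks field, which, as said, is not a public interface): each thread writes whole UInt64 chunks at distinct indices, so there is no race.
function bitwise_and_chunks(a::BitArray, b::BitArray)
    res = falses(length(a))
    ca, cb, cr = a.chunks, b.chunks, res.chunks   # internal Vector{UInt64} storage
    Threads.@threads for i = 1:length(cr)
        cr[i] = ca[i] & cb[i]   # whole-chunk writes, one chunk per index
    end
    res
end
The unused trailing bits in the last chunk are zero in both inputs, so the AND leaves them zero, as a BitArray requires.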
@traktofon, thanks for confirming this, I appreciate it. And I should have mentioned that this requires the environment variable JULIA_NUM_THREADS to be set greater than 1.
@yuyichao, thank you very much for the detailed explanation. It did not occur to me that a race condition might be the cause of this; but your explanation makes perfect sense. Thanks also for the cppreference link, it’s very illuminating!