I have a simple function which, given a vector of floats x, returns a BitVector that has 1βs in places where the entry is nonzero. In other words, f(x) = (x .!=0.0). When I parallelize it, I get incorrect results:

using Random
x = rand([1.0,2.0,3.0],10)
function f(x, prl)
n = length(x)
y = falses(n)
if prl==0 for i=1:n y[i] = (x[i] !=0.0) end end
if prl==1 Threads.@threads for i=1:n y[i] = (x[i] !=0.0) end end
return y end
y0 = trues(length(x))
for k=1:100 if y0!=f(x,0) print("!") end end
#
for k=1:100 if y0!=f(x,1) print("!") end end
# !!!!!!!!

Why does this happen? What am I doing wrong? Each thread writes in a different place in y.

help?> ?BitArray
search: BitArray
BitArray{N} <: AbstractArray{Bool, N}
Space-efficient N-dimensional boolean array, using just one bit for each
boolean value.
BitArrays pack up to 64 values into every 8 bytes, resulting in an 8x space
efficiency over Array{Bool, N} and allowing some operations to work on 64
values at once.
By default, Julia returns BitArrays from broadcasting operations that
generate boolean elements (including dotted-comparisons like .==) as well
as from the functions trues and falses.
β Note
β
β Due to its packed storage format, concurrent access to the
β elements of a BitArray where at least one of them is a write is
β not thread-safe.

The issue is that a BitVector is storing multiple values per byte. More specifically, it stores 64 values per UInt64 (8 bytes). In order to write a single entry, it reads the entire UInt64, sets one of the bits to the value it wants, and then writes the entire UInt64 back to memory.

This opens up the potential for a data race. The reason this is not thread-safe is that multiple threads can read the same UInt64, modify it by changing a single bit, and then write it back. This process takes time. If more than one reads it before the changes can get written back, then whichever writes it last will overwrite the work of everyone else. A lock would avoid this, but also would cause significant slowdowns.

A Vector{Bool} uses 1 byte per entry (rather than 1/8-byte like BitVector) and the bytes are addressed individually (rather than in blocks of 8 via UInt64), so it should be safe from data races in your example. I will recommend you use that instead.

In general, locking data structures have mostly fallen out of favor in CS since they impose a substantial overhead of users that arenβt using multithreading, and they still they often arenβt great for performance compared to custom locking solutions.