Multi-threading

I’m using windows 10 and julia 1.0.3 and testing out the multi-threaded function.

I understand that when generating random numbers we would need a separate seed for each thread but ignoring that for the purpose of a very preliminary test, I created the following code. In the following im trying to assess the overheads of calling a multi-threaded for loop multiple times.

#set JULIA_NUM_THREADS=4

test = zeros(10000,1000);

@time for i in 1:10000
    Threads.@threads for j in 1:1000
        test[i,j] = rand(1)[1]
    end
end

When I ran the above code, I was returned a continuous stream of error messages, their counts likely correlated to the number of times the threaded loop was called. Sample of the errors are:

Error thrown in threaded loop on thread 0: InexactError(func=:check_top_bit, T=Int64, val=-8)
Error thrown in threaded loop on thread 2: InexactError(func=:check_top_bit, T=Int64, val=-8)
Error thrown in threaded loop on thread 1: InexactError(func=:check_top_bit, T=Int64, val=-8)
Error thrown in threaded loop on thread 0: InexactError(func=:check_top_bit, T=Int64, val=-8)
Error thrown in threaded loop on thread 3: InexactError(func=:check_top_bit, T=Int64, val=-8)
Error thrown in threaded loop on thread 1: InexactError(func=:check_top_bit, T=Int64, val=-8)
Error thrown in threaded loop on thread 2: InexactError(func=:check_top_bit, T=Int64, val=-8)
Error thrown in threaded loop on thread 3: InexactError(func=:check_top_bit, T=Int64, val=-8)

True enough, when i did a sum to check the results, sum(sum(test,dims=1),dims=2), my output was about 40081, way lesser than the expected value of 0.5 \times 10000 \times 1000.

Wondering what mistake I may have made…

Edit1: To think of it I suspect it’s sychronization issues, some threads may have finished and moved on to the next loop?

That’s precisely the problem. It’s not just that you need separate seeds — it’s that you need wholly distinct random number generators! Remember that calling rand doesn’t just return some value, it’s also mutating the state of your generator to move to the next value. When multiple threads try to mutate this state at the same time, you can rapidly run into inconsistencies.

6 Likes

You can’t. Calling the same random number generator from multiple thread is undefined behavior and anything can happen.

Practically, this at least mean that you can corrupt/have corrupted the random number generate (giving you the error) and you’ll not get the correct performance measure. You’ll get very bad performance even without any error.

3 Likes

Ah gosh, I see, this makes sense now. Thanks @mbauman and @yuyichao

I’d take a look at trandjump of you’d like solutions.

1 Like

Does this also apply to distributed computing across cores? If let’s say instead of Threads.@threads, I use @distributed?

No

2 Likes

Byt the way, and incidentally, when people write rand(1)[1], what they almost always want is just rand().

rand(1)[1] allocates a length-1 vector and then reads out the first number, while rand() just returns a random value.

3 Likes

Hi @DNF , @mbauman ,@Elrod,

is Threads.@threads equal to the command parfor in MATLAB?

No afaict MATLAB do not have user controlled thread.

Good question.
@yuyichao, while there maybe no race conditions issue with generating random numbers with @distributed, I guess its safe to say that the random numbers also dont run the risk of overlaps since no race condition issues should mean that there is some synchronization involved between workers?

I sometimes toy with the idea of calling it rand!. Strictly speaking, rand is a lie.

2 Likes

parfor

Execute for -loop iterations in parallel on workers

in Julia Threads.@threads + for == parfor in Matlab ???

This works:

test = zeros(10000,1000);

using Random


const generators = Dict{Int,Any}()
const mutex = Threads.Mutex()


function randomNumber()
   i = Threads.threadid()
   if !haskey(generators,i)
      lock(mutex)
      println("Creating random number generator for thread $i")
      generators[i] = MersenneTwister(rand(RandomDevice(),UInt8))
      unlock(mutex)
   end
   return rand(generators[i])
end


@time for i in 1:10000
    Threads.@threads for j in 1:1000
        test[i,j] = randomNumber()
    end
end

println(sum(test)/length(test))

prints

Creating random number generator for thread 1
Creating random number generator for thread 4
Creating random number generator for thread 2
Creating random number generator for thread 3
  1.071951 seconds (7.12 M allocations: 132.371 MiB, 17.82% gc time)
0.49999995453686974

However I don’t know if it can be improved by using a strongly typed dictionary instead of Dict{Int,Any}

1 Like

No. It is kind of like @parallel with some automatic data transfers. It requires separate processes, so much that using parfor on a cluster requires a new license for each core! That is very different than threading.

2 Likes

Workers and threads are not the same. You can google “threads vs processes” or something to get an explanation. Workers are the same as processes.

No, parfor spawns multiple processes.

Wouldn’t it be using Distributed; @distributed for these days?

A few changes, namely concretely typing the generators dictionary, hoisting generator initialization out of the loop, and putting stuff into a function to avoid the global scope:

using Random
using BenchmarkTools

const generators = Dict{Int,MersenneTwister}()

function dothread(gen)
    for i in 1:Threads.nthreads()
        if !haskey(generators,i)
            println("Creating random number generator for thread $i")
            gen[i] = MersenneTwister(rand(RandomDevice(),UInt8))
        end
    end
    test = zeros(10000,1000)
    Threads.@threads for j in 1:1000
        tidx = Threads.threadid()
        for i in 1:10000
            test[i,j] = rand(gen[tidx])
        end
    end
    return test
end

@btime dothread($generators)

test = dothread(generators)
println(sum(test)/length(test))
JULIA_NUM_THREADS=4 julia threaded-rand.jl 
Creating random number generator for thread 1
Creating random number generator for thread 2
Creating random number generator for thread 3
Creating random number generator for thread 4
  62.128 ms (32 allocations: 76.30 MiB)
0.49998218701926284