Vectorize and parallelize wsample function

Hey everyone,

I have a function that should, in theory, be fully parallelizable, since each iteration is independent of the previous one. In the example below I just fill a vector with categorically distributed random variables:

using Distributions

#Sampling Categorical data
output = zeros(Int64, 3)
prob = [ .4 .6 ; .5 .5 ; .9 .1]

for i in 1:size(prob, 1)
    output[i] = wsample(prob[i, :]) # works - and seems to be faster than the method below
    #output[i] = rand(Categorical(prob[i, :])) # works
end

Now I would like to vectorize either of the two methods above, but the usual broadcasting syntax does not seem to work here:

#vectorize via wsample
wsample.(prob) # NOT WORKING
map(x -> wsample(x), prob) # NOT WORKING

#Vectorize via Distributions
map(x -> rand(Categorical(x)), prob) # NOT WORKING
rand.(Categorical.(prob)) # NOT WORKING

Question 1: Does anyone know a way to use these methods in a vectorized way? I did not find any thread on this, so I guess I must be doing something wrong, since this case comes up in lots of algorithms. I know that vectorized code is not necessarily faster than a for loop in a compiled language, but I would at least like to benchmark it for my problem.

As a step further, I would also like to use the pmap function, but that obviously does not work with the code I provided either.

The first question to ask is: why do you want to “vectorize” the code in this way? Loops in Julia are fast, and if you have a working loop there is no need to change it into a map or broadcast call unless that helps you organize or structure your code more clearly.

Assuming that you do want to use map or broadcast, then you need to tell Julia to treat your prob not as a container of individual floats but as a collection of rows. For example, you could broadcast over eachrow(prob) rather than prob.
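A minimal sketch of that approach (assuming Julia ≥ 1.1, where eachrow is available):

```julia
using Distributions  # wsample is re-exported from StatsBase

prob = [0.4 0.6; 0.5 0.5; 0.9 0.1]

# Broadcast over the rows (as views) rather than over the individual floats:
output = wsample.(eachrow(prob))  # one index drawn per row
```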

However, take a look at the answer by @stevengj in Some doubts about types, which also applies in your situation. It is common practice in Matlab or numpy to express a collection of vectors as a matrix, but that’s neither ideal nor necessary in Julia. For example, if you were to transpose your prob data and treat it as a collection of SVectors from StaticArrays, then you could easily broadcast over it:

julia> prob = [SVector(.4, .5, .9), SVector(.6, .5, .1)]
2-element Array{SArray{Tuple{3},Float64,1,3},1}:
 [0.4, 0.5, 0.9]
 [0.6, 0.5, 0.1]

julia> sum.(prob)
2-element Array{Float64,1}:
 1.8               
 1.2000000000000002

There is no difference in memory layout between a matrix where each column is an element and a vector of SVectors representing those columns, but it is much easier to operate on each element if you actually represent your collection of vectors as a collection of vectors.


Just showing some examples of what @rdeits explained.

Make a vector of vectors

julia> prob = [[0.4, 0.6], [0.5, 0.5], [0.9, 0.1]]
3-element Array{Array{Float64,1},1}:
 [0.4, 0.6]
 [0.5, 0.5]
 [0.9, 0.1]

julia> wsample.(prob)
3-element Array{Int64,1}:
 2
 1
 1

# benchmark
julia> @btime wsample.($prob);
  105.158 ns (4 allocations: 208 bytes)

Using SVector:

julia> prob = [SVector(0.4, 0.6), SVector(0.5, 0.5), SVector(0.9, 0.1)]
3-element Array{SArray{Tuple{2},Float64,1,2},1}:
 [0.4, 0.6]
 [0.5, 0.5]
 [0.9, 0.1]

julia> wsample.(prob)
3-element Array{Int64,1}:
 2
 1
 1

# benchmark (it's slightly faster)
julia> @btime wsample.($prob)
  77.710 ns (4 allocations: 208 bytes)
3-element Array{Int64,1}:
 1
 2
 1

You can in principle parallelize this with threads, but there is a serious caveat: as far as I know, random number generation is currently not thread-safe, so the results are not reliable:

julia> s = zeros(3);

julia> Threads.@threads for i in eachindex(prob)
       s[i] = wsample(prob[i])
       end

julia> s
3-element Array{Float64,1}:
 2.0
 2.0
 1.0

# benchmark
julia> @btime Threads.@threads for i in eachindex($prob)
       $s[i] = wsample($prob[i])
       end
  472.513 ns (2 allocations: 78 bytes)

Yikes! With 4 threads, this is much slower than serial. The overhead appears to dwarf the advantages of parallelization. I think the overhead would be even much worse for pmap.
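For completeness, one common workaround for the thread-safety issue is to give each thread its own RNG and pass it explicitly (a sketch, relying on wsample's optional rng argument; the rngs vector and the seeds here are my own invention):

```julia
using Distributions, Random

prob = [[0.4, 0.6], [0.5, 0.5], [0.9, 0.1]]
s = zeros(Int, length(prob))

# One independent RNG per thread (seeds chosen arbitrarily)
rngs = [MersenneTwister(1000 + i) for i in 1:Threads.nthreads()]

Threads.@threads for i in eachindex(prob)
    # Each thread only ever touches its own RNG
    s[i] = wsample(rngs[Threads.threadid()], prob[i])
end
```

This avoids the shared global RNG, though given the benchmark above the threading overhead may still dominate at this problem size.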


Oh, also, I should mention that if your primary concern is performance, please take a look at https://docs.julialang.org/en/v1/manual/performance-tips/index.html , particularly the part about putting your computation into a function rather than relying on global variables, as your example code does.
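To illustrate, here is your original loop wrapped in a function, so everything it touches is local and the compiler can specialize on the argument types:

```julia
using Distributions

# Draw one categorical sample per row of a probability matrix
function sample_rows(prob)
    output = Vector{Int}(undef, size(prob, 1))
    for i in 1:size(prob, 1)
        output[i] = wsample(prob[i, :])
    end
    return output
end

sample_rows([0.4 0.6; 0.5 0.5; 0.9 0.1])
```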


Thank you for your answer!

In case both methods perform similarly, I wanted to use the vectorized method, as it is a one-liner and I have to do similar calculations in several steps.

I will have a look at the StaticArrays module as well, thank you for the suggestion!

Thanks also for answering the parallelization part - I will put that off for now and wait to see if RNG becomes thread-safe.