Nested multi-threading

I have N groups. For each group n I have J*K options for each agent i. For each agent i I want to find the maximum over (j,k), and do so for each group n; that is, for each n and each i, compute the maximum of v[j,k,n] + shock[j,k,i] over (j,k). The problems are all separate from each other.

for n = 1:N
    Threads.@threads for j in 1:J
        for k in j+1:K
            @inbounds x[j,k,:] = v[j,k,n] + shock[j,k,:]
        end
    end
    maxJK = (x.==maximum(x,dims=[1,2]))
end

Then maxJK is used to fill some matrices, also indexed by n.

Here the third dimension of shock is the index of the agent i, and the number of agents is large (about 10000). If I multithread the outer loop over n, so that the max is computed inside the threaded loop, I think the code breaks because multiple threads are using maxJK. Regardless, this code seems very slow, and I wonder how it could be made faster.

I think that the line @inbounds x[j,k,:] = v[j,k,n] + shock[j,k,:] is allocating when it needn’t be. I think x[j,k,:] .= shock[j,k,:] .+ v[j,k,n] should be faster (or at least work, I could not get your line to work in my REPL) and you might benefit by moving @inbounds to the top level loop.
Likewise, maxJK = (x.==maximum(x,dims=[1,2])) could be replaced by maxJK = argmax(x, dims=[1,2]), yielding a (compact) array of CartesianIndex, whereas your line returns a large BitArray in my REPL. It would also help others understand you if you expanded your example into something runnable for the reader, with reasonable sizes.
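Putting these two suggestions together, a minimal sketch of what that part could look like (assuming x, v, shock, n, J, K as in the question; untested against the real problem):

Threads.@threads for j in 1:J
    for k in j+1:K
        @inbounds x[j,k,:] .= shock[j,k,:] .+ v[j,k,n]   # in-place broadcast assignment (note .= and .+)
    end
end
maxJK = argmax(x, dims=(1,2))   # compact array of CartesianIndex rather than a large BitArray mask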

Can you provide a small working example? You can use views to remove allocations, but the way that line is written it does not seem to make sense: shock[j,k,:] is a slice (an array), and one would imagine that v[j,k,n] is a scalar, and you cannot sum a scalar with an array (you would need to broadcast that with .+).
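As a tiny illustration of the broadcasting point (made-up values, not the arrays from the question):

a = rand(3)    # an array, like shock[j,k,:]
s = 0.5        # a scalar, like v[j,k,n]
# s + a        # errors: no method matching +(::Float64, ::Vector{Float64})
s .+ a         # broadcasting adds the scalar to every element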

The code there actually does not mess up because of multi-threading, but it is wrong anyway: maxJK, as written, is a new array created in the scope of each loop iteration. So either you are overwriting something that was defined outside the loop (in which case you need one copy of maxJK per thread, reduced afterwards), or the result will not be available in the outer scope at all.
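A tiny sketch of the second case (the result not being visible outside the loop), inside a function:

function f()
    for n in 1:3
        maxJK = n^2   # a new local binding created in the loop's own scope
    end
    return maxJK      # throws UndefVarError: maxJK is not defined in this outer scope
end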

But again, having a small working example with a set of input variables and the expected output (a serial version at least) is what will allow everyone to help you better.

Sorry, here it is. The contents of the vectors are just examples. The question is about the for loops.

N = 200
J = 300
K = 100
L = 10000
superset = Vector(undef,J)
for j = 1:J
    superset[j] = rand(1:K,j)
end
x = zeros(J,K,L)
v = rand(J,K,N)
shock = rand(J,K,L)

for n = 1:N
    Threads.@threads for j in 1:J
        for k in superset[j]
            @inbounds x[j,k,:] .= v[j,k,n] .+ shock[j,k,:]
        end
    end
    maxJK = (x.==maximum(x,dims=[1,2]))
end

Some important things:

1. superset = Vector(undef,J) is a vector of type Any, which is very bad. Use:

   superset = Vector{Vector{Int}}(undef,J)

2. shock[j,k,:] creates a new array (a copy of the slice). Use @view(shock[j,k,:]) instead (see the small check below this list).

3. Not sure if the MWE is representative of what you are doing, but all of that is in global scope. So you should put the code in a function.
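A quick, illustrative check of points 1 and 2 (made-up sizes, not the original arrays):

superset_any = Vector(undef, 3)                  # eltype(superset_any) == Any
superset_typed = Vector{Vector{Int}}(undef, 3)   # eltype(superset_typed) == Vector{Int}

shock = rand(3, 4, 5)
slice_copy = shock[1, 1, :]                      # allocates a new 5-element Vector
slice_view = @view(shock[1, 1, :])               # a SubArray sharing memory with shock, no copy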

I have removed the maxJK part and the multi-threading here, to start with. With those changes I go from:

 julia> @btime loops($N, $J, $K, $L, $superset, $x, $v, $shock)
  221.230 ms (204600 allocations: 76.63 MiB)

to

julia> @btime loops($N, $J, $K, $L, $superset, $x, $v, $shock)
  220.491 ms (0 allocations: 0 bytes)

So, good, the code is non-allocating (side note: did something change in Julia 1.7.0 such that these allocations no longer penalize performance as much as before? This is the second example of that pattern I have seen).

Probably start with that:

N = 20
J = 30
K = 100
L = 1000
superset = Vector{Vector{Int}}(undef,J)
for j = 1:J
    superset[j] = rand(1:K,j)
end
x = zeros(J,K,L)
v = rand(J,K,N)
shock = rand(J,K,L)

function loops(N, J, K, L, superset, x, v, shock)
    for n = 1:N
        for j in 1:J
            for k in superset[j]
                @inbounds x[j,k,:] .= v[j,k,n] .+ @view(shock[j,k,:])
            end
        end
        #maxJK = (x.==maximum(x,dims=[1,2]))
    end
    #return maxJK
end

@btime loops($N, $J, $K, $L, $superset, $x, $v, $shock)

From there, some more things: Julia arrays are column-major, so you should run over the first index in the innermost loop; thus you should probably exchange the j and k indices (a toy illustration follows below).
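A generic toy illustration of the column-major point (not the code from this thread):

using BenchmarkTools

A = zeros(1000, 1000)

function fill_by_columns!(A)   # fast: the first index varies fastest (stride-1 access)
    for k in axes(A, 2), j in axes(A, 1)
        A[j, k] = j + k
    end
end

function fill_by_rows!(A)      # slow: the first index varies slowest (strided access)
    for j in axes(A, 1), k in axes(A, 2)
        A[j, k] = j + k
    end
end

@btime fill_by_columns!($A)
@btime fill_by_rows!($A)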

Finally you need to split maxJK for each thread and reduce the result later.
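One possible way to avoid the shared state when threading over n, a slight variation on the suggestion above (a sketch only; the per-thread work buffers xbufs and the element type of maxJK are my assumptions, and the buffers are modest at the reduced sizes above):

xbufs = [zeros(J, K, L) for _ in 1:Threads.nthreads()]   # one private work buffer per thread
maxJK = Vector{Array{CartesianIndex{3},3}}(undef, N)     # one result slot per n
Threads.@threads :static for n in 1:N
    xt = xbufs[Threads.threadid()]   # :static scheduling keeps threadid stable within an iteration
    for j in 1:J
        for k in superset[j]
            @inbounds xt[j,k,:] .= v[j,k,n] .+ @view(shock[j,k,:])
        end
    end
    maxJK[n] = argmax(xt, dims=(1,2))
end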


Ok, I added the type definition. But I think I should just show the full model to get the best advice. Here it is.

The logic performs a nested maximization. For any n, find the best s; the possible choices of s given n are the elements of superset2, and the choice of s is subject to a choice-specific shock (shock1). Then x is the value of n at the best choice s, plus a second shock (shock2). The second block performs the choice of n. In both cases the optimal choices are computed for the groups T1 and T2. I wonder whether this code can be improved (I am sure it can).

I am mainly wondering about the level at which multi-threading is better introduced but probably there are other performance tricks.

T1 = 20
T2 = 10
S = 100
N = 300
L = 20000
superset1 = Vector{Array{Float64,2}}(undef,N)
superset2 = Vector{Vector{Int}}(undef,N)
for n = 1:N
    superset1[n] = zeros(10,L)
    superset2[n] = rand(1:S,10)
end
Tmat = ones(T2,T1)
par = rand(T2)
x = zeros(T2,T1,N,L)
v = rand(S,N)
u = rand(T1,N)
shock1 = rand(S,L)
shock2 = rand(N,L)
maxS = Array{Matrix{CartesianIndex{2}},2}(undef,T2,N)
final_mat = zeros(T2,T1,S,N)

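# First block: for each n (threaded) and each t2, evaluate the candidate choices s in superset2[n]
# with their shock1 draws, record the best s per draw l in maxS, and fill x with u plus that
# maximized value plus shock2.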
@time Threads.@threads for n = 1:N 
    for t2=1:T2
        @inbounds superset1[n] .= @views( par[t2]*v[superset2[n],n] .+ shock1[superset2[n],:] )
        @inbounds maxS[t2,n]    = argmax(superset1[n],dims=1)
        @inbounds x[t2,:,n,:]  .= @views( u[:,n] .+ superset1[n][maxS[t2,n]] .+ shock2[n,:]'  ) # η*α[n]
    end 
end
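# Second block: for each (t2,t1) and each draw l, pick the best n from x and accumulate the weighted
# (s, n) choice frequencies into final_mat.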
@time for t2=1:T2 for t1=1:T1
        indic = Tmat[t2,t1]/L
        Threads.@threads for l in 1:L
            maxN = argmax(@views(x[t2,t1,:,l])) # 
            sInd = superset2[maxN][maxS[t2,maxN][l][1]]
            @inbounds final_mat[t2,t1,sInd,maxN] += indic
        end
end end

The third point I mention above (putting the code in a function) is the first and most important.

See Performance Tips · The Julia Language

Yes, I did that. It still feels like I am allocating “too much”.

You did not put your code into a function. This is really the first thing to do. And you still use global variables that are not defined as const.

I did, but I did not report it here. I am not saying that putting the code into a function does not help, I just wonder what else can speed it up.

Another (important) issue here is that final_mat is shared by all threads, and the accumulation in the last line is done incorrectly under multi-threading. Meaning, the value of final_mat is different with vs. without multi-threading.

That’s basically impossible. Then something else is wrong. Anyway, step one is wrapping code in a function, step two is figuring out what’s going wrong with that.

Let me share the code wrapped into a function:


using Random

function myfun()

    rng = MersenneTwister(1234)

    T1 = 20
    T2 = 10
    S = 100
    N = 300
    L = 20000
    superset1 = Vector{Array{Float64,2}}(undef,N)
    superset2 = Vector{Vector{Int}}(undef,N)
    for n = 1:N
        superset1[n] = zeros(10,L)
        superset2[n] = rand(1:S,10)
    end
    Tmat = ones(T2,T1)
    par = rand(T2)
    x = zeros(T2,T1,N,L)
    v = rand(S,N)
    u = rand(T1,N)
    shock1 = rand(S,L)
    shock2 = rand(N,L)
    maxS = Array{Matrix{CartesianIndex{2}},2}(undef,T2,N)
    final_mat = zeros(T2,T1,S,N)

    @time Threads.@threads for n = 1:N 
        for t2=1:T2
            @inbounds superset1[n] .= @views( par[t2]*v[superset2[n],n] .+ shock1[superset2[n],:] )
            @inbounds maxS[t2,n]    = argmax(superset1[n],dims=1)
            @inbounds x[t2,:,n,:]  .= @views( u[:,n] .+ superset1[n][maxS[t2,n]] .+ shock2[n,:]'  ) # η*α[n]
        end 
    end
    @time for t2=1:T2 for t1=1:T1
            indic = Tmat[t2,t1]/L
            for l in 1:L #Threads.@threads 
                maxN = argmax(@views(x[t2,t1,:,l])) # 
                sInd = superset2[maxN][maxS[t2,maxN][l][1]]
                @inbounds final_mat[t2,t1,sInd,maxN] += indic
            end
    end end

    return sum(final_mat)
end

result = myfun()

I get result = 199.99999999999991

If instead I place Threads.@threads before "for l in 1:L" I get result = 199.98474999999985.

There is a small but far from negligible difference.

Thanks for sharing! :slight_smile:

Is there a way to use multi-threading in this context? Or perhaps to change the code so that it does not produce incorrect results but also does not take too long?
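For reference, one way to apply the "one copy per thread, reduce later" pattern mentioned earlier to final_mat could look like this (a sketch only: the buffer vector bufs is a name I made up, and :static scheduling is used so that Threads.threadid() is stable within each iteration):

bufs = [zeros(T2, T1, S, N) for _ in 1:Threads.nthreads()]   # one private accumulator per thread
for t2 in 1:T2, t1 in 1:T1
    indic = Tmat[t2,t1]/L
    Threads.@threads :static for l in 1:L
        buf = bufs[Threads.threadid()]
        maxN = argmax(@view(x[t2,t1,:,l]))
        sInd = superset2[maxN][maxS[t2,maxN][l][1]]
        @inbounds buf[t2,t1,sInd,maxN] += indic
    end
end
final_mat .= 0
for b in bufs
    final_mat .+= b   # reduce the per-thread copies into final_mat
end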