Performance Tips for a Combinatorial Problem

Hello! I have a combinatorial discrete choice problem that I solve by Backward Induction. The problem is intensive in both memory and computation time, and I would be super happy to improve it on either of these two dimensions. Right now, the problem runs out of memory when K > 12. I'm posting an MWE below:

using BenchmarkTools
using ProgressMeter
using BitOperations
using Distributions

J = 1000
T = 20
Π = 100*rand(J,T)
Dᵢ = rand([0],J,1)
feat1 = rand([1,2,3,4,5,6,7,8],J,1)
feat2 = rand([1,2,3,4,5,6,7,8,9,10],J,1)
Φ1 = 10
Φ2 = 10
Sj = 20*rand(J,T)
Fj = 20*rand(J,T)

Π=Π.-Sj.-Fj

K = 10
Ind =([1,4,7,9,10,12,J-10,J-6,J-5,J-2])

# Helper used by shocks2ind: index of the first row of B that equals the row vector a
matchrow(a,B) = findfirst(i->all(j->a[j] == B[i,j],1:size(B,2)),1:size(B,1))

# This is the Backward Induction Algorithm with interdependent choices

function myfun1(Ind,K,D::Array{Int64},Φ1,Φ2,feat1,feat2) 
    X1c = zeros(Int8,size(D,1))
    X2c = zeros(Int8,size(D,1))
    X1l = zeros(Int8,size(D,1))
    X2l = zeros(Int8,size(D,1))
    countc = zeros(Float64,size(D,1))
    countl = zeros(Float64,size(D,1))
    feat1=feat1[Ind]
    feat2=feat2[Ind]
    for j=1:K
        for l=1:K
            X1c[j] = X1c[j] + (feat1[j]==feat1[l])*D[l]
            X1l[j] = X1l[j] + (feat2[j]==feat2[l])*D[l]
            countc[j] = countc[j] + (feat1[j]==feat1[l])
            countl[j] = countl[j] + (feat2[j]==feat2[l])
        end
    end
    X1c = Φ1.*(X1c.>0)./countc
    X1l = Φ2.*(X1l.>0)./countl
    fcc = sum(X1c)
    fcl = sum(X1l)
    return fcc, fcl
end


function pshockprob(sequences,prob)
    (KK,K)=size(sequences)
    proba = zeros(KK)
    for i=1:KK
        proba[i]=prob^(sum(sequences[i,:]))*(1-prob)^(K-sum(sequences[i,:]))
    end
    return proba
end


function shocks2ind(sequences,shockpath)
    (K,T)=size(shockpath)
    inds = zeros(T)
    shockpath[shockpath.==2] .= 0
    for i=1:T
        inds[i]=matchrow(shockpath[:,i]',sequences)
    end
    return convert(Array{Int64}, inds)
end


function myfun7(Ind,Π,Φ1,Φ2,Sj,prob,SeqShocks,Dᵢ,feat1,feat2,K,J,T)
    β = 0.9
    Π = Π[Ind,:]
    Sj = Sj[Ind,:]
    SeqShocks = SeqShocks[Ind,:]
    D = zeros(Int64,2^K,2^K,K,T)
    Indices =  ones(Int64,2^K,2^K,T)
    Dpast =  ones(Int64,T+1)
    v = zeros(Float64,2^K,2^K,T)
    Dstar = zeros(Int64,K,T+1)
    Dstar[:,1]=Dᵢ[Ind]
    Πstar=0
    ChoiceSet = zeros(Int64,2^K,K)
    ShockSet = zeros(Int64,2^K,K)
    GroupCosts = zeros(Float64,2^K,2^K,2)
    let int_s = fill(false,K) #pre-allocate output
        for j=0:2^K-1
            @. int_s = bget(j,0:K-1)
            ChoiceSet[j+1,:]=int_s
            ShockSet[j+1,:]=int_s
        end
    end
    for j=1:2^K, jj=1:2^K
        (GroupCosts[j,jj,1],GroupCosts[j,jj,2])=myfun1(Ind,K,ChoiceSet[j,:].*ShockSet[jj,:],Φ1,Φ2,feat1,feat2)
    end
    pstatevec=pshockprob(ShockSet,prob)
    indshocks=shocks2ind(ShockSet,SeqShocks)
    #println(indshocks)
    #println(ShockSet)
    #println(pstatevec)
    #println(GroupCosts)
    vc = zeros(Float64, 2^K)
    for t=0:T-1
        for  i=1:2^K, iii=1:2^K
            #Basically what is wrong here is that I am still allowing people to enter
            fill!(vc, 0.0)
            for ii=1:2^K
                    for j=1:K
                        vc[ii] += ChoiceSet[ii,j]*(Π[j,T-t]+Sj[j,T-t]*ChoiceSet[i,j]-(1-ShockSet[iii,j])*999999999999999999999999999)
                    end
                    vc[ii] -= GroupCosts[ii,iii,1] + GroupCosts[ii,iii,2]
                    if t>0
                        vc[ii] += β*v[ii,:,T-t+1]'*pstatevec
                    end
            end
            #println(vc)
            tempind = argmax(vc)
            Indices[i,iii,T-t] = tempind
            v[i,iii,T-t]=vc[tempind]
            D[i,iii,:,T-t] = ChoiceSet[tempind,:]
            end
    end
    for t=2:T+1
        Dpast[t] = Indices[Dpast[t-1],indshocks[t-1],t-1]
        Dstar[:,t]=D[Dpast[t],indshocks[t-1],:,t-1]
    end
    Dstar = Dstar[:,2:end]
    return Dstar
end

println("TIMES")
p = 0.3
SeqShocks = (rand(J,T+1).>p).+1

BIC = @btime myfun7(Ind,Π,Φ1,Φ2,Sj,p,SeqShocks,Dᵢ,feat1,feat2,K,J,T)
println(sum(BIC))

Thanks so much in advance!

I don’t quite understand the problem yet; some description of what you wish to achieve would be helpful. But here are a few tips so far:

  1. You can do myfun1 without the allocations:
function newfun1(Ind, K, D::Array{Int64}, Φ1, Φ2,
                 feat1, feat2)
    fcc, fcl = 0.0, 0.0

    for j in Ind
        Tempc, Templ = 0.0, 0.0
        Ctc, Ctl = 0, 0
        for (ld, l) in enumerate(Ind)
            Eq1 = feat1[j] == feat1[l]
            Eq2 = feat2[j] == feat2[l]
            Ctc += Eq1
            Ctl += Eq2
            Tempc += Eq1*D[ld]    # D has length K, so index it by position within Ind
            Templ += Eq2*D[ld]
        end
        fcc += (Tempc > 0)/Ctc    # indicator, matching (X1c .> 0)./countc in myfun1
        fcl += (Templ > 0)/Ctl
    end
    fcc*Φ1, fcl*Φ2
end
  2. You are calling myfun1 many times with the same arguments, so you can basically “memoize” the results in a variety of ways; a sketch of one option is below.
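
For example, a minimal memoization sketch, assuming the remaining arguments (Ind, K, Φ1, Φ2, feat1, feat2) stay fixed across calls inside myfun7; FCACHE and newfun1_memo are illustrative names, not part of the code above:

# Cache results keyed on the active-choice vector D; everything else is constant here.
const FCACHE = Dict{Vector{Int64},Tuple{Float64,Float64}}()

function newfun1_memo(Ind, K, D::Vector{Int64}, Φ1, Φ2, feat1, feat2)
    get!(FCACHE, copy(D)) do          # copy(D) so later mutation of D can't corrupt the key
        newfun1(Ind, K, D, Φ1, Φ2, feat1, feat2)
    end
end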

Hey! Thanks for the reply. Essentially, my problem is one in which I need to solve for the optimal choices of an entity using Backward Induction. The problem becomes combinatorial because choices are interdependent so the entity is choosing over vectors of size K, and the number of potential choices corresponds to the Power set 2^K.

The main problem I have right now is that for large K (K>10), when I do the allocations at the beginning of myfun7 I have a memory problem.
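
For what it's worth, a rough back-of-the-envelope check of the two largest preallocations in myfun7, using the array shapes from the MWE at K = 13 (the first size that fails), shows why:

K, T = 13, 20
bytes_D = 2^K * 2^K * K * T * sizeof(Int64)     # D = zeros(Int64, 2^K, 2^K, K, T)
bytes_v = 2^K * 2^K * T * sizeof(Float64)       # v = zeros(Float64, 2^K, 2^K, T)
(bytes_D + bytes_v) / 2^30                      # ≈ 140 GiB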

You’re not doing all allocations at the beginning, because lines like this

Dstar[:,t]=D[Dpast[t],indshocks[t-1],:,t-1]

allocate on the right-hand side. Slices allocate by default in Julia. You can avoid allocation from slices by putting @views at the beginning, e.g. @views function myfun7(...)
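
A minimal illustration of the difference (f_copy and f_view are just illustrative names):

A = rand(1000, 1000)
f_copy(A) = sum(A[:, 1])            # A[:, 1] allocates a fresh Vector on every call
@views f_view(A) = sum(A[:, 1])     # A[:, 1] becomes a non-allocating SubArray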


This is indeed very helpful advice. For K = 7, it reduces memory allocations by a quarter and computation time by half. So thanks a lot!!

The code now seems fine on the memory front, but it is still a bit slow because of the huge size of the loops.

I am posting here the updated version of the code:

using BenchmarkTools
using ProgressMeter
using BitOperations
using Distributions

J = 1000
T = 20
Π = 100*rand(J,T)
Dᵢ = rand([0],J,1)
feat1 = rand([1,2,3,4,5,6,7,8],J,1)
feat2 = rand([1,2,3,4,5,6,7,8,9,10],J,1)
Φ1 = 10
Φ2 = 10
Sj = 20*rand(J,T)
Fj = 20*rand(J,T)

Π=Π.-Sj.-Fj

K = 7
Ind =([1,4,7,10,12,13,14])

#@everywhere K = 2
#@everywhere Ind =([1,2])


# Index of the first row of B that equals the row vector a
matchrow(a,B) = findfirst(i->all(j->a[j] == B[i,j],1:size(B,2)),1:size(B,1))

# This is the Backward Induction Algorithm with interdependent choices

@views function myfun1(Ind,K,D::Array{Int64},Φ1,Φ2,feat1,feat2) #Think about how to make this faster
    X1c = zeros(Int8,size(D,1))
    X2c = zeros(Int8,size(D,1))
    X1l = zeros(Int8,size(D,1))
    X2l = zeros(Int8,size(D,1))
    countc = zeros(Float64,size(D,1))
    countl = zeros(Float64,size(D,1))
    feat1=feat1[Ind]
    feat2=feat2[Ind]
    for j=1:K
        for l=1:K
            X1c[j] = X1c[j] + (feat1[j]==feat1[l])*D[l]
            X1l[j] = X1l[j] + (feat2[j]==feat2[l])*D[l]
            countc[j] = countc[j] + (feat1[j]==feat1[l])
            countl[j] = countl[j] + (feat2[j]==feat2[l])
        end
    end
    X1c = Φ1.*(X1c.>0)./countc
    X1l = Φ2.*(X1l.>0)./countl
    fcc = sum(X1c)
    fcl = sum(X1l)
    return fcc, fcl
end

@views function pshockprob(sequences,prob)
    (KK,K)=size(sequences)
    proba = zeros(KK)
    for i=1:KK
        proba[i]=prob^(sum(sequences[i,:]))*(1-prob)^(K-sum(sequences[i,:]))
    end
    return proba
end


@views function shocks2ind(sequences,shockpath)
    (K,T)=size(shockpath)
    inds = zeros(T)
    shockpath[shockpath.==2] .= 0
    for i=1:T
        inds[i]=matchrow(shockpath[:,i]',sequences)
    end
    return convert(Array{Int64}, inds)
end


@views function myfun7(Ind,Π,Φ1,Φ2,Sj,prob,SeqShocks,Dᵢ,feat1,feat2,K,J,T)
    β = 0.9
    Π = Π[Ind,:]
    Sj = Sj[Ind,:]
    SeqShocks = SeqShocks[Ind,:]
    Indices =  ones(Int64,2^K,2^K,T)
    Dpast =  ones(Int64,T+1)
    v = Array{Float64}(undef,2^K,2^K,T)
    Dstar = zeros(Int64,K,T+1)
    Dstar[:,1]=Dᵢ[Ind]
    Πstar=0
    ChoiceSet = Array{Union{Nothing, Int}}(nothing,2^K,K)
    ShockSet = Array{Union{Nothing, Int}}(nothing,2^K,K)
    GroupCosts = Array{Float64}(undef,2^K,2^K,2)
    let int_s = fill(false,K) #pre-allocate output
        for j=0:2^K-1
            @. int_s = bget(j,0:K-1)
            ChoiceSet[j+1,:]=int_s
            ShockSet[j+1,:]=int_s
        end
    end
    for j=1:2^K, jj=1:2^K
        GroupCosts[j,jj,:].=myfun1(Ind,K,ChoiceSet[j,:].*ShockSet[jj,:],Φ1,Φ2,feat1,feat2)
    end
    pstatevec=pshockprob(ShockSet,prob)
    indshocks=shocks2ind(ShockSet,SeqShocks)
    #println(indshocks)
    #println(ShockSet)
    #println(pstatevec)
    #println(GroupCosts)
    vc = zeros(Float64, 2^K)
    for t=0:T-1
        for  i=1:2^K, iii=1:2^K
            fill!(vc, 0.0)
            for ii=1:2^K
                    for j=1:K
                        vc[ii] += ChoiceSet[ii,j]*(Π[j,T-t]+Sj[j,T-t]*ChoiceSet[i,j]-(1-ShockSet[iii,j])*999999999999999999999999999)
                    end
                    vc[ii] -= GroupCosts[ii,iii,1] + GroupCosts[ii,iii,2]
                    if t>0
                        vc[ii] += β*v[ii,:,T-t+1]'*pstatevec
                    end
            end
            #println(vc)
            tempind = argmax(vc)
            Indices[i,iii,T-t] = tempind
            v[i,iii,T-t]=vc[tempind]
            end
    end
    for t=2:T+1
        Dpast[t] = Indices[Dpast[t-1],indshocks[t-1],t-1]
        Dstar[:,t]= ChoiceSet[Dpast[t],:]
    end
    Dstar = Dstar[:,2:end]
    return Dstar
end


println("TIMES")
p = 0.3
SeqShocks = (rand(J,T+1).>p).+1


BIC = @btime myfun7(Ind,Π,Φ1,Φ2,Sj,p,SeqShocks,Dᵢ,feat1,feat2,K,J,T)

The way you are handling the data in v is also killing your performance. You only need to keep v from the previous iteration over t rather than storing the values for every t. You are also executing the following line inside the loop over i and iii even though it doesn't depend on them, and it allocates:
vc[ii] += β*v[ii,:,T-t+1]'*pstatevec
This also does a bunch of vector-vector multiplications. Since you are eventually going to use all of the vectors in v you can do a single matrix-vector multiplication instead.

I would make that loop look more like this:

for t=0:T-1
        for  i=1:2^K, iii=1:2^K
            fill!(vc, 0.0)
            for ii=1:2^K
                    for j=1:K
                        vc[ii] += ChoiceSet[ii,j]*(Π[j,T-t]+Sj[j,T-t]*ChoiceSet[i,j]-(1-ShockSet[iii,j])*999999999999999999999999999)
                    end
                    vc[ii] -= GroupCosts[ii,iii,1] + GroupCosts[ii,iii,2]
                    if t>0
                        vc[ii] += v_previous[ii]
                    end
            end
            #println(vc)
            tempind = argmax(vc)
            Indices[i,iii,T-t] = tempind
            v[i,iii]=vc[tempind]
        end
        mul!(v_previous, v, pstatevec)
        v_previous .*= β
    end

Note that I dropped the third dimension from v and preallocated v_previous as a vector.
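
For completeness, the corresponding preallocations before the time loop might look like this (a sketch; mul! comes from the LinearAlgebra standard library):

using LinearAlgebra                   # for mul!
v = zeros(Float64, 2^K, 2^K)          # v no longer carries a time dimension
v_previous = zeros(Float64, 2^K)      # discounted expected continuation value from the previous t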


These lines in myfun1 are suspicious:

    X1c = zeros(Int8,size(D,1))
    X2c = zeros(Int8,size(D,1))
    X1l = zeros(Int8,size(D,1))
    X2l = zeros(Int8,size(D,1))
...
    X1c = Φ1.*(X1c.>0)./countc
    X1l = Φ2.*(X1l.>0)./countl

The last two lines compute arrays of floats on the right-hand side, so X1c and X1l switch from integer to float arrays. Hence, bye-bye type stability.
Maybe rewrite the following sums as

    fcc = sum(Φ1 * (x > 0) / c for (x, c) in zip(X1c, countc))
    fcl = sum(Φ2 * (x > 0) / c for (x, c) in zip(X1l, countl))

In shocks2ind: shockpath[shockpath.==2] .= 0 allocates an array for shockpath.==2. Better:

replace!(shockpath, 2=>0)

It also helps to explicitly hoist some of the computations out of the inner loops, using a buffer temp preallocated as a length-K vector. For example:

for  i=1:2^K, iii=1:2^K
    for j=1:K
        temp[j] = (Π[j,T-t]+Sj[j,T-t]*ChoiceSet[i,j]-(1-ShockSet[iii,j])*999999999999999999999999999)
    end
    fill!(vc, 0.0)
    for ii=1:2^K
            for j=1:K
                vc[ii] += ChoiceSet[ii,j]*temp[j]
            end

Also remember that @inbounds can speed up some of your loops.
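
For instance, on the innermost accumulation (only once you are confident the indices are in bounds):

@inbounds for ii=1:2^K, j=1:K
    vc[ii] += ChoiceSet[ii,j]*temp[j]
end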


Hey! Thanks so much. This greatly improved the code's performance.

Thanks so much for the heads up.

What is the potential drawback of using @views?

This is not important for performance, but this

Dᵢ = rand([0],J,1)
feat1 = rand([1,2,3,4,5,6,7,8],J,1)

should really be

Dᵢ = zeros(Int, J) 
feat1 = rand(1:8, J)

Especially using rand to create a vector of zeros is pretty non-standard.

Views of non-contiguous memory can be slower than copying if they are used a lot. Also, in previous versions (0.6 and earlier) the performance of views was significantly worse, which is the main reason copying is the default.
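
As a rough illustration of the first point (a sketch; exact timings depend on the machine):

using BenchmarkTools
A = rand(1000, 1000)
row_copy = A[500, :]           # contiguous copy of a non-contiguous row
row_view = @view A[500, :]     # strided SubArray, no copy up front
@btime sum($row_copy)          # contiguous access
@btime sum($row_view)          # strided access; can be slower if reused many times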


The explicit loop is twice as fast on my machine …

sval = 0.0
for i in eachindex(countc)
       sval += Φ1 * (X1c[i] > 0) / countc[i]
end

I took another look at this, and it occurred to me that the core thing eating up a lot of time in your main loop is basically a matrix multiplication being done with a naive nested loop. If you rearrange it to call an optimized matrix multiplication routine (mul! from LinearAlgebra) instead, you can get a big speedup.

# assumes using LinearAlgebra (for mul!) and GroupCosts stored as a 2^K-by-2^K matrix (see the note below)
v_previous = zeros(Float64,2^K)
temp = zeros(Float64, K, 2^K)
temp1 = similar(GroupCosts)
temp2 = similar(temp1)
Choice = Float64.(ChoiceSet)
@inbounds for t=0:T-1
    temp1 .= v_previous.-GroupCosts
    for  i=1:2^K
        for iii=1:2^K
            for j=1:K
                temp[j,iii] = (Π[j,T-t]+Sj[j,T-t]*ChoiceSet[i,j]-(1-ShockSet[iii,j])*999999999999999999999999999)
            end
        end
        mul!(temp2, Choice, temp)
        for iii=1:2^K
            best = -Inf
            tempind = 1
            for ii=1:2^K
                vc = temp2[ii,iii]+temp1[ii,iii]
                if vc > best
                    best = vc
                    tempind = ii
                end
            end
            Indices[i,iii,T-t] = tempind
            v[i,iii]=best
        end
    end
    mul!(v_previous, v, pstatevec)
    v_previous .*= β
end

I also got rid of argmax and did that explicitly in the loop, because it gave a meaningful speedup.
Edit: oops, I forgot to mention that I made GroupCosts a 2D array holding the sum of the two values you had stored along the 3rd dimension, since you never use them independently; a sketch of that change is below.
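
That change might look like this, replacing the original GroupCosts construction (a sketch based on the loop in myfun7):

GroupCosts = [sum(myfun1(Ind, K, ChoiceSet[j,:] .* ShockSet[jj,:], Φ1, Φ2, feat1, feat2))
              for j in 1:2^K, jj in 1:2^K]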


Hello!

Do you think there could be potential gains from exploiting the sparsity of these arrays somewhere in this code, given that the outcomes are going to be mostly zeros?

@bcmichael

The only way to know for sure is to try it and see what the timings look like, but my initial guess is probably not. The great majority of the time seems to be spent in the matrix multiplication and finding the maximum values in each column of the result of the matrix multiplication. Your matrices really aren’t that sparse, so the multiplication would probably get slower if you tried to use a sparse matrix representation. The result of the matrix multiplication isn’t sparse basically at all, so a sparse representation can’t really help make finding the maximums faster.
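
If you do want to check, a quick timing comparison along these lines would settle it (using Choice, temp and temp2 from the snippet above; SparseArrays is only needed for the test):

using SparseArrays, LinearAlgebra, BenchmarkTools
Choice_sp = sparse(Choice)
@btime mul!($temp2, $Choice, $temp)       # dense multiply
@btime mul!($temp2, $Choice_sp, $temp)    # sparse multiply; likely slower at this density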

The only thing I can really think of to help speed things up is trying to take advantage of the parallel nature of what you're doing. The matrix multiplication should already be using multiple threads under the hood, but you could try to use multi-threading for finding the maximum values in the columns. Alternatively, this would probably be a decent candidate for kicking the work over to a GPU if you have one.
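
For the multi-threaded column maxima, a minimal sketch could look like the following (colmax! is an illustrative name; start Julia with several threads, e.g. julia -t auto):

using Base.Threads

function colmax!(vals, inds, M)
    @threads for col in axes(M, 2)         # one column per task
        best, best_i = -Inf, 1
        for row in axes(M, 1)
            x = M[row, col]
            if x > best
                best, best_i = x, row
            end
        end
        vals[col] = best
        inds[col] = best_i
    end
    return vals, inds
end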