Speeding Up a Backward Induction Code

This is the farthest I have been able to push my code, and it is still a bit slow. Any recommendation is, again, very much appreciated.

Apart from speed considerations, I would also appreciate it if someone could point out how to run this code for large K (greater than 13), because I get an out-of-memory error.

using Distributed
using BenchmarkTools
using ProgressMeter
using BitOperations
using Profile
using Traceur

J = 50
T = 20
Π = 200*rand(J,T)
Dᵢ = rand([0],J,1)
feat1 = rand([1,2,3,4,5,6,7,8],J,1)
feat2 = rand([1,2,3,4,5,6,7,8,9,10],J,1)
Φ1 = 50
Φ2 = 50
Sj = 200*ones(J)

Π=Π.-Sj

K = 11
Ind = [1,7,9,10,12,J-15,J-10,J-6,J-5,J-2,J]

function myfun1(Ind,K,D::Array{Int64},Φ1,Φ2,feat1,feat2) 
    X1c = zeros(Int8,size(D,1))
    X2c = zeros(Int8,size(D,1))
    X1l = zeros(Int8,size(D,1))
    X2l = zeros(Int8,size(D,1))
    countc = zeros(Float64,size(D,1))
    countl = zeros(Float64,size(D,1))
    feat1=feat1[Ind]
    feat2=feat2[Ind]
    for j=1:K
        for l=1:K
            X1c[j] = X1c[j] + (feat1[j]==feat1[l])*D[l]
            X1l[j] = X1l[j] + (feat2[j]==feat2[l])*D[l]
            countc[j] = countc[j] + (feat1[j]==feat1[l])
            countl[j] = countl[j] + (feat2[j]==feat2[l])
        end
    end
    X1c = Φ1.*(X1c.>0)./countc
    X1l = Φ2.*(X1l.>0)./countl
    fcc = sum(X1c)
    fcl = sum(X1l)
    return fcc, fcl
end

function myfun6(Ind,Π,Φ1,Φ2,Sj,Dᵢ,feat1,feat2,K,J,T)
    β = 0.95
    Π = Π[Ind,:]
    Sj = Sj[Ind]
    D = zeros(Int64,2^K,K,T)
    Indices =  ones(Int64,2^K,T)
    ind =  ones(Int64,T+1)
    v = zeros(Float64,2^K,T)
    Dstar = zeros(Int64,K,T+1)
    Dstar[:,1]=Dᵢ[Ind]
    Πstar=0
    ChoiceSet = zeros(Int64,2^K,K)
    GroupCosts = zeros(Float64,2^K,2)
    let int_s = fill(false,K) #pre-allocate output
        for j=0:2^K-1
            @. int_s = bget(j,0:K-1)
            ChoiceSet[j+1,:]=int_s
            (GroupCosts[j+1,1],GroupCosts[j+1,2])=myfun1(Ind,K,ChoiceSet[j+1,:],Φ1,Φ2,feat1,feat2)
        end
    end
    for t=0:T-1
        vc = zeros(Float64,2^K,2^K)
        for  i=1:2^K
            for ii=1:2^K
                if t>0
                    vc[i,ii] = sum(@view(ChoiceSet[ii,:]).*(@view(Π[:,T-t]) .+ Sj.*@view(ChoiceSet[i,:]))) .- (GroupCosts[ii,1] .+ GroupCosts[ii,2]) + β*v[ii,T-t+1]
                else
                    vc[i,ii] = sum(@view(ChoiceSet[ii,:]).*(@view(Π[:,T-t]) .+ Sj.*@view(ChoiceSet[i,:]))) .- (GroupCosts[ii,1] .+ GroupCosts[ii,2])
                end
            end
            #v[i,T-t] = maximum(@view(vc[i,:]))
            tempind = argmax(@view(vc[i,:]))
            Indices[i,T-t] = tempind
            v[i,T-t] = vc[i,tempind]
            D[i,:,T-t] = ChoiceSet[tempind,:]
        end
    end
    for t=2:T+1
        ind[t] = Indices[ind[t-1],t-1]
        Dstar[:,t]=D[ind[t],:,t-1]
    end
    Dstar = Dstar[:,2:end]
    return Dstar
end
@profile myfun6(Ind,Π,Φ1,Φ2,Sj,Dᵢ,feat1,feat2,K,J,T)
@trace(myfun6(Ind,Π,Φ1,Φ2,Sj,Dᵢ,feat1,feat2,K,J,T), modules=[Main])
f6 = @btime myfun6(Ind,Π,Φ1,Φ2,Sj,Dᵢ,feat1,feat2,K,J,T)

 10.207 s (335611985 allocations: 25.64 GiB)

I’ve got two things you can pretty easily improve in your main loop over T. First, you allocate a huge 2^K × 2^K array for vc but only actually use one row of it at a time. This array is probably why you run out of memory as you increase K. You can replace it with a single length-2^K vector that gets reused for each iteration over i.

The second is that the way you are calculating the elements of vc does a lot of allocation. Replacing the sum over broadcasted views with an explicit loop drastically reduces allocations and the time spent in GC.

My new version of that loop looks like the following:

vc = zeros(Float64, 2^K)
for t=0:T-1
    for  i=1:2^K
        fill!(vc, 0.0)
        for ii=1:2^K
            for j=1:K
                vc[ii] += ChoiceSet[ii,j]*(Π[j,T-t]+Sj[j]*ChoiceSet[i,j])
            end
            vc[ii] -= GroupCosts[ii,1] + GroupCosts[ii,2]
            if t>0
                vc[ii] += β*v[ii,T-t+1]
            end
        end
        tempind = argmax(vc)
        Indices[i,T-t] = tempind
        v[i,T-t]=vc[tempind]
        D[i,:,T-t] = ChoiceSet[tempind,:]
    end
end
5 Likes

Amazing! Thanks

As far as I can see, there is no need to keep a vector vc around at all. Just use a couple of scalars (say, vc_ii), keep track of its largest value (vc_max) and the corresponding value of ii (say, ii_max).

There are so many allocated arrays here. Surely, you can get rid of more of them?
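To make that concrete, here is a minimal sketch of the scalar version of the inner loop (best_choice is a name I made up, not in the original; the arrays and indices are assumed to be laid out as in myfun6):

```julia
# Hypothetical sketch: the inner ii loop of myfun6, rewritten with scalar
# accumulators instead of the vc vector.
function best_choice(ChoiceSet, Π, Sj, GroupCosts, v, β, K, T, t, i)
    vc_max = -Inf          # running maximum over ii
    ii_max = 0             # candidate index achieving it
    for ii in 1:2^K
        vc_ii = 0.0        # value of candidate ii, accumulated as a scalar
        for j in 1:K
            vc_ii += ChoiceSet[ii, j] * (Π[j, T - t] + Sj[j] * ChoiceSet[i, j])
        end
        vc_ii -= GroupCosts[ii, 1] + GroupCosts[ii, 2]
        if t > 0
            vc_ii += β * v[ii, T - t + 1]
        end
        if vc_ii > vc_max
            vc_max, ii_max = vc_ii, ii
        end
    end
    return vc_max, ii_max  # then: v[i,T-t] = vc_max; Indices[i,T-t] = ii_max
end
```

This way no 2^K buffer is ever allocated or refilled.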

And then a minor thing, which won’t impact performance much, but is nice to know:

rand(1:10, J)
# instead of 
rand([1,2,3,4,5,6,7,8,9,10],J,1)

The former is both faster and nicer to write. And I’m guessing you should use J instead of J, 1 for the size, since the latter makes a matrix instead of a vector.
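For illustration (the names a and b are just placeholders), the two calls differ both in the source of the draws and in the shape of the result:

```julia
J = 50
a = rand(1:10, J)                       # draws from a range → Vector{Int} of length J
b = rand([1,2,3,4,5,6,7,8,9,10], J, 1)  # draws from a Vector, two size args → J×1 Matrix{Int}
```

The J×1 matrix then forces indexing like feat1[Ind] to carry the extra dimension around for no benefit.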

1 Like

The main problem with helping you is that the code is a bit large and quite hard to read: it does a lot of convoluted operations with non-intuitive variable names.

I can tell you for sure that there are many things that could be sped up in your code; it appears to be staggeringly inefficient (no offence, I hope 🙂). But helping you will require sitting down and spending some time to understand what’s going on. Just suggesting a tweak here and there is hard, because there is so much to improve.

Just looking at myfun1 (btw, do you think you could come up with a different name for this and myfun6?), it seems to be doing 99% unnecessary work, over and over. I would not be surprised if it could be sped up two orders of magnitude.

Sorry to just drop this and then not actually help, but I just wanted to signal that the relative lack of feedback is probably due to there being too much to fix, rather than the code being nearly optimized.
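To sketch what I mean (these function names are mine, not from the original): the feature comparisons and the count vectors inside myfun1 do not depend on D at all, so they can be computed once and reused across all 2^K calls instead of being rebuilt every time:

```julia
# Hypothetical restructuring of myfun1. The match matrices and counts are
# D-independent, so hoist them out of the loop over choice vectors.
function precompute_matches(feat1, feat2, Ind)
    f1, f2 = feat1[Ind], feat2[Ind]
    K = length(Ind)
    M1 = [f1[j] == f1[l] for j in 1:K, l in 1:K]   # feat1 match matrix
    M2 = [f2[j] == f2[l] for j in 1:K, l in 1:K]   # feat2 match matrix
    count1 = vec(sum(M1, dims = 2))                 # row sums, independent of D
    count2 = vec(sum(M2, dims = 2))
    return M1, M2, count1, count2
end

# Per-D work is now just boolean lookups, with no array allocation;
# returns the same (fcc, fcl) pair as myfun1.
function group_costs(D, M1, M2, count1, count2, Φ1, Φ2)
    fcc = 0.0
    fcl = 0.0
    K = length(D)
    for j in 1:K
        hit1 = false
        hit2 = false
        for l in 1:K
            hit1 |= M1[j, l] && D[l] != 0
            hit2 |= M2[j, l] && D[l] != 0
        end
        fcc += hit1 ? Φ1 / count1[j] : 0.0
        fcl += hit2 ? Φ2 / count2[j] : 0.0
    end
    return fcc, fcl
end
```

That removes six array allocations per call and all the repeated equality tests.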

1 Like

I think that might be a bit harsh. You’re not wrong that the code could be cleaned up a bit, but from what I can tell the vast majority of the run time of this example is in the main loop I was tweaking above. By my measurement the whole rest of the function seems to take on the order of a few ms, so only that loop really matters for optimizing.

Actually adding an @inbounds on that loop drops the runtime pretty significantly also.

1 Like

Where would you add this?

@inbounds for t=0:T-1

1 Like

I hope not; it’s not my intention to be rude. But it’s the general style of the implementation, with lots of allocations and vectorized operations, that’s the problem. That’s why I think it should go beyond just some cleaning up.

A shift towards ‘scalar thinking’ could probably both speed up and clean up the code.
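As a toy example of that shift (hypothetical names), compare a vectorized reduction with its scalar counterpart:

```julia
# The vectorized form allocates a temporary array for x .* y on every call;
# the scalar loop does the same work with no temporaries at all.
dot_vectorized(x, y) = sum(x .* y)   # builds an intermediate array, then sums it

function dot_scalar(x, y)
    s = 0.0
    for i in eachindex(x, y)
        s += x[i] * y[i]             # pure scalar work, nothing allocated
    end
    return s
end
```

Applied throughout the hot loops, this style is what kills the GiB-scale allocation counts.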

1 Like

Yeah, by and large I don’t disagree with you, and you’ll notice that the improvements I suggested were in keeping with that approach. It’s just that I think it’s very important when talking about optimization to focus on the portions of the code where improvements can have the greatest impact, rather than on the things that can most obviously be sped up.

For example, one of the first things that caught my eye in this code (as it clearly did for you) was the inefficiency of myfun1, but in measurements on my machine, with the most recent example, myfun1 accounts for about 0.04% of the runtime, so it doesn’t make much sense to focus on optimizing it even if it has obvious flaws.

And in this case that approach can pay off big time: about an order of magnitude improvement in speed can be obtained with just a couple of small changes in the right places. What I meant was that though there are definitely improvements to be made in other parts of the code, many of them are unlikely to yield any meaningful performance gains. I tend to view that more as cleaning up than optimizing at that point.

2 Likes