Speeding Up a Backward Induction Code

Hi all! I am solving a model, and after profiling the code I realized that the function myfun2 is the big bottleneck. I created a MWE that preserves the computational cost of this function, and I wanted to ask for help optimizing it and speeding it up as much as possible. Any suggestion or improvement would be very much appreciated.

Basically, what this code does is evaluate the payoffs of different paths. It is a small combinatorial discrete choice problem and, of course, it suffers from the curse of dimensionality. I am trying to keep the dimensions low (K < 12), but when K is bigger than 8 it already becomes really slow.

Here is the MWE:

using BenchmarkTools

J = 30
T = 20
Π = 75*rand(J,T)
Dᵢ = rand([0],J,1)
feat1 = rand([1,2],J,1)
feat2 = rand([1,2],J,1)
Φ1 = 50
Φ2 = 50
Sj = 200*ones(J)

Π=Π.-Sj

K = 10
Ind = [1,4,10,12,J-15,J-10,J-6,J-5,J-2,J]

function myfun1(Ind,K,D::Array{Int64},Φ1,Φ2,feat1,feat2) #Think about how to make this faster
    X1c = zeros(Int8,size(D,1))
    X2c = zeros(Int8,size(D,1))
    X1l = zeros(Int8,size(D,1))
    X2l = zeros(Int8,size(D,1))
    countc = zeros(Float64,size(D,1))
    countl = zeros(Float64,size(D,1))
    feat1=feat1[Ind]
    feat2=feat2[Ind]
    for j=1:K
        for l=1:K
            X1c[j] = X1c[j] + (feat1[j]==feat1[l])*D[l]
            X1l[j] = X1l[j] + (feat2[j]==feat2[l])*D[l]
            countc[j] = countc[j] + (feat1[j]==feat1[l])
            countl[j] = countl[j] + (feat2[j]==feat2[l])
        end
    end
    X1c = Φ1.*(X1c.>0)./countc
    X1l = Φ2.*(X1l.>0)./countl
    fcc = (ones(K,1)'*X1c)[1]
    fcl = (ones(K,1)'*X1l)[1]
    return fcc, fcl
end

function myfun2(Ind,Π,Φ1,Φ2,Sj,Dᵢ,feat1,feat2,K,J,T)
    β = 0.95
    Π = Π[Ind,:]
    Sj = Sj[Ind]
    D = zeros(Int64,2^K,K,T)
    Indices =  ones(Int64,2^K,T)
    ind =  ones(Int64,T+1)
    vc = zeros(Float64,2^K,2^K,T)
    v = zeros(Float64,2^K,T)
    Dstar = zeros(Int64,K,T+1)
    Dstar[:,1]=Dᵢ[Ind]
    Πstar=0
    ChoiceSet = zeros(Int64,2^K,K)
    for j=0:2^K-1
        int_s = collect(string(j, base=2, pad=K))
        int_s = parse.(Int8,int_s)
        ChoiceSet[j+1,:]=int_s
    end
    for t=0, i=1:2^K
        for ii=1:2^K
            (eg,el)=myfun1(Ind,K,ChoiceSet[ii,:],Φ1,Φ2,feat1,feat2)
            vc[i,ii,T-t] = sum(ChoiceSet[ii,:].*(Π[:,T-t] .+ Sj.*ChoiceSet[i,:])) .- (eg .+ el)
        end
        v[i,T-t] = maximum(vc[i,:,T-t])
        Indices[i,T-t] = argmax(vc[i,:,T-t])
        D[i,:,T-t] = ChoiceSet[argmax(vc[i,:,T-t]),:]
    end
    for t=1:T-1, i=1:2^K
        for ii=1:2^K
            (eg,el)=myfun1(Ind,K,ChoiceSet[ii,:],Φ1,Φ2,feat1,feat2)
            vc[i,ii,T-t] = sum(ChoiceSet[ii,:].*(Π[:,T-t] .+ Sj.*ChoiceSet[i,:])) .- (eg .+ el)+β*v[ii,T-t+1]
        end
        v[i,T-t] = maximum(vc[i,:,T-t])
        Indices[i,T-t] = argmax(vc[i,:,T-t])
        D[i,:,T-t] = ChoiceSet[argmax(vc[i,:,T-t]),:]
    end
    for t=2:T+1
        ind[t] = Indices[ind[t-1],t-1]
        Dstar[:,t]=D[ind[t],:,t-1]
    end
    for t=2:T+1
        (eg,el)=myfun1(Ind,K,Dstar[:,t],Φ1,Φ2,feat1,feat2)
        for j=1:K
            π=β^(t-2)*Dstar[j,t]*(Π[j,t-1]+Sj[j]*Dstar[j,t-1])
            Πstar=π+Πstar
        end
        Πstar = Πstar - β^(t-2)*(eg+el)
    end
    Dstar = Dstar[:,2:end]
    return Dstar, Πstar
end

@benchmark myfun2(Ind,Π,Φ1,Φ2,Sj,Dᵢ,feat1,feat2,K,J,T)

BenchmarkTools.Trial:
  memory estimate:  53.14 GiB
  allocs estimate:  440488318
  --------------
  minimum time:     32.632 s (7.66% GC)
  median time:      32.632 s (7.66% GC)
  mean time:        32.632 s (7.66% GC)
  maximum time:     32.632 s (7.66% GC)
  --------------
  samples:          1
  evals/sample:     1

It’s hard to tell from visual inspection but are you using non-constant globals inside of your functions?

I would like to say I am not, but I am not 100% certain.

The first thing I notice is that your code creates a lot of slices of arrays, like vc[i,:,T-t] etc.

In Julia, those slices create copies of the data, so you are creating copies of your arrays all over the place. You can instead use @view(vc[i, :, T - t]) to create a view into that data, which will not copy it.
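To illustrate the difference, here is a minimal sketch (not taken from the MWE):

```julia
# Slicing allocates a copy; @view returns a lightweight window into the
# parent array that shares its memory.
A = ones(3, 3)

s = A[1, :]          # copies row 1 into a fresh Vector
w = @view A[1, :]    # no copy; w aliases A's memory

A[1, 1] = 42.0
s[1]   # still 1.0 — the copy is unaffected
w[1]   # 42.0 — the view sees the mutation
```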

1 Like

It appears you’re not using non-const globals (tested by introducing a let block around the globals and the @benchmark call and interpolating the variables into the @benchmark).

Just to confirm:

That is the way it is supposed to be, right?

Yes, avoiding non-const global variables is a good thing. It’s the first thing to check for (literally, the first item in https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-tips-1).

1 Like

This certainly helped! In the MWE, the performance enhancement is not very big, but in the main code it is!

Is there any other suggestion?

I’ll probably take a look in a few hours. It definitely has some opportunities for improvement.

1 Like

Thanks so much!

If you are doing that everywhere you can use @views begin ... end to apply the @view macro to every indexing expression.
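A minimal sketch of the block form (the array names here are made up, not the MWE's):

```julia
# Inside @views begin ... end, every slicing expression becomes a view,
# so none of the slices allocate copies.
vc = reshape(collect(1.0:32.0), 4, 4, 2)
v  = zeros(4, 2)

@views begin
    for i in 1:4
        v[i, 1] = maximum(vc[i, :, 1])   # vc[i, :, 1] is a view, not a copy
    end
end
```

You can also prefix a whole function definition with `@views` to get the same effect for every slice in its body.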

2 Likes

There are definitely other optimization opportunities here, but your biggest problem seems to be that you call myfun1 way more times than you need to.
(eg,el)=myfun1(Ind,K,ChoiceSet[ii,:],Φ1,Φ2,feat1,feat2)
is in two loops over i when it only depends on ii. Precalculating this so you only run it once for each ii instead of 2^(K+1) times results in ~6x speedup of your example on my computer.

Edit: I actually got that wrong. You were calling it T*2^K times which is even worse.
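A toy sketch of the hoisting idea, with a made-up `expensive` standing in for myfun1 (wrapped in a function, which also avoids the non-const-global issue mentioned above):

```julia
expensive(ii) = sum(sin, 1:ii)   # stand-in for the costly myfun1 call

function run(n)
    # Compute the loop-invariant quantity once per ii, outside the loop
    # over i, instead of once per (i, ii) pair.
    cache = [expensive(ii) for ii in 1:n]
    total = 0.0
    for i in 1:n, ii in 1:n
        total += cache[ii]       # reuse the cached value; no repeated calls
    end
    return total
end
```

In the MWE the same trick would mean filling two length-2^K vectors with the (eg, el) pairs before entering the loops over t and i.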

1 Like

Thanks so much!!!

Also note that you are calling a maximum operation on the same object three times. E.g.:

These could be rolled into a single operation (find the max and its indices).
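For example, with findmax (a generic sketch, not the MWE's arrays):

```julia
# findmax returns the maximum value and its index in a single pass,
# replacing separate maximum and argmax calls over the same data.
x = [3.0, 7.0, 1.0]
m, idx = findmax(x)   # (7.0, 2)
```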

And if this is backward induction, it should not be necessary to allocate all of the matrices for all dates. For the date t computation, all you need is the date t+1 continuation values. Or do I misunderstand why you allocate everything by t?
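A toy sketch of that rolling-buffer idea, with a made-up `flow` payoff array rather than the MWE's vc:

```julia
# Backward induction only ever reads the date t+1 value function, so two
# vectors that get swapped each date suffice; no (2^K, T) storage needed.
function solve(flow, β)
    n, _, T = size(flow)
    v_next = zeros(n)        # continuation values at date t+1
    v_curr = similar(v_next)
    for t in T:-1:1
        for i in 1:n
            v_curr[i] = maximum(flow[i, ii, t] + β * v_next[ii] for ii in 1:n)
        end
        v_next, v_curr = v_curr, v_next   # roll the buffers
    end
    return v_next            # date-1 values
end
```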

1 Like

Yes! I just corrected this issue in the following way:

tempind = argmax(@view(vc[i,:,T-t]))
Indices[i,T-t] = tempind
v[i,T-t] = vc[i,tempind,T-t]
D[i,:,T-t] = ChoiceSet[tempind,:]

You are also right with your second observation.

It appears you are taking a very round-about way to make a Boolean array from the bits, by making a string and parsing it back.

    for j=0:2^K-1
        int_s = collect(string(j, base=2, pad=K))
        int_s = parse.(Int8,int_s)
        ChoiceSet[j+1,:]=int_s
    end

Using BitOperations.jl would make this much more direct, and without the allocations.

let int_s = fill(false, K)  # pre-allocate output
    for j = 0:2^K-1
        @. int_s = bget(j, 0:K-1)
        ChoiceSet[j+1, :] = int_s
    end
end
1 Like

Do you mean to just take sums here? Isn’t this just fcc = sum(X1c) etc?

2 Likes

Your code takes forever to run, which makes it really frustrating to benchmark and iteratively optimize. Can you scale it down by suggesting smaller, but still relevant, array sizes?

Or use the digits or digits! functions from Base.
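A sketch with digits!, reusing the K/ChoiceSet names from the MWE. Note that digits! fills bits least-significant first, the reverse of the string-parsing version, but it still enumerates every subset exactly once:

```julia
# Enumerate all binary choice vectors without strings or per-iteration
# allocations: digits! writes the base-2 digits of j into a reused buffer.
K = 3
ChoiceSet = zeros(Int64, 2^K, K)
buf = zeros(Int64, K)
for j in 0:2^K-1
    digits!(buf, j, base = 2)    # bits of j, least-significant first
    ChoiceSet[j+1, :] = buf
end
```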