Pushing a function to its performance limit

Hi! I am trying to make this code as efficient as possible. Any performance gain would be great, because I am calling this function thousands of times.

The problem is essentially a backward induction problem. In the last period, I need to choose the option that maximizes some payoff, and in the periods before that I need to maximize a payoff knowing that in the future I behave “optimally”.

If you think there is a more efficient solution “paradigm”, beyond small performance tweaks, I would be very interested in that as well.
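
(For reference, the generic backward-induction pattern I have in mind looks roughly like the sketch below; payoff, stay, and leave are placeholder names, not the ones used in my actual code, which follows.)

# Illustrative sketch of backward induction: work backwards from the last period,
# assuming optimal behaviour from the next period onwards.
function backward_induction(payoff, T)
    V = zeros(size(payoff, 1), T + 1)        # V[:, T+1] is the terminal continuation value (zero)
    policy = zeros(Int, size(payoff, 1), T)
    for t in T:-1:1, j in axes(payoff, 1)
        stay  = payoff[j, t] + V[j, t + 1]   # take the period payoff, then behave optimally
        leave = V[j, t + 1]                  # forgo this period's payoff
        policy[j, t] = stay > leave
        V[j, t] = max(stay, leave)
    end
    return policy, V
end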

using BenchmarkTools
using ProgressMeter
using BitOperations
using Distributions
using LinearAlgebra

J = 1000
T = 200
Π = 100*rand(J,T)
Dᵢ = rand([0],J,1)
feat1 = rand([1,2,3,4,5,6,7,8],J,1)
feat2 = rand([1,2,3,4,5,6,7,8,9,10],J,1)
Φ1 = 10
Φ2 = 10
Sj = 20*rand(J,T)
Fj = 20*rand(J,T)

Π=Π.-Sj.-Fj

# This is the Backward Induction Algorithm with independent choices


@views function myfun2(Π,Fj,Sj,prob,SeqShocks,Dᵢ,β,J,T)
    D = zeros(Int64,2,2,T,J)
    v = zeros(Float64,2,2,T,J)
    Dstar = zeros(Int64,J,T+1)
    Vstar = zeros(Float64,J,T+1)
    Dstar[:,1]=Dᵢ[1:J]
    S = [0 1]
    Shock=[0 999999999999999999999999999]
    func2inner1!(Π, Sj, S, D, v, prob, Shock, β, J, T)
    funct2inner2!(Dstar, Vstar, D, v, SeqShocks, T, J)
    return Dstar[:,2:end], Vstar[:,2:end]
end

function func2inner1!(Π, Sj, S, D, v, prob, Shock, β, J, T)
    Ev = zeros(Float64,2,T,J)
     @inbounds for t ∈ 0:T-1, j ∈ 1:J, i ∈ 1:2
       for ii ∈ 1:2
            if t==0 && ii == 1
                payoff  = Π[j,T-t]+Sj[j,T-t]*S[i]-Shock[ii]
                D[i,ii,T-t,j] = (payoff>0)
                v[i,ii,T-t,j] = D[i,ii,T-t,j]*payoff
            elseif ii==1
                payoff_enter = Π[j,T-t]+Sj[j,T-t]*S[i]+β*Ev[2,T-t+1,j]
                payoff_exit  = β*Ev[1,T-t+1,j]
                d=(payoff_enter>payoff_exit)
                D[i,ii,T-t,j]   = d
                v[i,ii,T-t,j]   = d*(payoff_enter)+(1-d)*payoff_exit
            else
                D[i,ii,T-t,j]   = 0
                payoff_exit  = β*Ev[1,T-t+1,j]
                v[i,ii,T-t,j]   = payoff_exit
            end
        end
        Ev[i,T-t,j]=prob*v[i,1,T-t,j]+(1-prob)*v[i,2,T-t,j]
    end
end

function funct2inner2!(Dstar, Vstar, D, v, SeqShocks, T, J)
    @inbounds for j=1:J, t=2:T+1
        Dstar[j,t]=D[Dstar[j,t-1]+1,SeqShocks[j,t-1],t-1,j]
        Vstar[j,t]=v[Dstar[j,t-1]+1,SeqShocks[j,t-1],t-1,j]
    end
end


println("TIMES")
p = 0.9
SeqShocks = (rand(J,T).>p).+1


BI = @btime myfun2(Π,Fj,Sj,p,SeqShocks,Dᵢ,0.9,J,T)

Have you profiled it to see what the bottlenecks are?
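
If not, something along these lines usually does the job (just a sketch, plugging in the arguments from your script):

using Profile
myfun2(Π, Fj, Sj, p, SeqShocks, Dᵢ, 0.9, J, T)     # run once first so compilation is not profiled
@profile myfun2(Π, Fj, Sj, p, SeqShocks, Dᵢ, 0.9, J, T)
Profile.print()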


Yes, here it is!

13 .\task.jl:333; (::Atom.var"#19#21"{Array{Any,1}})()
 13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\comm.jl:164; handlemsg(::Dict{String,Any}, ::Dict{String,Any})
  13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:86; (::Atom.var"#124#129")(::Dict{String,Any})
   13 C:\Users\jmcastro\.julia\packages\Media\ItEPc\src\dynamic.jl:24; macro expansion
    13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:91; macro expansion
     13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\repl.jl:85; hideprompt(::Atom.var"#125#130"{String,Int64,String,Bool})
      13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:92; #125
        13 .\logging.jl:395; with_logstate(::Atom.var"#126#131"{String,Int64,String,Bool}, ::Base.CoreLogging.LogState)
         13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:93; #126
          13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:47; withpath(::Function, ::String)
           13 C:\Users\jmcastro\.julia\packages\CodeTools\sf1Tz\src\utils.jl:30; withpath(::Atom.var"#127#132"{String,Int64,String,Bool}, ::String)
            13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:94; (::Atom.var"#127#132"{String,Int64,String,Bool})()
             13 C:\Users\jmcastro\.julia\packages\CodeTools\sf1Tz\src\eval.jl:30; include_string(::Module, ::String, ::String, ::Int64)
              13 .\loading.jl:1075; include_string(::Module, ::String, ::String)
               8 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:31; myfun2(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,2}, ::Float64, ::Array{Int64,2}, ::Array{Int64,2}, ::Float64,...
                3 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:0; func2inner1!(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Int64,2}, ::Array{Int64,4}, ::Array{Float64,4}, ::Float64, ::Ar...
                5 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:45; func2inner1!(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Int64,2}, ::Array{Int64,4}, ::Array{Float64,4}, ::Float64, ::Ar...
                 1 .\float.jl:405; *
                 4 .\operators.jl:529; +
                  4 .\float.jl:401; +
               4 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:32; myfun2(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,2}, ::Float64, ::Array{Int64,2}, ::Array{Int64,2}, ::Float64,...
                3 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:61; funct2inner2!(::Array{Int64,2}, ::Array{Float64,2}, ::Array{Int64,4}, ::Array{Float64,4}, ::Array{Int64,2}, ::Int64, ::Int64)
                1 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:63; funct2inner2!(::Array{Int64,2}, ::Array{Float64,2}, ::Array{Int64,4}, ::Array{Float64,4}, ::Array{Int64,2}, ::Int64, ::Int64)
                 1 .\array.jl:745; getindex
1  .\task.jl:401; task_done_hook(::Task)
 1 .\task.jl:667; wait()
  1 .\task.jl:660; poptaskref(::Base.InvasiveLinkedListSynchronized{Task})

It looks like you are doing value function iteration on a Bellman equation with a finite horizon. As far as I know, the fastest way to solve it is to parallelize whatever you can.
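
For example, if the work for each j is independent, a sketch of threading that loop could look like this (the loop body is just a stand-in; set JULIA_NUM_THREADS before starting Julia for it to make a difference):

# Sketch only: thread an independent loop over j.
out = zeros(J)
Threads.@threads for j in 1:J
    out[j] = sum(Π[j, t] for t in 1:T)   # stand-in for the per-j work
end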


Hi thanks for the reply!

You’re right that I am solving a finite-horizon dynamic problem; however, I don’t need to do value function iteration. Also, the reason I can’t parallelize this is that it is already part of a function that is being parallelized.


Maybe I am wrong, but I would try to move what you do in global scope into a main function and declare all variable types there. It probably also helps to do this within the other functions, as well as declaring the types of the function arguments and return values. At least there are hints to that effect in Julia's performance tips (and I have also sometimes observed very slow execution times when variable types etc. are not declared). Also be careful when using packages, since some functions may return type ::Any, and this can result in very slow code.

No. Not at all.


Where is it mentioned in the performance tips that one should annotate types everywhere? It is not necessary for performance, especially not in function signatures.

You seem to imply here that this is a common case. While possible, this is super rare, and almost certainly a bug in the respective package. Are you really suggesting checking every method called from external packages?


I can’t get around to looking at the entire code now, but I have to ask: I’ve seen this once before, but where did you get the idea to use the rand function to generate zeros? (You should use zeros(Int, J) or fill(0, J).)

Aside from that, make sure that your inner loops run over the first array dimension, since Julia arrays are column-major.
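
For example (nothing to do with your model, just the access pattern), summing a J×T matrix:

# Fast: the inner loop runs down the first (column) dimension, so consecutive
# iterations touch adjacent memory.
function sum_colmajor(A)
    s = 0.0
    for t in axes(A, 2), j in axes(A, 1)   # j (first index) is the innermost loop
        s += A[j, t]
    end
    return s
end

# Slow: the inner loop jumps between columns, with a stride of size(A, 1).
function sum_rowmajor(A)
    s = 0.0
    for j in axes(A, 1), t in axes(A, 2)
        s += A[j, t]
    end
    return s
end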


Sorry for the imprecision. I meant that one should avoid working with global variables (there is a section about that in the performance tips).
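
For example (just a sketch using the arrays from the script above): instead of letting a function read non-constant globals, pass them in as arguments, or declare the globals const:

# Reads non-constant globals; the compiler cannot infer their types.
slow_total() = sum(Π) + sum(Sj)

# Same work with everything passed as arguments: type-stable and fast.
fast_total(Π, Sj) = sum(Π) + sum(Sj)

# Alternatively, `const Π = 100*rand(J, T)` pins down the global's type.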

Yes and no. As I understand it, Julia tries to strike a good balance between ease of use and performance. Depending on the use case, package developers may lean towards one or the other, while users might have the opposite preference.

Julia is greedy: it tries to be both easy to use and performant. Don’t worry, most packages care a lot about performance.


Thanks. I just figured out that there was an issue: the dimension order was not consistent between the output and input arrays. Anyway, any other recommendation is much appreciated.

Hi all! This is my current version of the MWE code after @DNF's comments. It is around 1.25x faster than before. However, I’d still love further performance improvements, because I am calling this function around 100,000 times per iteration, so I really want to make sure there is no way to improve it and that there are no obvious flaws. Keep in mind I cannot parallelize this because it is embedded in an algorithm that is already being parallelized.

using BenchmarkTools
using ProgressMeter
using BitOperations
using Distributions
using LinearAlgebra

J = 10000
T = 200
Π = 100*rand(J,T)
Dᵢ = zeros(Int, J)
feat1 = rand([1,2,3,4,5,6,7,8],J,1)
feat2 = rand([1,2,3,4,5,6,7,8,9,10],J,1)
Φ1 = 10
Φ2 = 10
Sj = 20*rand(J,T)
Fj = 20*rand(J,T)

Π=Π.-Sj.-Fj

# This is the Backward Induction Algorithm with independent choices


@views function myfun2(Π,Fj,Sj,prob,SeqShocks,Dᵢ,β,J,T)
    D = zeros(Int64,2,2,J,T)
    v = zeros(Float64,2,2,J,T)
    Dstar = zeros(Int64,J,T+1)
    Vstar = zeros(Float64,J,T+1)
    Dstar[:,1]=Dᵢ[1:J]
    S = [0 1]
    Shock=[0 999999999999999999999999999]
    func2inner1!(Π, Sj, S, D, v, prob, Shock, β, J, T)
    funct2inner2!(Dstar, Vstar, D, v, SeqShocks, T, J)
    return Dstar[:,2:end], Vstar[:,2:end]
end

function func2inner1!(Π, Sj, S, D, v, prob, Shock, β, J, T)
    Ev = zeros(Float64,2,J,T)
    @inbounds for t ∈ 0:T-1, j ∈ 1:J, i ∈ 1:2
        for ii ∈ 1:2
            if t==0 && ii == 1
                payoff  = Π[j,T-t]+Sj[j,T-t]*S[i]-Shock[ii]
                D[i,ii,j,T-t] = (payoff>0)
                v[i,ii,j,T-t] = D[i,ii,j,T-t]*payoff
            elseif t==0 && ii == 2
                # nothing to do: in the last period with the bad shock,
                # D and v keep their initialized value of zero
            elseif ii==1
                payoff_enter = Π[j,T-t]+Sj[j,T-t]*S[i]+β*Ev[2,j,T-t+1]
                payoff_exit  = β*Ev[1,j,T-t+1]
                d=(payoff_enter>payoff_exit)
                D[i,ii,j,T-t]   = d
                v[i,ii,j,T-t]   = d*(payoff_enter)+(1-d)*payoff_exit
            else
                D[i,ii,j,T-t]   = 0
                payoff_exit  = β*Ev[1,j,T-t+1]
                v[i,ii,j,T-t]   = payoff_exit
            end
        end
        Ev[i,j,T-t]=prob*v[i,1,j,T-t]+(1-prob)*v[i,2,j,T-t]
    end
end

function funct2inner2!(Dstar, Vstar, D, v, SeqShocks, T, J)
    @inbounds for j=1:J, t=2:T+1
        Dstar[j,t]=D[Dstar[j,t-1]+1,SeqShocks[j,t-1],j,t-1]
        Vstar[j,t]=v[Dstar[j,t-1]+1,SeqShocks[j,t-1],j,t-1]
    end
end


println("TIMES")
p = 0.9
SeqShocks = (rand(J,T).>p).+1


BI = @btime myfun2(Π,Fj,Sj,p,SeqShocks,Dᵢ,0.9,J,T)

Maybe try https://github.com/chriselrod/LoopVectorization.jl?


This seems like a fruitful avenue to explore! Thanks

Now that I see this, the only index that can be reordered is the j index, so I am not sure this is going to be super useful.

I think all the conditionals in the loop make it hard to vectorize, even for LoopVectorization. I would try to rewrite it so that the statements that run only in the first iteration (t == 0) are moved outside the loops. The ii loop can also be manually unrolled to get rid of it completely.
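
Roughly something like this (untested sketch, reusing the variable names and array layout from your second version; the t index here runs directly over the periods T-1, ..., 1 instead of your 0:T-1 offset):

function func2inner1_unrolled!(Π, Sj, S, D, v, prob, Shock, β, J, T)
    Ev = zeros(Float64, 2, J, T)
    # Last period (your t == 0 case), pulled out of the main loop.
    @inbounds for j in 1:J, i in 1:2
        payoff = Π[j,T] + Sj[j,T]*S[i] - Shock[1]
        d = payoff > 0
        D[i,1,j,T] = d
        v[i,1,j,T] = d*payoff                 # the ii == 2 entries stay at zero
        Ev[i,j,T]  = prob*v[i,1,j,T]          # v[i,2,j,T] == 0
    end
    # Remaining periods, with the ii loop unrolled and no branches left.
    @inbounds for t in T-1:-1:1, j in 1:J, i in 1:2
        payoff_exit  = β*Ev[1,j,t+1]
        payoff_enter = Π[j,t] + Sj[j,t]*S[i] + β*Ev[2,j,t+1]
        d = payoff_enter > payoff_exit
        D[i,1,j,t] = d                        # ii == 1
        v[i,1,j,t] = d*payoff_enter + (1-d)*payoff_exit
        D[i,2,j,t] = 0                        # ii == 2
        v[i,2,j,t] = payoff_exit
        Ev[i,j,t]  = prob*v[i,1,j,t] + (1-prob)*v[i,2,j,t]
    end
end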

If you do get the compiler to emit SIMD instructions, you could potentially speed it up by ~4x. You may also consider running it with Float32 if you do not need the extra accuracy: the smaller bit width fits twice as much data into cache, and if the code vectorizes, you get twice the throughput.
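
Just a sketch of what that could look like on the caller's side (for it to actually pay off, the buffers allocated inside myfun2 would also have to follow the input element type, e.g. zeros(eltype(Π), 2, 2, J, T) instead of zeros(Float64, 2, 2, J, T)):

Π32  = Float32.(Π)          # convert the big inputs once, up front
Sj32 = Float32.(Sj)
Fj32 = Float32.(Fj)
BI32 = myfun2(Π32, Fj32, Sj32, p, SeqShocks, Dᵢ, 0.9f0, J, T)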
