# Pushing a function to its performance limit

Hi! I am trying to push this code to be as efficient as possible. If I could achieve any performance gain it would be great because I am calling this function thousands of time.

The problem is essentially a problem of backward induction problem. In the last period, I need to choose the option that maximizes some payoff and in the periods before I need to maximize a payoff knowing that in the future I am behaving “optimally”.

If you consider there is a more efficient solution “paradigm” or something aside from small perfomance tweaks I would be super interested as well.

``````using BenchmarkTools
using ProgressMeter
using BitOperations
using Distributions
using LinearAlgebra

J = 1000
T = 200
Π = 100*rand(J,T)
Dᵢ = rand([0],J,1)
feat1 = rand([1,2,3,4,5,6,7,8],J,1)
feat2 = rand([1,2,3,4,5,6,7,8,9,10],J,1)
Φ1 = 10
Φ2 = 10
Sj = 20*rand(J,T)
Fj = 20*rand(J,T)

Π=Π.-Sj.-Fj

# This is the Backward Induction Algorithm with independent choices

@views function myfun2(Π,Fj,Sj,prob,SeqShocks,Dᵢ,β,J,T)
D = zeros(Int64,2,2,T,J)
v = zeros(Float64,2,2,T,J)
Dstar = zeros(Int64,J,T+1)
Vstar = zeros(Float64,J,T+1)
Dstar[:,1]=Dᵢ[1:J]
S = [0 1]
Shock=[0 999999999999999999999999999]
func2inner1!(Π, Sj, S, D, v, prob, Shock, β, J, T)
funct2inner2!(Dstar, Vstar, D, v, SeqShocks, T, J)
return Dstar[:,2:end], Vstar[:,2:end]
end

function func2inner1!(Π, Sj, S, D, v, prob, Shock, β, J, T)
Ev = zeros(Float64,2,T,J)
@inbounds for t ∈ 0:T-1, j ∈ 1:J, i ∈ 1:2
for ii ∈ 1:2
if t==0 && ii == 1
payoff  = Π[j,T-t]+Sj[j,T-t]*S[i]-Shock[ii]
D[i,ii,T-t,j] = (payoff>0)
v[i,ii,T-t,j] = D[i,ii,T-t,j]*payoff
elseif ii==1
payoff_enter = Π[j,T-t]+Sj[j,T-t]*S[i]+β*Ev[2,T-t+1,j]
payoff_exit  = β*Ev[1,T-t+1,j]
d=(payoff_enter>payoff_exit)
D[i,ii,T-t,j]   = d
v[i,ii,T-t,j]   = d*(payoff_enter)+(1-d)*payoff_exit
else
D[i,ii,T-t,j]   = 0
payoff_exit  = β*Ev[1,T-t+1,j]
v[i,ii,T-t,j]   = payoff_exit
end
end
Ev[i,T-t,j]=prob*v[i,1,T-t,j]+(1-prob)*v[i,2,T-t,j]
end
end

function funct2inner2!(Dstar, Vstar, D, v, SeqShocks, T, J)
@inbounds for j=1:J, t=2:T+1
Dstar[j,t]=D[Dstar[j,t-1]+1,SeqShocks[j,t-1],t-1,j]
Vstar[j,t]=v[Dstar[j,t-1]+1,SeqShocks[j,t-1],t-1,j]
end
end

println("TIMES")
p = 0.9
SeqShocks = (rand(J,T).>p).+1

BI = @btime myfun2(Π,Fj,Sj,p,SeqShocks,Dᵢ,0.9,J,T)
``````

Have you profiled it to see what the bottlenecks are?

2 Likes

Yes, here it goes!

``````13 .\task.jl:333; (::Atom.var"#19#21"{Array{Any,1}})()
13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\comm.jl:164; handlemsg(::Dict{String,Any}, ::Dict{String,Any})
13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:86; (::Atom.var"#124#129")(::Dict{String,Any})
13 C:\Users\jmcastro\.julia\packages\Media\ItEPc\src\dynamic.jl:24; macro expansion
13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:91; macro expansion
13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\repl.jl:85; hideprompt(::Atom.var"#125#130"{String,Int64,String,Bool})
13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:92; #125
13 .\logging.jl:395; with_logstate(::Atom.var"#126#131"{String,Int64,String,Bool}, ::Base.CoreLogging.LogState)
13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:93; #126
13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:47; withpath(::Function, ::String)
13 C:\Users\jmcastro\.julia\packages\CodeTools\sf1Tz\src\utils.jl:30; withpath(::Atom.var"#127#132"{String,Int64,String,Bool}, ::String)
13 C:\Users\jmcastro\.julia\packages\Atom\lBERI\src\eval.jl:94; (::Atom.var"#127#132"{String,Int64,String,Bool})()
13 C:\Users\jmcastro\.julia\packages\CodeTools\sf1Tz\src\eval.jl:30; include_string(::Module, ::String, ::String, ::Int64)
8 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:31; myfun2(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,2}, ::Float64, ::Array{Int64,2}, ::Array{Int64,2}, ::Float64,...
3 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:0; func2inner1!(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Int64,2}, ::Array{Int64,4}, ::Array{Float64,4}, ::Float64, ::Ar...
5 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:45; func2inner1!(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Int64,2}, ::Array{Int64,4}, ::Array{Float64,4}, ::Float64, ::Ar...
1 .\float.jl:405; *
4 .\operators.jl:529; +
4 .\float.jl:401; +
4 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:32; myfun2(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,2}, ::Float64, ::Array{Int64,2}, ::Array{Int64,2}, ::Float64,...
3 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:61; funct2inner2!(::Array{Int64,2}, ::Array{Float64,2}, ::Array{Int64,4}, ::Array{Float64,4}, ::Array{Int64,2}, ::Int64, ::Int64)
1 C:\Users\jmcastro\Dropbox\Projects\julia_codes\jjj.jl:63; funct2inner2!(::Array{Int64,2}, ::Array{Float64,2}, ::Array{Int64,4}, ::Array{Float64,4}, ::Array{Int64,2}, ::Int64, ::Int64)
1 .\array.jl:745; getindex
``````

It looks like you are doing value function iteration on a Bellman equation in finite horizon. As far as I know, the fastest way to solve it is to parallelize what you can.

5 Likes

You’re right I am solving a finite horizon dynamic problem, however I don’t need to do value function iteration. Also, the reason I can’t parallelize this is that it is already part of a function that is being paralleized.

Obtener Outlook para iOS

Maybe I am wrong, but I would try to put what you do in the global scope in a main function and then declare all variable types there. Probably it also helps to do this within the other functions, as well as declaring the types of the function arguments and return values. At least there are such hints in the performance tips of Julia (and I also observed sometimes very slow execution times when variable types etc. are not declared). Also be careful when using packages, since some functions my return type ::Any, and this may result in super slow code.

No. Not at all.

2 Likes

Where is it mentioned in the performance tips that one should annotate types everywhere? It is not necessary for performance, especially not in function signatures.

You seem to imply here that this is a common case. While possible, this is super rare - and almost certainly a bug in the respective package. Are you really suggesting to check every called method from external packages?

3 Likes

I can’t get around to looking at the entire code now, but I have to ask: I’ve seen this once before, but where did you get the idea to use the `rand` function to generate zeros? (You should use `zeros(Int, J)` or `fill(0, J)`.)

Aside from that, make sure that your inner loops are over the first dimension, since Julia Arrays are column major.

2 Likes

Sorry for that imprecision. I’ve meant that one should avoid working with global variables (and there is a section about that in the performance tips).

Yes and no. As I understood it, Julia tries to find a good balance between ease of use and performance. Depending on the use case, developers may put preference on either the one or the other. But users might have the opposite preference.

Julia is greedy, it tries to be easy to use and performant. Don’t worry, most packages care a lot about performance.

2 Likes

Thanks. I just figured out that there is an issue since the order was not being “respected” between the output and input arrays. Anyways, any other recommendation is deeply appreciated.

Hi all! This is my current version of the MWE code after @DNF 's comments. It is around 1.25x faster than before. However, I’d still love to make some performance enhancements because I am calling this function around 100,000 per iteration so I really want to make sure there is no way to improve it and that there are not obvious flaws. Keep in mind I cannot parallelize this because it is embedded in an algorithm which is being parallelized.
using BenchmarkTools
using ProgressMeter
using BitOperations
using Distributions
using LinearAlgebra

``````J = 10000
T = 200
Π = 100*rand(J,T)
Dᵢ = zeros(Int, J)
feat1 = rand([1,2,3,4,5,6,7,8],J,1)
feat2 = rand([1,2,3,4,5,6,7,8,9,10],J,1)
Φ1 = 10
Φ2 = 10
Sj = 20*rand(J,T)
Fj = 20*rand(J,T)

Π=Π.-Sj.-Fj

# This is the Backward Induction Algorithm with independent choices

@views function myfun2(Π,Fj,Sj,prob,SeqShocks,Dᵢ,β,J,T)
D = zeros(Int64,2,2,J,T)
v = zeros(Float64,2,2,J,T)
Dstar = zeros(Int64,J,T+1)
Vstar = zeros(Float64,J,T+1)
Dstar[:,1]=Dᵢ[1:J]
S = [0 1]
Shock=[0 999999999999999999999999999]
func2inner1!(Π, Sj, S, D, v, prob, Shock, β, J, T)
funct2inner2!(Dstar, Vstar, D, v, SeqShocks, T, J)
return Dstar[:,2:end], Vstar[:,2:end]
end

function func2inner1!(Π, Sj, S, D, v, prob, Shock, β, J, T)
Ev = zeros(Float64,2,J,T)
@inbounds for t ∈ 0:T-1, j ∈ 1:J, i ∈ 1:2
for ii ∈ 1:2
if t==0 && ii == 1
payoff  = Π[j,T-t]+Sj[j,T-t]*S[i]-Shock[ii]
D[i,ii,j,T-t] = (payoff>0)
v[i,ii,j,T-t] = D[i,ii,j,T-t]*payoff

elseif  t==0 && ii == 2

elseif ii==1
payoff_enter = Π[j,T-t]+Sj[j,T-t]*S[i]+β*Ev[2,j,T-t+1]
payoff_exit  = β*Ev[1,j,T-t+1]
d=(payoff_enter>payoff_exit)
D[i,ii,j,T-t]   = d
v[i,ii,j,T-t]   = d*(payoff_enter)+(1-d)*payoff_exit
else
D[i,ii,j,T-t]   = 0
payoff_exit  = β*Ev[1,j,T-t+1]
v[i,ii,j,T-t]   = payoff_exit
end
end
Ev[i,j,T-t]=prob*v[i,1,j,T-t]+(1-prob)*v[i,2,j,T-t]
end
end

function funct2inner2!(Dstar, Vstar, D, v, SeqShocks, T, J)
@inbounds for j=1:J, t=2:T+1
Dstar[j,t]=D[Dstar[j,t-1]+1,SeqShocks[j,t-1],j,t-1]
Vstar[j,t]=v[Dstar[j,t-1]+1,SeqShocks[j,t-1],j,t-1]
end
end

println("TIMES")
p = 0.9
SeqShocks = (rand(J,T).>p).+1

BI = @btime myfun2(Π,Fj,Sj,p,SeqShocks,Dᵢ,0.9,J,T)
``````
1 Like

This seems like a fruitful avenue to explore! Thanks

Now, that I see this, the only index that can be reordered is the j index, so I am not sure that this is going to be super useful.

I think that all the conditions in the loop make it hard to vectorize, even for LopVectorization. I would try to rewrite it in order to remove the statements that are to be run the first iteration only and put them outside the loops. The `ii` loop can be manually unrolled to get rid of it completely.

If you do get the compiler to emit SIMD instructions, you could potentially speed it up by ~4x. You may also consider running it with `Float32` if you do not need the extra accuracy. The smaller bitwidth make twice as much data fit into cache and if it vectorizes, you get twice the throughput.

1 Like