Improve performance of Bellman operator

Hi there,
I am creating a function for the Bellman operator, a common tool in economics and macroeconomics. The function takes an array of dimension n, where n is the number of state variables, and iterates it one period; the output is therefore an array of dimension n as well.
I am new to Julia, but I followed the performance tips I could understand and added @view where applicable. Computation time has improved, but it's still too slow in my opinion. Here's the code:

using Distributions
using Random
using BenchmarkTools

function T(V0::Array{Float64, 4}, cost_g::Matrix{Float64}, cost_v::Matrix{Float64}, n_grid_ω::Int64, n_grid_k::Int64, n_grid_c::Int64, P::Array{Float64, 3}, β::Float64, grid_Π::Matrix{Float64})
    VP = zeros(n_grid_ω, n_grid_k, n_grid_c, n_grid_c)
    EV_cg_cv = zeros(n_grid_ω, n_grid_k, n_grid_c, n_grid_c, 3)
    for i in 1:n_grid_ω::Int64, r in 1:3, cg in 1:n_grid_c::Int64, cv in 1:n_grid_c::Int64
        EV_cg_cv[i, :, cg, cv, r] = transpose(@view(P[i,:,r])) * @view(V0[:,:,cg,cv])           #Slow
    end
    EV = zeros(n_grid_ω, n_grid_k, 3)
    EV = dropdims(sum(EV_cg_cv, dims = [3;4])/(n_grid_c)^2, dims = tuple(3,4))
    for cg in 1:n_grid_c::Int64, cv in 1:n_grid_c::Int64, k in 1:n_grid_k::Int64
        cost = [0, cost_g[cg, k], cost_v[cv, k]]
        for i in 1:n_grid_ω
            VP[i, k, cg, cv] = maximum(β* EV[i,k,:] - cost) .+ @view(grid_Π[i, k])              #Slow
        end
    end
    return VP
end

nω, nk, nc = [8, 9, 10];
V = rand(nω, nk, nc, nc);
cost = rand(nc, nk);
grid_Π = rand(nω, nk);
P = randn(nω, nω, 3);
julia> @btime T(V, cost, cost, nω, nk, nc, P, 0.9, grid_Π);
  2.318 ms (33333 allocations: 3.22 MiB)

Does this appear reasonable for a problem of this size, or is it too slow? I know from profiling that much of the time is spent on the lines marked #Slow. But these two lines are also executed across four loop dimensions, so it could be normal after all.
I read that the matrix representation of data (common in MATLAB) is not so appropriate in Julia. Anything in that direction?



Did you try to use .= instead of = in these lines?

EV_cg_cv[i, :, cg, cv, r] = transpose(@view(P[i,:,r])) * @view(V0[:,:,cg,cv])           #Slow
VP[i, k, cg, cv] = maximum(β* EV[i,k,:] - cost) .+ @view(grid_Π[i, k]) 

Hi, I don’t exactly understand why I would need to broadcast the = operator.
The result of indexing EV_cg_cv[i, :, cg, cv, r] is a vector and the RHS is also a vector. In the second line VP[i, k, cg, cv] is just a scalar and the RHS is also a scalar.

In case .= helps for some other reason that I'm not getting, I tried it anyway.
2.947 ms (24433 allocations: 2.14 MiB)
Less memory allocated but takes longer.

Well, for assigning a scalar you do not need .= . But if you assign a vector element-wise (using .=), that saves allocating a new vector and therefore normally saves a lot of time…

I mean, for my performance critical code I try to achieve zero memory allocations by running any loop over pre-allocated arrays. Not sure how important that is for your application.
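A minimal sketch of the difference on a small preallocated matrix (the names and values here are made up for illustration, not taken from your code):

```julia
A = zeros(3, 4)
v = [1.0, 2.0, 3.0, 4.0]

# `=` with a slice: the RHS `2 .* v` is first materialized as a new
# temporary vector, which is then copied into the row of A.
A[1, :] = 2 .* v

# `.=` fuses the computation and writes each element directly into
# the existing row of A, with no temporary vector allocated.
A[2, :] .= 2 .* v
```

The second form is what saves allocations when it sits inside a hot loop.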


Well, MVectors or SVectors are only useful if the vectors are small, let's say fewer than 130 elements…
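For the 3-element cost vector in your inner loop that could look like this (a sketch assuming the StaticArrays.jl package is installed; the values are made up):

```julia
using StaticArrays

# SVectors are stack-allocated, so building one per loop iteration
# costs no heap allocation, unlike `[0, cost_g[cg, k], cost_v[cv, k]]`.
cost = SVector(0.0, 1.5, 2.3)
ev   = SVector(0.4, 0.9, 0.1)
β    = 0.9

# Elementwise arithmetic on SVectors also allocates nothing:
m = maximum(β .* ev .- cost)
```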

julia> T(V, cost, cost, nω, nk, nc, P, 0.9, grid_Π) ≈ T_new(V, cost, cost, nω, nk, nc, P, 0.9, grid_Π)
true

julia> @btime T($V, $cost, $cost, $nω, $nk, $nc, $P, 0.9, $grid_Π);
  2.173 ms (36027 allocations: 2.52 MiB)

julia> @btime T_new($V, $cost, $cost, $nω, $nk, $nc, $P, 0.9, $grid_Π);
  806.028 μs (7223 allocations: 677.48 KiB)

I’ve only tried to reduce obvious allocations that weren’t necessary. I couldn’t get rid of all of them because I couldn’t figure out what you’re doing in the first loop - it doesn’t like me putting .= there due to an additional singleton dimension, and using LinearAlgebra.mul! doesn’t work for some reason (I’m not that familiar with it).

There is surely more speed to be had here, but for that I’d have to know what formula exactly this is supposed to calculate.

Your original code reads very MATLAB-y or Python-y due to (in my opinion) excessive explicit vectorization. Consider writing a kernel function that does what you want to do here on the smallest required data instead, and applying that to each individually outside via broadcasting.
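A sketch of that kernel-function pattern, with hypothetical names and values (not taken from your code):

```julia
# A scalar kernel that does the work for a single grid point:
value(ev, cost, β, π) = β * ev - cost + π

# Apply it across whole arrays via broadcasting instead of
# hand-written slices and temporary vectors:
ev  = [0.1, 0.2, 0.3]
res = value.(ev, 0.05, 0.9, 1.0)
```

The broadcast `value.(...)` fuses into a single loop with no intermediate arrays, and the kernel itself stays trivial to read and test.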

Faster code, with explanations of what I changed:
function T_new(V0::Array{Float64, 4}, cost_g::Matrix{Float64}, cost_v::Matrix{Float64}, # I reformatted these to make them easier to read
               n_ω::Int64, n_k::Int64, n_c::Int64,
               P::Array{Float64, 3}, β::Float64, grid_Π::Matrix{Float64})
    VP = zeros(n_ω, n_k, n_c, n_c)
    EV_cg_cv = zeros(n_ω, n_k, n_c, n_c, 3)

    for i in 1:n_ω, r in 1:3, cg in 1:n_c, cv in 1:n_c
        P_view = @view P[i,:,r]
        V0_view = @view V0[:,:,cg,cv]

        # this is the cause of the remaining allocations that I couldn't get rid of
        # an explicit loop would work, but I'm not 100% sure what this is supposed to do
        res = transpose(P_view) * V0_view

        EV_cg_cv[i, :, cg, cv, r] = res
    end

    # No need to allocate `EV` explicitly when you then just overwrite it with the result of `sum`
    sums = sum(EV_cg_cv, dims = (3,4)) # no need for an array for the dims, a tuple suffices
    sums ./= n_c^2
    EV = dropdims(sums, dims = (3,4)) # this is free either way

    for cg in 1:n_c, cv in 1:n_c, k in 1:n_k # type annotations aren't necessary, those variables are Int already anyway
        cost = (0.0, cost_g[cg, k], cost_v[cv, k]) # again, a tuple is perfectly fine here
        for i in 1:n_ω
            m = maximum(axes(EV, 3)) do idx # avoid allocating intermediates by just iterating over axes
                β * EV[i,k,idx] - cost[idx] # index into the tuple instead of a costly new allocation
            end
            VP[i, k, cg, cv] = m + grid_Π[i, k] # our result is a scalar anyway, no need for views or broadcasting
        end
    end

    return VP
end

One additional thing I’d change is not passing in the lengths explicitly - size(arr, dim) is available for querying the size of an array in a given dimension, so you don’t have to keep everything consistent manually when you can just take the existing values (they're cheap to retrieve and stored with each array - we’re not in C or Fortran here!). That also makes your function signature much smaller: T(V0, cost_g, cost_v, P, β, grid_Π).
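For example, inside the function (a sketch using the dimension names from the original signature):

```julia
V0 = rand(8, 9, 10, 10)

# Derive the grid sizes from the array itself instead of
# passing n_grid_ω, n_grid_k, n_grid_c as extra arguments:
n_ω, n_k, n_c = size(V0, 1), size(V0, 2), size(V0, 3)
```

Now the sizes can never disagree with the array you actually pass in.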

I also haven’t investigated whether your loop order (Julia is column-major) is correct, so there may be more speed to be found there as well.

Those are not needed here (yet?). Applying standard allocation elimination alone is surely sufficient at this stage.


A few important things on views:

Always profile whether they actually help, because if the slices are small, views can slow things down;

Keep in mind that Julia uses column-major ordering, like Fortran. So you want to arrange your data structures so that you take views or slices over contiguous memory in performance-sensitive code. Not saying you broke that rule, but it can be very important.

For example, you may be better off with arrays of matrices in some cases, etc., but it can be tricky to know until you play with it. I would remove all of the explicit typing of parameters/arguments/etc. to make that easier. The only thing those annotations can do is accidentally let you shoot yourself in the foot on performance.
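A quick illustration of the contiguity point (a sketch; exact timings will vary):

```julia
A = rand(1000, 1000)

# Columns are contiguous in memory in Julia, so this view
# is traversed with sequential memory access:
col = @view A[:, 1]

# A row view strides across memory, touching one element per column:
row = @view A[1, :]

sum(col)  # fast: contiguous access
sum(row)  # slower on large arrays: strided access
```

The same logic applies to your loops: the leftmost array index should vary fastest.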

There are plenty of places where you could preallocate things before calling the function and do everything in place with no allocations, but as always, better algorithms usually beat micro-optimized code.


When you write the for loops in that manner, I believe they iterate in the opposite order from what you expect. You want i to be the fastest-moving index in your loops, but it isn’t. This is the first thing I would change.

julia> for i in 1:10, j in 1:2
           @show (i, j)
       end
(i, j) = (1, 1)
(i, j) = (1, 2)
(i, j) = (2, 1)
(i, j) = (2, 2)
(i, j) = (3, 1)
(i, j) = (3, 2)
(i, j) = (4, 1)
(i, j) = (4, 2)
(i, j) = (5, 1)
(i, j) = (5, 2)
(i, j) = (6, 1)
(i, j) = (6, 2)
(i, j) = (7, 1)
(i, j) = (7, 2)
(i, j) = (8, 1)
(i, j) = (8, 2)
(i, j) = (9, 1)
(i, j) = (9, 2)
(i, j) = (10, 1)
(i, j) = (10, 2)