Improve performance of CRRA utility function

I want to improve the performance of u in the code below. I have tried rewriting the formula and inlining the broadcasting, which helped some. But when I tried using @turbo from LoopVectorization.jl, something goes wrong, and I fear I'm misunderstanding something fundamental, because the results are only vaguely similar. That said, I'm not married to the idea of using @turbo; I just want the function to be performant.

Edit: While the code below generates c and s as orthogonal vectors, the function should be able to handle broadcasting in general: c could, for instance, be a scalar while s is a three-dimensional array, or c a matrix and s a vector.

using LoopVectorization
using BenchmarkTools

function u(c, s, α, σ)
    if σ == 1.0
        return log(c^α * s^(1.0 - α))
    else
        return ((c^α * s^(1 - α))^(1 - σ) - 1) / (1 - σ)
    end
end

function u2(c, s, α, σ)
    if σ == 1.0
        return α * log(c) + (1.0 - α) * log(s)
    else
        return (c^(α * (1.0 - σ)) * s^((1.0 - α) * (1.0 - σ)) - 1) / (1.0 - σ)
    end
end
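To make the broadcasting requirement from the edit concrete, here is a small sketch (shapes chosen purely for illustration) of the scalar version being dotted over mixed shapes:

```julia
# Scalar CRRA utility, same formula as u2 above.
u2(c, s, α, σ) = σ == 1.0 ? α * log(c) + (1 - α) * log(s) :
    (c^(α * (1 - σ)) * s^((1 - α) * (1 - σ)) - 1) / (1 - σ)

α, σ = 0.3, 2.0

c1 = 2.0                     # scalar c ...
s1 = rand(2, 3, 4) .+ 0.1    # ... three-dimensional s
size(u2.(c1, s1, α, σ))      # (2, 3, 4)

c2 = rand(4, 5) .+ 0.1       # matrix c ...
s2 = rand(4) .+ 0.1          # ... vector s, broadcasts along dimension 1
size(u2.(c2, s2, α, σ))      # (4, 5)
```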

function u3(c, s, α, σ)
    if σ == 1.0
        return @. α * log(c) + (1.0 - α) * log(s)
    else
        return @. (c^(α * (1.0 - σ)) * s^((1.0 - α) * (1.0 - σ)) - 1) / (1.0 - σ)
    end
end

function u4(c, s, α, σ)
    if σ == 1.0
        return @turbo @. α * log(c) + (1.0 - α) * log(s)
    else
        return @turbo @. (c^(α * (1.0 - σ)) * s^((1.0 - α) * (1.0 - σ)) - 1) / (1.0 - σ)
    end
end

α = 0.3
σ = 2.0
c = collect(range(0, 10, 9))
s = Matrix(transpose(collect(range(0, 10, 11))))

@btime u.($c, $s, Ref($α), Ref($σ))
# 10.900 μs (1 allocation: 896 bytes)
@btime u2.($c, $s, Ref($α), Ref($σ))
# 7.200 μs (1 allocation: 896 bytes)
@btime u3($c, $s, $α, $σ)
# 6.725 μs (1 allocation: 896 bytes)
@btime u4($c, $s, $α, $σ)
# 1.810 μs (1 allocation: 896 bytes)

u_vals = u.(c, s, Ref(α), Ref(σ))
u2_vals = u2.(c, s, Ref(α), Ref(σ))
u3_vals = u3(c, s, α, σ)
u4_vals = u4(c, s, α, σ)

u_vals ≈ u2_vals
# true
u_vals ≈ u3_vals
# true
u_vals ≈ u4_vals
# false
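One possible culprit worth checking (my guess, not established in this thread): c here starts at 0, and with σ = 2.0 the exact result at c = 0 is Inf, since 0^(α(1 − σ)) = 0^(−0.3) overflows. SIMD special-function implementations are not guaranteed to match Base on such edge inputs:

```julia
α, σ = 0.3, 2.0
0.0^(α * (1 - σ))   # Inf: the exact value whenever c == 0 and σ > 1

# With the session above loaded, a hypothetical follow-up would be to test
# whether the mismatch is confined to the c == 0 row:
# u_vals[2:end, :] ≈ u4_vals[2:end, :]
```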

Switch to Julia 1.9.0-DEV and you will get a 2X speedup for free.

function u3(c, s, α, σ)
    σ == 1.0 && return @. α * log(c) + (1 - α) * log(s)
    return @. (c^(α * (1 - σ)) * s^((1 - α) * (1 - σ)) - 1) / (1 - σ)
end

Julia 1.9.0-DEV
#   3.753 μs (1 allocation: 896 bytes)
# u_vals ≈ u3_vals = true

Julia 1.7.0
#   6.760 μs (1 allocation: 896 bytes)
# u_vals ≈ u3_vals = true

Maybe you can experiment with muladd and some of the functions in LogExpFunctions.jl.
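For reference, a sketch of what the σ == 1 branch could look like with LogExpFunctions.xlogy, which computes x*log(y) (with the convention xlogy(0, y) == 0). The name u_log is hypothetical, and whether this is actually faster depends on the shapes involved:

```julia
using LogExpFunctions  # provides xlogy(x, y) == x * log(y), with xlogy(0, y) == 0

# Sketch of the log-utility (σ == 1) branch only.
u_log(c, s, α) = @. xlogy(α, c) + xlogy(1 - α, s)

u_log(2.0, 3.0, 0.3)  # == 0.3 * log(2) + 0.7 * log(3)
```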


Cool! But that sadly isn’t an option. I don’t have control over the version. It’s 1.7.2 for the foreseeable future.

I don’t know why @turbo did not work with broadcast, but you can use it with explicit loops.

function u5(c, s, α, σ)
    ret = Matrix{eltype(c)}(undef, length(c), length(s))
    if σ == 1.0
        @turbo for i in eachindex(c), j in eachindex(s)
            ret[i,j] = α * log(c[i]) + (1.0 - α) * log(s[j])
        end
    else
        @turbo for i in eachindex(c), j in eachindex(s)
            ret[i,j] = (c[i]^(α * (1.0 - σ)) * s[j]^((1.0 - α) * (1.0 - σ)) - 1) / (1.0 - σ)
        end
    end
    ret
end
# 509.794 ns (1 allocation: 896 bytes)

That is interesting… :thinking: But the function should be able to handle broadcasting and should not explicitly assume that c and s are orthogonal vectors. Sorry for the omission. I will add that in the question.
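One way to square @turbo with general broadcasting (a sketch only, and u6 is a hypothetical name): materialize both arguments to their common broadcast shape first, then let @turbo run over plain linear indices. This spends two extra allocations to get loops @turbo can handle; it assumes at least one argument is an array:

```julia
using LoopVectorization

function u6(c, s, α, σ)
    # Expand both inputs to the common broadcast shape (extra allocations,
    # but yields same-shaped arrays that can be indexed linearly).
    cb = @. c + zero(s)
    sb = @. s + zero(c)
    ret = similar(cb, promote_type(eltype(cb), typeof(α)))
    if σ == 1.0
        @turbo for i in eachindex(ret)
            ret[i] = α * log(cb[i]) + (1 - α) * log(sb[i])
        end
    else
        a = α * (1 - σ)          # hoist the exponents out of the loop
        b = (1 - α) * (1 - σ)
        @turbo for i in eachindex(ret)
            ret[i] = (cb[i]^a * sb[i]^b - 1) / (1 - σ)
        end
    end
    return ret
end
```

Away from edge inputs such as c == 0, this should agree with u3 for any broadcast-compatible shapes, though I have not benchmarked it against the orthogonal-vector loop above.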


muladd didn’t make a difference and LogExpFunctions.xlogy, while faster, gave exactly the same values as when I tried to use LoopVectorization.@turbo :thinking: