# Improve performance of CRRA utility function

I want to improve the performance of `u` in the code below. I have tried rewriting the formula and inlining the broadcasting, which helped some. But when tried using `@turbo` from LoopVectorization.jl something goes wrong and I fear that I’m misunderstanding something fundamental as the results are only vaguely similar. However, I’m not married to the idea of using `@turbo`—I just want the function to be performant.

Edit: While the code below generates `c` and `s` as orthogonal vectors, the function should be able to handle broadcasting and so `c` could for instance be a scalar while `s` was a three dimensional array, or `c` a matrix and `s` a vector.

``````using LoopVectorization
using BenchmarkTools

function u(c, s, α, σ)
if σ == 1.0
return log(c^α * s^(1.0 - α))
else
return ((c^α * s^(1 - α))^(1 - σ) - 1) / (1 - σ)
end
end

function u2(c, s, α, σ)
if σ == 1.0
return α * log(c) + (1.0 - α) * log(s)
else
return (c^(α * (1.0 - σ)) * s^((1.0 - α) * (1.0 - σ)) - 1) / (1.0 - σ)
end
end

function u3(c, s, α, σ)
if σ == 1.0
return @. α * log(c) + (1.0 - α) * log(s)
else
return @. (c^(α * (1.0 - σ)) * s^((1.0 - α) * (1.0 - σ)) - 1) / (1.0 - σ)
end
end

function u4(c, s, α, σ)
if σ == 1.0
return @turbo @. α * log(c) + (1.0 - α) * log(s)
else
return @turbo @. (c^(α * (1.0 - σ)) * s^((1.0 - α) * (1.0 - σ)) - 1) / (1.0 - σ)
end
end

α = 0.3
σ = 2.0
c = collect(range(0, 10, 9))
s = Matrix(transpose(collect(range(0, 10, 11))))

@btime u.(\$c, \$s, Ref(\$α), Ref(\$σ))
# 10.900 μs (1 allocation: 896 bytes)
@btime u2.(\$c, \$s, Ref(\$α), Ref(\$σ))
# 7.200 μs (1 allocation: 896 bytes)
@btime u3(\$c, \$s, \$α, \$σ)
# 6.725 μs (1 allocation: 896 bytes)
@btime u4(\$c, \$s, \$α, \$σ)
# 1.810 μs (1 allocation: 896 bytes)

u_vals = u.(c, s, Ref(α), Ref(σ))
u2_vals = u2.(c, s, Ref(α), Ref(σ))
u3_vals = u3(c, s, α, σ)
u4_vals = u4(c, s, α, σ)

u_vals ≈ u2_vals
# true
u_vals ≈ u3_vals
# true
u_vals ≈ u4_vals
# false
``````

Switch to Julia 1.9.0-DEV and you will get a 2X speedup for free.

``````function u3(c, s, α, σ)
σ == 1.0 && return @. α * log(c) + (1 - α) * log(s)
return @. (c^(α * (1 - σ)) * s^((1 - α) * (1 - σ)) - 1) / (1 - σ)
end

Julia 1.9.0-DEV
#   3.753 μs (1 allocation: 896 bytes)
# u_vals ≈ u3_vals = true

Julia 1.7.0
#   6.760 μs (1 allocation: 896 bytes)
# u_vals ≈ u3_vals = true
``````
2 Likes

Maybe you can experiment with `muladd` and some of the functions in `LogExpFunctions.jl`.

1 Like

Cool! But that sadly isn’t an option. I don’t have control over the version. It’s 1.7.2 for the foreseeable future.

I don’t know why `@turbo` did not work with broadcast, but you can use it with explicit loops.

``````function u5(c, s, α, σ)
ret = Matrix{eltype(c)}(undef, length(c), length(s))
if σ == 1.0
@turbo for i in eachindex(c), j in eachindex(s)
ret[i,j] = α * log(c[i]) + (1.0 - α) * log(s[j])
end
else
@turbo for i in eachindex(c), j in eachindex(s)
ret[i,j] = (c[i]^(α * (1.0 - σ)) * s[j]^((1.0 - α) * (1.0 - σ)) - 1) / (1.0 - σ)
end
end
ret
end
``````
``````  509.794 ns (1 allocation: 896 bytes)
``````

That is interesting… But the function should be able to handle broadcasting and should not explicitly assume that `c` and `s` are orthogonal vectors. Sorry for the omission. I will add that in the question.

1 Like

`muladd` didn’t make a difference and `LogExpFunctions.xlogy`, while faster, gave exactly the same values as when I tried to use `LoopVectorization.@turbo`