I want to improve the performance of u
in the code below. I have tried rewriting the formula and inlining the broadcasting, which helped some. But when tried using @turbo
from LoopVectorization.jl something goes wrong and I fear that I’m misunderstanding something fundamental as the results are only vaguely similar. However, I’m not married to the idea of using @turbo
—I just want the function to be performant.
Edit: While the code below generates c
and s
as orthogonal vectors, the function should be able to handle broadcasting and so c
could for instance be a scalar while s
was a three dimensional array, or c
a matrix and s
a vector.
using LoopVectorization
using BenchmarkTools
function u(c, s, α, σ)
if σ == 1.0
return log(c^α * s^(1.0 - α))
else
return ((c^α * s^(1 - α))^(1 - σ) - 1) / (1 - σ)
end
end
function u2(c, s, α, σ)
if σ == 1.0
return α * log(c) + (1.0 - α) * log(s)
else
return (c^(α * (1.0 - σ)) * s^((1.0 - α) * (1.0 - σ)) - 1) / (1.0 - σ)
end
end
function u3(c, s, α, σ)
if σ == 1.0
return @. α * log(c) + (1.0 - α) * log(s)
else
return @. (c^(α * (1.0 - σ)) * s^((1.0 - α) * (1.0 - σ)) - 1) / (1.0 - σ)
end
end
function u4(c, s, α, σ)
if σ == 1.0
return @turbo @. α * log(c) + (1.0 - α) * log(s)
else
return @turbo @. (c^(α * (1.0 - σ)) * s^((1.0 - α) * (1.0 - σ)) - 1) / (1.0 - σ)
end
end
α = 0.3
σ = 2.0
c = collect(range(0, 10, 9))
s = Matrix(transpose(collect(range(0, 10, 11))))
@btime u.($c, $s, Ref($α), Ref($σ))
# 10.900 μs (1 allocation: 896 bytes)
@btime u2.($c, $s, Ref($α), Ref($σ))
# 7.200 μs (1 allocation: 896 bytes)
@btime u3($c, $s, $α, $σ)
# 6.725 μs (1 allocation: 896 bytes)
@btime u4($c, $s, $α, $σ)
# 1.810 μs (1 allocation: 896 bytes)
u_vals = u.(c, s, Ref(α), Ref(σ))
u2_vals = u2.(c, s, Ref(α), Ref(σ))
u3_vals = u3(c, s, α, σ)
u4_vals = u4(c, s, α, σ)
u_vals ≈ u2_vals
# true
u_vals ≈ u3_vals
# true
u_vals ≈ u4_vals
# false