I am writing out a likelihood estimation and have run into some allocation issues when coding up the Hessian. Even though the Hessian function HLL seems to mirror what the gradient function ∇LL is doing, one allocates nothing while the other allocates a lot (please see below).
using LinearAlgebra
using StaticArrays
using BenchmarkTools
struct Observation{T<:Real}
y::T
x1::T
x2::T
end
# efficiently turn Observation into a vector of statically known size
to_X(o::Observation) = SVector(1.0, o.x1, o.x2)
# this function will generate the logit data for Q4
function simulate_logit_data(; n_obs = 300000)
# define the DGP
X1 = 1.0:0.01:8.0
X2 = 18.0:0.01:40.0
α = -8.0
θ_1 = 1.5
θ_2 = 0.02
# pre-allocate the container for the data
data = Array{Observation{Float64}}(undef, n_obs)
for i in eachindex(data)
x1 = rand(X1)
x2 = rand(X2)
u = rand()
y = (α + θ_1*x1 + θ_2*x2 + log(u/(1-u)) > 0) ? 1.0 : 0.0
data[i] = Observation(y, x1, x2)
end
return data
end
# this function will compute the log-likelihood given a parameter space point
function LL(θ, data)
ll = zero(eltype(θ)) # give ll the element type of theta, which may be a Dual number
for i in eachindex(data)
obs = data[i]
Xθ = dot(θ, to_X(obs)) # precompute this quantity as a dot product
ll += -log(1 + exp(Xθ)) + obs.y * Xθ
end
return ll/length(data)
end
# compute the gradient
function ∇LL(
θ,
data
)
grad = @MArray(zeros(length(θ)))
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
grad .+= (obs.y - e/(1+e)) * X
end
return grad / length(data)
end
# compute the Hessian
function HLL(
θ,
data
)
H = @MArray(zeros(length(θ), length(θ)))
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
H .+= e/(1+e) * X*X'
end
return -H / length(data)
end
function main()
θ = @SVector(rand(3))
data = simulate_logit_data()
@btime HLL($θ, $data)
end
main()
Running this gives the following allocations:
56.739 ms (2700001 allocations: 105.29 MiB)
Can I get some help in understanding what is driving the difference? It seems like it's coming from the operation X*X', but both operands are static arrays and I checked that their product is a static matrix.
Probably H = @MArray(zeros(length(θ), length(θ))) is type-unstable: the size of the MArray depends on the runtime value of length(θ), so the compiler cannot infer a concrete type for H. In your code, it looks like length(θ) can only be 3, so try
function HLL(
θ,
data
)
H = zero(SMatrix{3, 3, Float64, 9})
for obs in data
X = to_X(obs)
e = exp(dot(θ, X))
H += e/(1+e) * X*X'
end
return -H / length(data)
end
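If you'd rather not hard-code the 3, something along these lines should also keep the size inferrable, since an SVector's length and element type are part of its type (a sketch I haven't benchmarked; the name HLL_static is just for illustration):
function HLL_static(θ::StaticVector, data)
    # the matrix size comes from θ's type, so the compiler sees a concrete SMatrix
    H = zero(SMatrix{length(θ), length(θ), eltype(θ)})
    for obs in data
        X = to_X(obs)
        e = exp(dot(θ, X))
        H += e/(1+e) * X*X'
    end
    return -H / length(data)
end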
I haven’t run anything to see what allocations happen, but MArrays are still mutable. A mutable object’s mutations must be observable through all of its (possibly multiple) references, and that is a big factor in the decision to allocate on the heap. The compiler can sometimes stack-allocate small, fixed-size mutable objects that can easily be proved not to escape the method, but in this case return -H / length(data) instantiates another mutable array that definitely escapes. That is a big reason why Vasily_Pisarev switches to SMatrix. You don’t seem to need it here, but mutation-like syntax for immutable static arrays is available via Accessors.jl.
julia> using StaticArrays, Accessors, BenchmarkTools
julia> function foo()
v = @SVector zeros(Int, 3)
for i in 1:20
@reset v[rand(1:3)] += 1
end
v
end
foo (generic function with 1 method)
julia> @btime foo()
69.121 ns (0 allocations: 0 bytes)
3-element SVector{3, Int64} with indices SOneTo(3):
10
5
5
On my computer it’s 1 allocation in Julia 1.10.7.
Have you, by chance, not posted the exact code? I get a result similar to yours when θ is a non-static vector (I hope I’ve explained the reason for that in the previous reply).
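One quick way to check is to compare inference for a static versus a plain vector θ (a sketch, I haven't run this exact snippet):
θ_static = @SVector [-7.0, 1.0, 0.0]   # length 3 is encoded in the type
θ_plain  = [-7.0, 1.0, 0.0]            # plain Vector: length only known at runtime
@code_warntype HLL(θ_static, data)     # H should infer as a concrete MMatrix
@code_warntype HLL(θ_plain, data)      # H should show up as non-concrete, hence the allocations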
It's weird: I did post the exact code, but when I exited the current session and re-ran the code in a new REPL session it only gave me one allocation. Maybe there was some contamination by globals defined in the old session. Thanks a lot.
I tried your approach and it does give me 0 allocations now, but at the expense of slightly lower performance (here it is illustrated with the gradient function):
using StaticArrays
using BenchmarkTools
using LinearAlgebra
struct Observation{T<:Real}
y::T
x1::T
x2::T
end
# efficiently turn Observation into a vector of statically known size
to_X(o::Observation) = SVector(1.0, o.x1, o.x2)
# this function will generate the logit data for Q4
function simulate_logit_data(; n_obs = 300000)
# define the DGP
X1 = 1.0:0.01:8.0
X2 = 18.0:0.01:40.0
α = -8.0
θ_1 = 1.5
θ_2 = 0.02
# pre-allocate the container for the data
data = Array{Observation{Float64}}(undef, n_obs)
for i in eachindex(data)
x1 = rand(X1)
x2 = rand(X2)
u = rand()
y = (α + θ_1*x1 + θ_2*x2 + log(u/(1-u)) > 0) ? 1.0 : 0.0
data[i] = Observation(y, x1, x2)
end
return data
end
function ∇LL(
θ,
data
)
grad = @MArray(zeros(length(θ)))
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
grad .+= (obs.y - e/(1+e)) * X
end
return grad / length(data)
end
# compute the gradient
function ∇LL_no_allocs(
θ,
data
)
grad = zero(SVector{length(θ), Float64})
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
grad += (obs.y - e/(1+e)) * X
end
return grad / length(data)
end
function gradient_descent(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = @MArray(zeros(length(θ_init)))
θ_new = @MArray(zeros(length(θ_init)))
θ_sol .= θ_init
while error > tol
θ_new .= θ_sol + τ * ∇LL(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
θ_sol .= θ_new
iter += 1
end
println("Converged in $(iter) iterations.")
return θ_sol
end
function gradient_descent_no_allocs(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = @MArray(zeros(length(θ_init)))
θ_new = @MArray(zeros(length(θ_init)))
θ_sol .= θ_init
while error > tol
θ_new .= θ_sol + τ * ∇LL_no_allocs(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
θ_sol .= θ_new
iter += 1
end
println("Converged in $(iter) iterations.")
return θ_sol
end
function main()
θ = @SVector([-7.0, 1.0, 0.0])
data = simulate_logit_data()
@btime ∇LL($θ, $data)
@btime ∇LL_no_allocs($θ, $data)
@time println(gradient_descent(θ, data))
@time println(gradient_descent_no_allocs(θ, data))
end
main()
Which outputs:
1.367 ms (1 allocation: 32 bytes)
1.465 ms (0 allocations: 0 bytes)
Converged in 33117 iterations.
[-7.759175920850232, 1.4769398872442863, 0.015585149430894672]
45.740152 seconds (70.97 k allocations: 2.998 MiB, 0.08% compilation time)
Converged in 33117 iterations.
[-7.759175920850232, 1.4769398872442863, 0.015585149430894672]
48.932687 seconds (38 allocations: 2.133 KiB)
So initially I was thinking that, even though the non-allocating version of the gradient was slightly slower, the allocations compounding in the gradient descent algorithm might reverse the performance gap, but it seems like this is not the case. Therefore I wonder whether aiming for zero allocations might not always be the best metric to follow, even when comparing the same algorithm?
The difference is so small it might be noise rather than anything important. For reference, I @btime-d the two versions a few times in conditions as ideal as I can muster (other apps closed, charging cable plugged in):
1.947 ms (1 allocation: 32 bytes)
1.947 ms (0 allocations: 0 bytes)
1.886 ms (1 allocation: 32 bytes)
1.887 ms (0 allocations: 0 bytes)
1.968 ms (1 allocation: 32 bytes)
1.888 ms (0 allocations: 0 bytes)
Interesting! However, I do think the consistency between the @btime results in my example and the fact that the gradient descent also takes a few seconds longer in the no-allocation case suggests there is something more than noise. The gradient descent calls each gradient ~33,000 times, so it seems to compound the performance difference systematically.
That said, as you point out the differences are very small, so in the end it only amounts to a few seconds, which is nothing.
It's probably best to get rid of all the printing when running the benchmarks.
There is compilation time in one of the @time benchmarks, which you probably do not want. Either run it once beforehand or use @benchmark.
@btime reports the minimum time, which isn't the best statistic when there are allocations. It isn't the heap allocations per se that really hurt performance, it is the garbage collection (especially if you start running this in parallel). GC is out of your control, so it's best to use @benchmark and compare the mean. For the settings, see: Manual · BenchmarkTools.jl
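For example, something like this (a sketch reusing the functions above; judge and mean come from BenchmarkTools and Statistics):
using BenchmarkTools, Statistics
b_alloc    = @benchmark ∇LL($θ, $data)
b_no_alloc = @benchmark ∇LL_no_allocs($θ, $data)
judge(mean(b_no_alloc), mean(b_alloc))  # compares mean time and memory rather than the minimum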
Last, try using Accessors.@reset like this:
function gradient_descent_no_allocs2(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = θ_init
while error > tol
θ_new = θ_sol + τ * ∇LL_no_allocs(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
@reset θ_sol = θ_new
iter += 1
end
# println("Converged in $(iter) iterations.")
return θ_sol
end
I reduced n_obs by 10x to 30k (I don't think it should matter) and got rid of your main function since it didn't print all the results in the REPL for me.
Benchmarks:
julia> @benchmark ∇LL($θ, $data)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 279.037 μs … 692.707 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 281.027 μs ┊ GC (median): 0.00%
Time (mean ± σ): 283.640 μs ± 17.411 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅ ▆▇▁▁▂▅▂▂▁▃▂▁▂▂▃▄▄▂▃▂▄▂▁▂▁▁ ▁ ▁ ▁ ▂
██▁▁████████████████████████████▇█▇▇███▆▆▇▆▇▃▆▇▇▄▅▇▆▆▄▄▆▅▆▄▄▅ █
279 μs Histogram: log(frequency) by time 303 μs <
Memory estimate: 32 bytes, allocs estimate: 1.
julia> @benchmark ∇LL_no_allocs($θ, $data)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 284.527 μs … 706.658 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 286.568 μs ┊ GC (median): 0.00%
Time (mean ± σ): 287.919 μs ± 13.338 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▇▆ █▅▂▂▅▃▂▂▄▂▂▁▃▁ ▂ ▁ ▂
██▄▇██████████████████▇▇█▇▆▆█▇▆▃██▃▅▅▆▄▁▅▆▆▅▄▄▄▁▄▅▃▄▄▅▄▃▄▄▄▆▄ █
285 μs Histogram: log(frequency) by time 312 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark gradient_descent($θ, $data) seconds = 20
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min … max): 8.007 s … 8.218 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.154 s ┊ GC (median): 0.00%
Time (mean ± σ): 8.126 s ± 108.340 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
8.01 s Histogram: frequency by time 8.22 s <
Memory estimate: 876.44 KiB, allocs estimate: 28046.
julia> @benchmark gradient_descent_no_allocs($θ, $data) seconds = 20
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min … max): 8.122 s … 8.238 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.123 s ┊ GC (median): 0.00%
Time (mean ± σ): 8.161 s ± 66.756 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
8.12 s Histogram: frequency by time 8.24 s <
Memory estimate: 32 bytes, allocs estimate: 1.
julia> @benchmark gradient_descent_no_allocs2($θ, $data) seconds = 20
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min … max): 8.062 s … 8.089 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.084 s ┊ GC (median): 0.00%
Time (mean ± σ): 8.078 s ± 14.629 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█ ▁
8.06 s Histogram: frequency by time 8.09 s <
Memory estimate: 0 bytes, allocs estimate: 0.
New MWE
using StaticArrays
using BenchmarkTools
using LinearAlgebra
using Accessors
struct Observation{T<:Real}
y::T
x1::T
x2::T
end
# efficiently turn Observation into a vector of statically known size
to_X(o::Observation) = SVector(1.0, o.x1, o.x2)
# this function will generate the logit data for Q4
function simulate_logit_data(; n_obs = 300000)
# define the DGP
X1 = 1.0:0.01:8.0
X2 = 18.0:0.01:40.0
α = -8.0
θ_1 = 1.5
θ_2 = 0.02
# pre-allocate the container for the data
data = Array{Observation{Float64}}(undef, n_obs)
for i in eachindex(data)
x1 = rand(X1)
x2 = rand(X2)
u = rand()
y = (α + θ_1*x1 + θ_2*x2 + log(u/(1-u)) > 0) ? 1.0 : 0.0
data[i] = Observation(y, x1, x2)
end
return data
end
function ∇LL(
θ,
data
)
grad = @MArray(zeros(length(θ)))
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
grad .+= (obs.y - e/(1+e)) * X
end
return grad / length(data)
end
# compute the gradient
function ∇LL_no_allocs(
θ,
data
)
grad = zero(SVector{length(θ), Float64})
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
grad += (obs.y - e/(1+e)) * X
end
return grad / length(data)
end
function gradient_descent(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = @MArray(zeros(length(θ_init)))
θ_new = @MArray(zeros(length(θ_init)))
θ_sol .= θ_init
while error > tol
θ_new .= θ_sol + τ * ∇LL(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
θ_sol .= θ_new
iter += 1
end
# println("Converged in $(iter) iterations.")
return θ_sol
end
function gradient_descent_no_allocs(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = @MArray(zeros(length(θ_init)))
θ_new = @MArray(zeros(length(θ_init)))
θ_sol .= θ_init
while error > tol
θ_new .= θ_sol + τ * ∇LL_no_allocs(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
θ_sol .= θ_new
iter += 1
end
# println("Converged in $(iter) iterations.")
return θ_sol
end
function gradient_descent_no_allocs2(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = θ_init
while error > tol
θ_new = θ_sol + τ * ∇LL_no_allocs(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
@reset θ_sol = θ_new
iter += 1
end
# println("Converged in $(iter) iterations.")
return θ_sol
end
θ = @SVector([-7.0, 1.0, 0.0]);
data = simulate_logit_data(n_obs = 30000);
@benchmark ∇LL($θ, $data)
@benchmark ∇LL_no_allocs($θ, $data)
@benchmark gradient_descent($θ, $data) seconds = 20
@benchmark gradient_descent_no_allocs($θ, $data) seconds = 20
@benchmark gradient_descent_no_allocs2($θ, $data) seconds = 20
x = gradient_descent(θ, data)
y = gradient_descent_no_allocs(θ, data)
z = gradient_descent_no_allocs2(θ, data)
x == y == z
It does seem like ∇LL_no_allocs may be slightly slower in your case, which is interesting. I see zero (or very little?) GC time, so a few small allocations may not matter. However, if you ever want to do this in parallel (even just multi-threading the gradient calculation), you will see a big difference if/when GC gets triggered; a rough sketch of what I mean is below.
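This is untested, and the name ∇LL_threads and the chunking scheme are only illustrative: each task accumulates its own immutable SVector over a slice of the data, and the per-task partial sums are combined at the end, so no shared state is mutated and there is no data race.
# illustrative sketch: thread-parallel gradient with per-task immutable accumulators
function ∇LL_threads(θ, data)
    chunk = cld(length(data), Threads.nthreads())
    tasks = map(Iterators.partition(eachindex(data), chunk)) do idxs
        Threads.@spawn begin
            partial = zero(SVector{length(θ), Float64})
            for i in idxs
                obs = data[i]
                X = to_X(obs)
                e = exp(dot(θ, X))
                partial += (obs.y - e/(1+e)) * X
            end
            partial
        end
    end
    # each task returns its own SVector; summing the fetched results is race-free
    return sum(fetch, tasks) / length(data)
end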
Very interesting! Thanks a lot! I did not know Accessors.jl existed. Also, I believed heap allocations were the main thing to minimize and ignored GC, so thanks for pointing that out as well.
Let me ask you about the multi-threading comment. Normally, the way I would go about multi-threading here is by pre-allocating an array and parallelizing across its indices, but with an implementation like the one proposed that would not work due to immutability, so wouldn't we run into data-race issues? I'm curious about this.