I am writing out a likelihood estimation and have run into some allocation issues when coding up the Hessian. Even though the Hessian function HLL seems to mirror what the gradient function ∇LL is doing, one allocates nothing while the other allocates a lot (please see below).
using LinearAlgebra
using StaticArrays
using BenchmarkTools
struct Observation{T<:Real}
y::T
x1::T
x2::T
end
# efficiently turn Observation into a vector of statically known size
to_X(o::Observation) = SVector(1.0, o.x1, o.x2)
# this function will generate the logit data for Q4
function simulate_logit_data(; n_obs = 300000)
# define the DGP
X1 = 1.0:0.01:8.0
X2 = 18.0:0.01:40.0
α = -8.0
θ_1 = 1.5
θ_2 = 0.02
# pre-allocate the container for the data
data = Array{Observation{Float64}}(undef, n_obs)
for i in eachindex(data)
x1 = rand(X1)
x2 = rand(X2)
u = rand()
y = (α + θ_1*x1 + θ_2*x2 + log(u/(1-u)) > 0) ? 1.0 : 0.0
data[i] = Observation(y, x1, x2)
end
return data
end
# this function will compute the log-likelihood given a parameter space point
function LL(θ, data)
ll = zero(eltype(θ)) # give ll the element type of theta, which may be a Dual number
for i in eachindex(data)
obs = data[i]
Xθ = dot(θ, to_X(obs)) # precompute this quantity as a dot product
ll += -log(1 + exp(Xθ)) + obs.y * Xθ
end
return ll/length(data)
end
# compute the gradient
function ∇LL(
θ,
data
)
grad = @MArray(zeros(length(θ)))
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
grad .+= (obs.y - e/(1+e)) * X
end
return grad / length(data)
end
# compute the Hessian
function HLL(
θ,
data
)
H = @MArray(zeros(length(θ), length(θ)))
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
H .+= e/(1+e) * X*X'
end
return -H / length(data)
end
function main()
θ = @SVector(rand(3))
data = simulate_logit_data()
@btime HLL($θ, $data)
end
main()
Running this gives the following allocations:
56.739 ms (2700001 allocations: 105.29 MiB)
Can I get some help in understanding what is driving the difference? It seems like it's coming from the operation X*X', but both operands are static arrays and I checked that their product is a static matrix.
Probably H = @MArray(zeros(length(θ), length(θ))) is type-unstable: the size of the MArray depends on the runtime value of length(θ), so the compiler cannot infer a concrete type for H. In your code, it looks like length(θ) can only be 3, so try
function HLL(
θ,
data
)
H = zero(SMatrix{3, 3, Float64, 9})
for obs in data
X = to_X(obs)
e = exp(dot(θ, X))
H += e/(1+e) * X*X'
end
return -H / length(data)
end
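If you'd rather not hard-code the 3, something along these lines should also keep the size inferrable, since an SVector's length and element type are part of its type (a sketch I haven't benchmarked; the name HLL_static is just for illustration):
function HLL_static(θ::StaticVector, data)
    # the matrix size comes from θ's type, so the compiler sees a concrete SMatrix
    H = zero(SMatrix{length(θ), length(θ), eltype(θ)})
    for obs in data
        X = to_X(obs)
        e = exp(dot(θ, X))
        H += e/(1+e) * X*X'
    end
    return -H / length(data)
end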
I haven’t run anything to see what allocations happen, but MArrays are still mutable. A mutable object’s mutations must be observable through all of its (possibly multiple) references, and that is a big factor in the decision to allocate on the heap. The compiler can sometimes stack-allocate small, fixed-size mutable objects that can easily be proved not to escape the method, but in this case return -H / length(data) instantiates another mutable array that definitely escapes. That is a big reason why Vasily_Pisarev switches to SMatrix. You don’t seem to need it here, but mutation-like syntax for immutable static arrays is available via Accessors.jl.
julia> using StaticArrays, Accessors, BenchmarkTools
julia> function foo()
v = @SVector zeros(Int, 3)
for i in 1:20
@reset v[rand(1:3)] += 1
end
v
end
foo (generic function with 1 method)
julia> @btime foo()
69.121 ns (0 allocations: 0 bytes)
3-element SVector{3, Int64} with indices SOneTo(3):
10
5
5
On my computer it’s 1 allocation in Julia 1.10.7.
Have you, by chance, not posted the exact code? I get a result similar to yours when θ is a non-static vector (I hope I’ve explained the reason for that in the previous reply).
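One quick way to check is to compare inference for a static versus a plain vector θ (a sketch, I haven't run this exact snippet):
θ_static = @SVector [-7.0, 1.0, 0.0]   # length 3 is encoded in the type
θ_plain  = [-7.0, 1.0, 0.0]            # plain Vector: length only known at runtime
@code_warntype HLL(θ_static, data)     # H should infer as a concrete MMatrix
@code_warntype HLL(θ_plain, data)      # H should show up as non-concrete, hence the allocations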
It's weird: I did post the exact code, but when I exited the current session and re-ran the code in a new REPL session it only gave me one allocation. Maybe there was some contamination by globals defined in the old session. Thanks a lot.
I tried your approach and it does give me 0 allocations now, but at the expense of slightly lower performance (here it is illustrated with the gradient function):
using StaticArrays
using BenchmarkTools
using LinearAlgebra
struct Observation{T<:Real}
y::T
x1::T
x2::T
end
# efficiently turn Observation into a vector of statically known size
to_X(o::Observation) = SVector(1.0, o.x1, o.x2)
# this function will generate the logit data for Q4
function simulate_logit_data(; n_obs = 300000)
# define the DGP
X1 = 1.0:0.01:8.0
X2 = 18.0:0.01:40.0
α = -8.0
θ_1 = 1.5
θ_2 = 0.02
# pre-allocate the container for the data
data = Array{Observation{Float64}}(undef, n_obs)
for i in eachindex(data)
x1 = rand(X1)
x2 = rand(X2)
u = rand()
y = (α + θ_1*x1 + θ_2*x2 + log(u/(1-u)) > 0) ? 1.0 : 0.0
data[i] = Observation(y, x1, x2)
end
return data
end
function ∇LL(
θ,
data
)
grad = @MArray(zeros(length(θ)))
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
grad .+= (obs.y - e/(1+e)) * X
end
return grad / length(data)
end
# compute the gradient
function ∇LL_no_allocs(
θ,
data
)
grad = zero(SVector{length(θ), Float64})
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
grad += (obs.y - e/(1+e)) * X
end
return grad / length(data)
end
function gradient_descent(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = @MArray(zeros(length(θ_init)))
θ_new = @MArray(zeros(length(θ_init)))
θ_sol .= θ_init
while error > tol
θ_new .= θ_sol + τ * ∇LL(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
θ_sol .= θ_new
iter += 1
end
println("Converged in $(iter) iterations.")
return θ_sol
end
function gradient_descent_no_allocs(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = @MArray(zeros(length(θ_init)))
θ_new = @MArray(zeros(length(θ_init)))
θ_sol .= θ_init
while error > tol
θ_new .= θ_sol + τ * ∇LL_no_allocs(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
θ_sol .= θ_new
iter += 1
end
println("Converged in $(iter) iterations.")
return θ_sol
end
function main()
θ = @SVector([-7.0, 1.0, 0.0])
data = simulate_logit_data()
@btime ∇LL($θ, $data)
@btime ∇LL_no_allocs($θ, $data)
@time println(gradient_descent(θ, data))
@time println(gradient_descent_no_allocs(θ, data))
end
main()
Which outputs:
1.367 ms (1 allocation: 32 bytes)
1.465 ms (0 allocations: 0 bytes)
Converged in 33117 iterations.
[-7.759175920850232, 1.4769398872442863, 0.015585149430894672]
45.740152 seconds (70.97 k allocations: 2.998 MiB, 0.08% compilation time)
Converged in 33117 iterations.
[-7.759175920850232, 1.4769398872442863, 0.015585149430894672]
48.932687 seconds (38 allocations: 2.133 KiB)
So initially I was thinking that, even though the non-allocating version of the gradient was slightly slower, the allocations compounding in the gradient descent algorithm might reverse the performance gap, but it seems like this is not the case. Therefore I wonder whether aiming for zero allocations might not always be the best metric to follow, even when comparing the same algorithm?
The difference is so small it might be noise rather than anything important. For reference, I @btime-d the two versions a few times in conditions as ideal as I can muster (other apps closed, charging cable plugged in):
1.947 ms (1 allocation: 32 bytes)
1.947 ms (0 allocations: 0 bytes)
1.886 ms (1 allocation: 32 bytes)
1.887 ms (0 allocations: 0 bytes)
1.968 ms (1 allocation: 32 bytes)
1.888 ms (0 allocations: 0 bytes)
Interesting! However, I do think the consistency between the @btime results in my example and the fact that the gradient descent also takes a few seconds longer in the no-allocation case suggests there is something more than noise. The gradient descent calls each gradient ~33,000 times, so it seems to compound the performance difference systematically.
That said, as you point out the differences are very small, so in the end it only amounts to a few seconds, which is nothing.
It's probably best to get rid of all the printing when running the benchmarks.
There is compilation time in one of the @time benchmarks, which you probably do not want. Either run it once beforehand or use @benchmark.
@btime reports the minimum time, which isn't the best statistic when there are allocations. It isn't the heap allocations per se that really hurt performance, it is the garbage collection (especially if you start running this in parallel). GC is out of your control, so it's best to use @benchmark and compare the mean. For the settings, see: Manual · BenchmarkTools.jl
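For example, something like this (a sketch reusing the functions above; judge and mean come from BenchmarkTools and Statistics):
using BenchmarkTools, Statistics
b_alloc    = @benchmark ∇LL($θ, $data)
b_no_alloc = @benchmark ∇LL_no_allocs($θ, $data)
judge(mean(b_no_alloc), mean(b_alloc))  # compares mean time and memory rather than the minimum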
Last, try using Accessors.@reset like this:
function gradient_descent_no_allocs2(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = θ_init
while error > tol
θ_new = θ_sol + τ * ∇LL_no_allocs(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
@reset θ_sol = θ_new
iter += 1
end
# println("Converged in $(iter) iterations.")
return θ_sol
end
I reduced n_obs by 10x to 30k (I don't think it should matter) and got rid of your main function since it didn't print all the results in the REPL for me.
Benchmarks:
julia> @benchmark ∇LL($θ, $data)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 279.037 μs … 692.707 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 281.027 μs ┊ GC (median): 0.00%
Time (mean ± σ): 283.640 μs ± 17.411 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅ ▆▇▁▁▂▅▂▂▁▃▂▁▂▂▃▄▄▂▃▂▄▂▁▂▁▁ ▁ ▁ ▁ ▂
██▁▁████████████████████████████▇█▇▇███▆▆▇▆▇▃▆▇▇▄▅▇▆▆▄▄▆▅▆▄▄▅ █
279 μs Histogram: log(frequency) by time 303 μs <
Memory estimate: 32 bytes, allocs estimate: 1.
julia> @benchmark ∇LL_no_allocs($θ, $data)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 284.527 μs … 706.658 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 286.568 μs ┊ GC (median): 0.00%
Time (mean ± σ): 287.919 μs ± 13.338 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▇▆ █▅▂▂▅▃▂▂▄▂▂▁▃▁ ▂ ▁ ▂
██▄▇██████████████████▇▇█▇▆▆█▇▆▃██▃▅▅▆▄▁▅▆▆▅▄▄▄▁▄▅▃▄▄▅▄▃▄▄▄▆▄ █
285 μs Histogram: log(frequency) by time 312 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark gradient_descent($θ, $data) seconds = 20
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min … max): 8.007 s … 8.218 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.154 s ┊ GC (median): 0.00%
Time (mean ± σ): 8.126 s ± 108.340 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
8.01 s Histogram: frequency by time 8.22 s <
Memory estimate: 876.44 KiB, allocs estimate: 28046.
julia> @benchmark gradient_descent_no_allocs($θ, $data) seconds = 20
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min … max): 8.122 s … 8.238 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.123 s ┊ GC (median): 0.00%
Time (mean ± σ): 8.161 s ± 66.756 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
8.12 s Histogram: frequency by time 8.24 s <
Memory estimate: 32 bytes, allocs estimate: 1.
julia> @benchmark gradient_descent_no_allocs2($θ, $data) seconds = 20
BenchmarkTools.Trial: 3 samples with 1 evaluation.
Range (min … max): 8.062 s … 8.089 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.084 s ┊ GC (median): 0.00%
Time (mean ± σ): 8.078 s ± 14.629 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█ ▁
8.06 s Histogram: frequency by time 8.09 s <
Memory estimate: 0 bytes, allocs estimate: 0.
New MWE
using StaticArrays
using BenchmarkTools
using LinearAlgebra
using Accessors
struct Observation{T<:Real}
y::T
x1::T
x2::T
end
# efficiently turn Observation into a vector of statically known size
to_X(o::Observation) = SVector(1.0, o.x1, o.x2)
# this function will generate the logit data for Q4
function simulate_logit_data(; n_obs = 300000)
# define the DGP
X1 = 1.0:0.01:8.0
X2 = 18.0:0.01:40.0
α = -8.0
θ_1 = 1.5
θ_2 = 0.02
# pre-allocate the container for the data
data = Array{Observation{Float64}}(undef, n_obs)
for i in eachindex(data)
x1 = rand(X1)
x2 = rand(X2)
u = rand()
y = (α + θ_1*x1 + θ_2*x2 + log(u/(1-u)) > 0) ? 1.0 : 0.0
data[i] = Observation(y, x1, x2)
end
return data
end
function ∇LL(
θ,
data
)
grad = @MArray(zeros(length(θ)))
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
grad .+= (obs.y - e/(1+e)) * X
end
return grad / length(data)
end
# compute the gradient
function ∇LL_no_allocs(
θ,
data
)
grad = zero(SVector{length(θ), Float64})
for i in eachindex(data)
obs = data[i]
X = to_X(obs)
e = exp(dot(θ, X))
grad += (obs.y - e/(1+e)) * X
end
return grad / length(data)
end
function gradient_descent(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = @MArray(zeros(length(θ_init)))
θ_new = @MArray(zeros(length(θ_init)))
θ_sol .= θ_init
while error > tol
θ_new .= θ_sol + τ * ∇LL(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
θ_sol .= θ_new
iter += 1
end
# println("Converged in $(iter) iterations.")
return θ_sol
end
function gradient_descent_no_allocs(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = @MArray(zeros(length(θ_init)))
θ_new = @MArray(zeros(length(θ_init)))
θ_sol .= θ_init
while error > tol
θ_new .= θ_sol + τ * ∇LL_no_allocs(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
θ_sol .= θ_new
iter += 1
end
# println("Converged in $(iter) iterations.")
return θ_sol
end
function gradient_descent_no_allocs2(
θ_init,
data;
τ = 0.02,
tol = 1e-5
)
iter = 1
error = 1.0
θ_sol = θ_init
while error > tol
θ_new = θ_sol + τ * ∇LL_no_allocs(θ_sol, data)
error = maximum(abs.(θ_new .- θ_sol))
@reset θ_sol = θ_new
iter += 1
end
# println("Converged in $(iter) iterations.")
return θ_sol
end
θ = @SVector([-7.0, 1.0, 0.0]);
data = simulate_logit_data(n_obs = 30000);
@benchmark ∇LL($θ, $data)
@benchmark ∇LL_no_allocs($θ, $data)
@benchmark gradient_descent($θ, $data) seconds = 20
@benchmark gradient_descent_no_allocs($θ, $data) seconds = 20
@benchmark gradient_descent_no_allocs2($θ, $data) seconds = 20
x = gradient_descent(θ, data)
y = gradient_descent_no_allocs(θ, data)
z = gradient_descent_no_allocs2(θ, data)
x == y == z
It does seem like ∇LL_no_allocs may be slightly slower in your case, which is interesting. I see zero (or very little?) GC time, so a few small allocations may not matter. However, if you ever want to do this in parallel (even just multi-threading the gradient calculation), you will see a big difference if/when GC gets triggered; a rough sketch of what I mean is below.
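This is untested, and the name ∇LL_threads and the chunking scheme are only illustrative: each task accumulates its own immutable SVector over a slice of the data, and the per-task partial sums are combined at the end, so no shared state is mutated and there is no data race.
# illustrative sketch: thread-parallel gradient with per-task immutable accumulators
function ∇LL_threads(θ, data)
    chunk = cld(length(data), Threads.nthreads())
    tasks = map(Iterators.partition(eachindex(data), chunk)) do idxs
        Threads.@spawn begin
            partial = zero(SVector{length(θ), Float64})
            for i in idxs
                obs = data[i]
                X = to_X(obs)
                e = exp(dot(θ, X))
                partial += (obs.y - e/(1+e)) * X
            end
            partial
        end
    end
    # each task returns its own SVector; summing the fetched results is race-free
    return sum(fetch, tasks) / length(data)
end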
Very interesting! Thanks a lot! I did not know Accessors.jl existed. Also, I believed heap allocations were the main thing to minimize and ignored GC, so thanks for pointing that out as well.
Let me ask you about the multi-threading comment. Normally, the way I would go about multi-threading here is by pre-allocating an array and parallelizing across its indices, but with an implementation like the one proposed that would not work due to immutability, so wouldn't we run into data-race issues? I'm curious about this.