Reducing the number of allocations in a function

andrea_bertozzi · March 13, 2025, 10:54am

Hello, I have this function. When running benchmark, I get 519 allocations. Does anyone know how I can reduce those?

Thank you!

using LinearAlgebra

struct Settings 
    rho::Float64
    g_earth::Vector{Float64}
    cd_tether::Float64
    d_tether::Float64                               # tether diameter                  [mm]
    rho_tether::Float64                             # density of Dyneema            [kg/m³]
    c_spring::Float64   
end

function objFun(stateVec, kitePos, kiteVel, windVel, tetherLength, settings)

    # settings follows same syntax as Settings3 in Tether_09.jl
    g  = abs(settings.g_earth[3])  # in this function, g is considered a scalar 

    Ns = size(windVel, 2)           # number of masses - windVel is a 3xNs matrix
    Ls = tetherLength/(Ns+1)        # segment length
    mj = settings.rho_tether*Ls     # segment mass

    p = kitePos                     # input kite position
    v = kiteVel                     # input kite velocity

    # Extract states
    theta   = stateVec[1]
    phi     = stateVec[2]
    Tn      = stateVec[3]

    # Compute omega_t
    omega_t = cross(p/(norm(p)^2),v);

    # Tether cross section
    A = pi/4*(settings.d_tether/1000)^2 # [m2] 
    # Compute Young's modulus    
    E = settings.c_spring/A

    FT = zeros(3,Ns) # Tension forces
    Fd = zeros(3,Ns) # Drag forces
    pj = zeros(3,Ns) # Mass positions
    vj = zeros(3,Ns) # Mass velocities
    aj = zeros(3,Ns) # Mass accellerations

    # First element from ground station (Ns)
    FT[:,Ns] = Tn.*[sin(theta)* cos(phi); sin(phi); cos(theta)*cos(phi)]
    pj[:,Ns] = Ls.*[sin(theta)* cos(phi); sin(phi); cos(theta)*cos(phi)]
    vj[:,Ns] = dot(v,p/norm(p)) .* (p/norm(p)) + cross(omega_t,pj[:,Ns])
    aj[:,Ns] = cross(omega_t,cross(omega_t,pj[:,Ns]))

    # Drag calculation first element
    v_a_p = vj[:,Ns] - windVel[:,Ns]
    if all(abs.(v_a_p) .< 1e-3)
        Fd[:,Ns] = [0;0;0]
    else
        v_a_p_t = (dot((pj[:,Ns])/norm(pj[:,Ns]),v_a_p)*
            ((pj[:,Ns])/norm(pj[:,Ns])))
        v_a_p_n = v_a_p - v_a_p_t
        Fd[:,Ns] = (-0.5 * settings.rho * Ls * settings.d_tether * settings.cd_tether * 
            norm(v_a_p_n) * v_a_p_n )# particle drag
    end

    # All other segments and masses except for segment connected to the kite
    for ii = Ns:-1:2
        if ii == Ns
            FT[:,ii-1] = (mj+0.5*mj)*aj[:,ii] + FT[:,ii] - Fd[:,ii] + [0;0;(mj+0.5*mj)*g]
        else
            FT[:,ii-1] = mj*aj[:,ii] + FT[:,ii] - Fd[:,ii] + [0;0;mj*g]
        end

        l_i_1 = (norm(FT[:,ii-1])/(E*A) + 1)*Ls
        
        pj[:,ii-1] = pj[:,ii] + l_i_1.*(FT[:,ii-1]/norm(FT[:,ii-1]))       
        vj[:,ii-1] = dot(v,p./norm(p)) * (p./norm(p)) + cross(omega_t,pj[:,ii-1])
        aj[:,ii-1] = cross(omega_t,cross(omega_t,pj[:,ii-1]))
        
        # Drag calculation
        v_a_p = vj[:,ii] - windVel[:,ii]
        if all(abs.(v_a_p) .< 1e-3)
            Fd[:,ii-1] = [0;0;0]
        else
            v_a_p_t = (dot((pj[:,ii-1]-pj[:,ii])/norm(pj[:,ii-1]-pj[:,ii]),v_a_p) * 
                ((pj[:,ii-1]-pj[:,ii])/norm(pj[:,ii-1]-pj[:,ii])))
            v_a_p_n = v_a_p - v_a_p_t;
            Fd[:,ii-1] = -0.5 * settings.rho * Ls * settings.d_tether * settings.cd_tether * norm(v_a_p_n) * v_a_p_n # particle drag
        end
    end

    T0 = (mj+0.5*mj).*aj[:,1] + FT[:,1] - Fd[:,1] + [0;0;(mj+0.5*mj)*g];
    l_i_1 = (norm(T0)/(E*A) + 1)*Ls;
    p0 = pj[:,1] + l_i_1.*(T0/norm(T0));
    
    Fobj = p-p0;      

    return Fobj, T0, pj, p0
end

ufechner7 · March 13, 2025, 10:59am

Can you also share the code to create the required input parameters?

It would be good to share an executable piece of code that we can execute that shows the number of allocations.

andrea_bertozzi · March 13, 2025, 11:07am

You can use this set of parameters to run the function

struct Settings 
    rho::Float64
    g_earth::Vector{Float64}
    cd_tether::Float64
    d_tether::Float64                             
    rho_tether::Float64                           
    c_spring::Float64   
end

stateVec = rand(3,)
kitePos = [100, 100, 300]
kiteVel = [0, 0, 0]
windVel = rand(3,15)
tetherLength = 500
settings = Settings(1.225, [0, 0, -9.806], 0.9, 4, 0.85, 500000)

objFun(stateVec, kitePos, kiteVel, windVel, tetherLength, settings)

ufechner7 · March 13, 2025, 11:17am

I tried your example, and this is what I get:

julia> @benchmark objFun(stateVec, kitePos, kiteVel, windVel, tetherLength, settings)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  10.420 μs …   8.663 ms  ┊ GC (min … max):  0.00% … 99.29%
 Time  (median):     12.153 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   21.091 μs ± 152.445 μs  ┊ GC (mean ± σ):  16.69% ±  2.42%

  ▂▅▇█▇▅▂                                 ▁▂▂▃▃▄▄▄▄▄▃▃▂▁▁▁     ▂
  ████████▇█▇██▆▆▆▅▅▆▆▅▆▄▃▁▁▄▄▄▃▃▃▃▁▁▃▄▅█▇███████████████████▇ █
  10.4 μs       Histogram: log(frequency) by time      36.4 μs <

 Memory estimate: 56.09 KiB, allocs estimate: 1414.

DNF · March 13, 2025, 11:17am

Your function makes allocations because it creates a lot of arrays, there’s no way around that. 512 doesn’t seem like a big number based on a quick eyeballing.

You can change the ones that are inherently 3D into StaticArrays, but you will still have arrays of StaticArrays, so allocations will happen.

ufechner7 · March 13, 2025, 11:21am

I get 1414 allocations with Julia 11, but I think the original poster uses Julia 1.10 and there the number of allocations should be half of it if I remember correctly.

Nevertheless the goal should be to bring down the allocations to zero. I see two options:

using StaticArrays
passing a pre-allocated structure or array for the temporary results to the function

DNF · March 13, 2025, 11:29am

I see another possible approach, though the codebase is a little too big to be certain without a closer look:

Don’t create arrays at all. Instead, just create the last, Nsth, values, and then iteratively update the values, while iterating over ii from Ns and down to 1. I think (without looking closely) that you don’t need to keep the values from the previous iteration.

Additionally, instead of using vectors of length 3, use SVector{3, Float64}.

This requires some re-thinking and a full re-write of your code (including updating your struct), but If you do, I think you could get close to zero allocations.

DNF · March 13, 2025, 12:12pm

You also have some other issues, such as doing the same calculation over and over, e.g. doing norm(p) several times on the same p vector, or this one:

v_a_p_t = (dot((pj[:,ii-1]-pj[:,ii])/norm(pj[:,ii-1]-pj[:,ii]),v_a_p) * 
                ((pj[:,ii-1]-pj[:,ii])/norm(pj[:,ii-1]-pj[:,ii])))

You are calculating pj[:,ii-1]-pj[:,ii] no less than four times, and taking the same norm twice, in a single expression. This is both computationally wasteful, and also adds to your allocations without any good reason.

DNF · March 13, 2025, 12:34pm

Here’s some more:

dot(v,p./norm(p)) * (p./norm(p))

This expression performs 12 multiplications, 6 additions, 6 divisions and 2 square roots as well as 3 vector allocations, while the below one gives the same result:

(dot(v, p) / sum(abs2, p)) * p

but only uses 9 multiplications, 4 additions, 1 division, and zero square roots, and a single allocation. If you use StaticArrays, there will be zero allocations.

And if you look at the full expression in your code:

(dot((pj[:,ii-1]-pj[:,ii])/norm(pj[:,ii-1]-pj[:,ii]),v_a_p) * 
                ((pj[:,ii-1]-pj[:,ii])/norm(pj[:,ii-1]-pj[:,ii])))

it does 12 multiplications, 18 additions, 6 divisions and 2 square roots and 15 vector allocations. And there is also the cost of memory lookup with the indexing.

ufechner7 · March 13, 2025, 3:55pm

I asked perplexity.ai to reduce the number of allocations and got this result:

function objFun(stateVec, kitePos, kiteVel, windVel, tetherLength, settings)
    g = abs(settings.g_earth[3])
    Ns = size(windVel, 2)
    Ls = tetherLength / (Ns + 1)
    mj = settings.rho_tether * Ls
    drag_coeff = -0.5 * settings.rho * Ls * settings.d_tether * settings.cd_tether
    A = π/4 * (settings.d_tether/1000)^2
    E = settings.c_spring / A

    # Preallocate arrays
    FT = zeros(3, Ns)
    Fd = zeros(3, Ns)
    pj = zeros(3, Ns)
    vj = zeros(3, Ns)
    aj = zeros(3, Ns)

    # Unpack state variables
    θ, φ, Tn = stateVec[1], stateVec[2], stateVec[3]

    # Precompute common values
    sinθ = sin(θ)
    cosθ = cos(θ)
    sinφ = sin(φ)
    cosφ = cos(φ)
    norm_p = norm(kitePos)
    p_unit = kitePos ./ norm_p
    v_parallel = dot(kiteVel, p_unit)
    
    # First element calculations
    FT[1, Ns] = Tn * sinθ * cosφ
    FT[2, Ns] = Tn * sinφ
    FT[3, Ns] = Tn * cosθ * cosφ

    pj[1, Ns] = Ls * sinθ * cosφ
    pj[2, Ns] = Ls * sinφ
    pj[3, Ns] = Ls * cosθ * cosφ

    # Velocity and acceleration calculations
    ω = cross(kitePos / norm_p^2, kiteVel)
    @inbounds for k in 1:3
        vj[k, Ns] = v_parallel * p_unit[k] + cross(ω, @view(pj[:, Ns]))[k]
        aj[k, Ns] = cross(ω, cross(ω, @view(pj[:, Ns])))[k]
    end

    # Drag calculation for first element
    v_a_p1 = vj[1, Ns] - windVel[1, Ns]
    v_a_p2 = vj[2, Ns] - windVel[2, Ns]
    v_a_p3 = vj[3, Ns] - windVel[3, Ns]

    if all(x -> abs(x) < 1e-3, (v_a_p1, v_a_p2, v_a_p3))
        Fd[:, Ns] .= 0.0
    else
        dir1, dir2, dir3 = pj[1, Ns]/Ls, pj[2, Ns]/Ls, pj[3, Ns]/Ls
        v_dot_dir = v_a_p1*dir1 + v_a_p2*dir2 + v_a_p3*dir3
        v_a_p_t1 = v_dot_dir * dir1
        v_a_p_t2 = v_dot_dir * dir2
        v_a_p_t3 = v_dot_dir * dir3

        v_a_p_n1 = v_a_p1 - v_a_p_t1
        v_a_p_n2 = v_a_p2 - v_a_p_t2
        v_a_p_n3 = v_a_p3 - v_a_p_t3

        norm_v_a_p_n = sqrt(v_a_p_n1^2 + v_a_p_n2^2 + v_a_p_n3^2)
        coeff = drag_coeff * norm_v_a_p_n

        Fd[1, Ns] = coeff * v_a_p_n1
        Fd[2, Ns] = coeff * v_a_p_n2
        Fd[3, Ns] = coeff * v_a_p_n3
    end

    # Process other segments
    @inbounds for ii in Ns:-1:2
        # Tension force calculations
        if ii == Ns
            mj_total = 1.5mj
            g_term = mj_total * g
        else
            mj_total = mj
            g_term = mj * g
        end

        @views for k in 1:3
            FT[k, ii-1] = mj_total * aj[k, ii] + FT[k, ii] - Fd[k, ii]
        end
        FT[3, ii-1] += g_term

        # Position calculations
        ft_norm = sqrt(FT[1, ii-1]^2 + FT[2, ii-1]^2 + FT[3, ii-1]^2)
        l_i_1 = (ft_norm/(E*A) + 1) * Ls
        ft_dir = FT[1, ii-1]/ft_norm, FT[2, ii-1]/ft_norm, FT[3, ii-1]/ft_norm

        pj[1, ii-1] = pj[1, ii] + l_i_1 * ft_dir[1]
        pj[2, ii-1] = pj[2, ii] + l_i_1 * ft_dir[2]
        pj[3, ii-1] = pj[3, ii] + l_i_1 * ft_dir[3]

        # Velocity and acceleration
        for k in 1:3
            vj[k, ii-1] = v_parallel * p_unit[k] + cross(ω, @view(pj[:, ii-1]))[k]
            aj[k, ii-1] = cross(ω, cross(ω, @view(pj[:, ii-1])))[k]
        end

        # Drag calculations
        v_a_p1 = vj[1, ii] - windVel[1, ii]
        v_a_p2 = vj[2, ii] - windVel[2, ii]
        v_a_p3 = vj[3, ii] - windVel[3, ii]

        if all(x -> abs(x) < 1e-3, (v_a_p1, v_a_p2, v_a_p3))
            Fd[:, ii-1] .= 0.0
        else
            dx = pj[1, ii-1] - pj[1, ii]
            dy = pj[2, ii-1] - pj[2, ii]
            dz = pj[3, ii-1] - pj[3, ii]
            segment_norm = sqrt(dx^2 + dy^2 + dz^2)
            dir1 = dx/segment_norm
            dir2 = dy/segment_norm
            dir3 = dz/segment_norm

            v_dot_dir = v_a_p1*dir1 + v_a_p2*dir2 + v_a_p3*dir3
            v_a_p_t1 = v_dot_dir * dir1
            v_a_p_t2 = v_dot_dir * dir2
            v_a_p_t3 = v_dot_dir * dir3

            v_a_p_n1 = v_a_p1 - v_a_p_t1
            v_a_p_n2 = v_a_p2 - v_a_p_t2
            v_a_p_n3 = v_a_p3 - v_a_p_t3

            norm_v_a_p_n = sqrt(v_a_p_n1^2 + v_a_p_n2^2 + v_a_p_n3^2)
            coeff = drag_coeff * norm_v_a_p_n

            Fd[1, ii-1] = coeff * v_a_p_n1
            Fd[2, ii-1] = coeff * v_a_p_n2
            Fd[3, ii-1] = coeff * v_a_p_n3
        end
    end

    # Final ground connection calculations
    T0_1 = 1.5mj*aj[1,1] + FT[1,1] - Fd[1,1]
    T0_2 = 1.5mj*aj[2,1] + FT[2,1] - Fd[2,1]
    T0_3 = 1.5mj*aj[3,1] + FT[3,1] - Fd[3,1] + 1.5mj*g
    T0_norm = sqrt(T0_1^2 + T0_2^2 + T0_3^2)
    
    l_i_1 = (T0_norm/(E*A) + 1) * Ls
    T0_dir1 = T0_1/T0_norm
    T0_dir2 = T0_2/T0_norm
    T0_dir3 = T0_3/T0_norm

    p0 = [pj[1,1] + l_i_1*T0_dir1, 
          pj[2,1] + l_i_1*T0_dir2,
          pj[3,1] + l_i_1*T0_dir3]

    Fobj = kitePos - p0

    return Fobj, SVector(T0_1, T0_2, T0_3), pj, p0
end

Old: 0.000036 seconds (1.41 k allocations: 56.094 KiB)
New: 0.000012 seconds (291 allocations: 13.188 KiB)

julia> @benchmark objFun(stateVec, kitePos, kiteVel, windVel, tetherLength, settings)
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.603 μs …  1.103 ms  ┊ GC (min … max):  0.00% … 99.35%
 Time  (median):     2.909 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   4.552 μs ± 25.185 μs  ┊ GC (mean ± σ):  18.26% ±  3.82%

  ▂▅██▆▄▃▃▁▁▁                                     ▂▃▄▄▄▃▂▂▁  ▂
  ████████████▇▇▇▆▆▆▄▆▅▆▃▅▄▄▃▅▅▁▄▃▄▄▃▁▃▁▁▁▁▁▁▁▄▄▇███████████ █
  2.6 μs       Histogram: log(frequency) by time     7.53 μs <

 Memory estimate: 13.19 KiB, allocs estimate: 291.

Not bad, but can probably still be improved.

ufechner7 · March 14, 2025, 9:01am

I put some manual effort into this code and could reduce the number of allocations to two:

using LinearAlgebra, StaticArrays, BenchmarkTools

const MVec3 = MVector{3, Float64}

struct Settings 
    rho::Float64
    g_earth::MVector{3, Float64}
    cd_tether::Float64
    d_tether::Float64                               # tether diameter                  [mm]
    rho_tether::Float64                             # density of Dyneema            [kg/m³]
    c_spring::Float64   
end


function objFun!(res, stateVec, kitePos, kiteVel, windVel, tetherLength, settings, buffers)
    g = abs(settings.g_earth[3])
    Ns = size(windVel, 2)
    Ls = tetherLength / (Ns + 1)
    mj = settings.rho_tether * Ls
    drag_coeff = -0.5 * settings.rho * Ls * settings.d_tether * settings.cd_tether
    A = π/4 * (settings.d_tether/1000)^2
    E = settings.c_spring / A

    # Preallocate arrays
    FT = buffers[1]
    Fd = buffers[2]
    pj = buffers[3]
    vj = buffers[4]
    aj = buffers[5]

    # Unpack state variables
    θ, φ, Tn = stateVec[1], stateVec[2], stateVec[3]

    # Precompute common values
    sinθ = sin(θ)
    cosθ = cos(θ)
    sinφ = sin(φ)
    cosφ = cos(φ)
    norm_p = norm(kitePos)
    p_unit = kitePos ./ norm_p
    v_parallel = dot(kiteVel, p_unit)
    
    # First element calculations
    FT[1, Ns] = Tn * sinθ * cosφ
    FT[2, Ns] = Tn * sinφ
    FT[3, Ns] = Tn * cosθ * cosφ

    pj[1, Ns] = Ls * sinθ * cosφ
    pj[2, Ns] = Ls * sinφ
    pj[3, Ns] = Ls * cosθ * cosφ

    # Velocity and acceleration calculations
    ω = cross(kitePos / norm_p^2, kiteVel) # 3 alloc
    a = cross(ω, MVec3(@view(pj[:, Ns])))         # 3 alloc
    b = cross(ω, cross(ω, MVec3(@view(pj[:, Ns]))))
    vj[:, Ns] .= v_parallel * p_unit + a
    aj[:, Ns] .= b

    # Drag calculation for first element
    v_a_p1 = vj[1, Ns] - windVel[1, Ns]
    v_a_p2 = vj[2, Ns] - windVel[2, Ns]
    v_a_p3 = vj[3, Ns] - windVel[3, Ns]

    if all(x -> abs(x) < 1e-3, (v_a_p1, v_a_p2, v_a_p3))
        Fd[:, Ns] .= 0.0
    else
        dir1, dir2, dir3 = pj[1, Ns]/Ls, pj[2, Ns]/Ls, pj[3, Ns]/Ls
        v_dot_dir = v_a_p1*dir1 + v_a_p2*dir2 + v_a_p3*dir3
        v_a_p_t1 = v_dot_dir * dir1
        v_a_p_t2 = v_dot_dir * dir2
        v_a_p_t3 = v_dot_dir * dir3

        v_a_p_n1 = v_a_p1 - v_a_p_t1
        v_a_p_n2 = v_a_p2 - v_a_p_t2
        v_a_p_n3 = v_a_p3 - v_a_p_t3

        norm_v_a_p_n = sqrt(v_a_p_n1^2 + v_a_p_n2^2 + v_a_p_n3^2)
        coeff = drag_coeff * norm_v_a_p_n

        Fd[1, Ns] = coeff * v_a_p_n1
        Fd[2, Ns] = coeff * v_a_p_n2
        Fd[3, Ns] = coeff * v_a_p_n3
    end

    # Process other segments
    @inbounds for ii in Ns:-1:2
        # Tension force calculations
        if ii == Ns
            mj_total = 1.5mj
            g_term = mj_total * g
        else
            mj_total = mj
            g_term = mj * g
        end

        for k in 1:3
            FT[k, ii-1] = mj_total * aj[k, ii] + FT[k, ii] - Fd[k, ii]
        end
        FT[3, ii-1] += g_term

        # Position calculations
        ft_norm = sqrt(FT[1, ii-1]^2 + FT[2, ii-1]^2 + FT[3, ii-1]^2)
        l_i_1 = (ft_norm/(E*A) + 1) * Ls
        ft_dir = FT[1, ii-1]/ft_norm, FT[2, ii-1]/ft_norm, FT[3, ii-1]/ft_norm

        pj[1, ii-1] = pj[1, ii] + l_i_1 * ft_dir[1]
        pj[2, ii-1] = pj[2, ii] + l_i_1 * ft_dir[2]
        pj[3, ii-1] = pj[3, ii] + l_i_1 * ft_dir[3]

        # Velocity and acceleration
        a = cross(ω, MVec3(@view(pj[:, ii-1])))           # 28 allocations
        b = cross(ω, cross(ω, MVec3(@view(pj[:, ii-1])))) # 28 allocations
        vj[:, ii-1] .= v_parallel * p_unit + a
        aj[:, ii-1] .= b

        # Drag calculations
        v_a_p1 = vj[1, ii] - windVel[1, ii]
        v_a_p2 = vj[2, ii] - windVel[2, ii]
        v_a_p3 = vj[3, ii] - windVel[3, ii]

        if all(x -> abs(x) < 1e-3, (v_a_p1, v_a_p2, v_a_p3))
            Fd[:, ii-1] .= 0.0
        else
            dx = pj[1, ii-1] - pj[1, ii]
            dy = pj[2, ii-1] - pj[2, ii]
            dz = pj[3, ii-1] - pj[3, ii]
            segment_norm = sqrt(dx^2 + dy^2 + dz^2)
            dir1 = dx/segment_norm
            dir2 = dy/segment_norm
            dir3 = dz/segment_norm

            v_dot_dir = v_a_p1*dir1 + v_a_p2*dir2 + v_a_p3*dir3
            v_a_p_t1 = v_dot_dir * dir1
            v_a_p_t2 = v_dot_dir * dir2
            v_a_p_t3 = v_dot_dir * dir3

            v_a_p_n1 = v_a_p1 - v_a_p_t1
            v_a_p_n2 = v_a_p2 - v_a_p_t2
            v_a_p_n3 = v_a_p3 - v_a_p_t3

            norm_v_a_p_n = sqrt(v_a_p_n1^2 + v_a_p_n2^2 + v_a_p_n3^2)
            coeff = drag_coeff * norm_v_a_p_n

            Fd[1, ii-1] = coeff * v_a_p_n1
            Fd[2, ii-1] = coeff * v_a_p_n2
            Fd[3, ii-1] = coeff * v_a_p_n3
        end
    end

    # Final ground connection calculations
    T0_1 = 1.5mj*aj[1,1] + FT[1,1] - Fd[1,1]
    T0_2 = 1.5mj*aj[2,1] + FT[2,1] - Fd[2,1]
    T0_3 = 1.5mj*aj[3,1] + FT[3,1] - Fd[3,1] + 1.5mj*g
    T0_norm = sqrt(T0_1^2 + T0_2^2 + T0_3^2)
    
    l_i_1 = (T0_norm/(E*A) + 1) * Ls
    T0_dir1 = T0_1/T0_norm
    T0_dir2 = T0_2/T0_norm
    T0_dir3 = T0_3/T0_norm

    p0 = [pj[1,1] + l_i_1*T0_dir1, 
          pj[2,1] + l_i_1*T0_dir2,
          pj[3,1] + l_i_1*T0_dir3]

    res .= kitePos - p0
    nothing
end

stateVec = MVector{3}(rand(3,))
kitePos = SVector{3}([100, 100, 300])
kiteVel = SVector{3}([0, 0, 0])
windVel = SMatrix{3, 15}(rand(3,15))
tetherLength = 500
settings = Settings(1.225, [0, 0, -9.806], 0.9, 4, 0.85, 500000)
Ns = size(windVel, 2)
buffers= [zeros(3, Ns), zeros(3, Ns), zeros(3, Ns), zeros(3, Ns), zeros(3, Ns)]
res = zeros(3)

objFun!(res, stateVec, kitePos, kiteVel, windVel, tetherLength, settings, buffers)
@benchmark objFun!(res, stateVec, kitePos, kiteVel, windVel, tetherLength, settings, buffers)

The code is now about 34 times faster than the original code:

julia> include("src/bench_qsm.jl")
BenchmarkTools.Trial: 10000 samples with 186 evaluations per sample.
 Range (min … max):  552.016 ns …  33.646 μs  ┊ GC (min … max): 0.00% … 97.83%
 Time  (median):     556.575 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   570.646 ns ± 463.443 ns  ┊ GC (mean ± σ):  1.27% ±  1.83%

  ▁ ▁██▆▅▃▂▁▁▁▁▁ ▂▄▃▂▂▂▂▁▁▁▁▄▄▃▂▁▁▁▁▁ ▁▁     ▁                  ▂
  █████████████████████████████████████████████▇█▇▆▆▇▆▆▆▆▆▆▆▆▅▆ █
  552 ns        Histogram: log(frequency) by time        606 ns <

 Memory estimate: 80 bytes, allocs estimate: 2.

ufechner7 · March 14, 2025, 9:21am

ufechner7:

    T0_1 = 1.5mj*aj[1,1] + FT[1,1] - Fd[1,1]
    T0_2 = 1.5mj*aj[2,1] + FT[2,1] - Fd[2,1]
    T0_3 = 1.5mj*aj[3,1] + FT[3,1] - Fd[3,1] + 1.5mj*g
    T0_norm = sqrt(T0_1^2 + T0_2^2 + T0_3^2)
    
    l_i_1 = (T0_norm/(E*A) + 1) * Ls
    T0_dir1 = T0_1/T0_norm
    T0_dir2 = T0_2/T0_norm
    T0_dir3 = T0_3/T0_norm

    p0 = [pj[1,1] + l_i_1*T0_dir1, 
          pj[2,1] + l_i_1*T0_dir2,
          pj[3,1] + l_i_1*T0_dir3]

Open question: The code is looking now more verbose than needed:

    T0_1 = 1.5mj*aj[1,1] + FT[1,1] - Fd[1,1]
    T0_2 = 1.5mj*aj[2,1] + FT[2,1] - Fd[2,1]
    T0_3 = 1.5mj*aj[3,1] + FT[3,1] - Fd[3,1] + 1.5mj*g
    T0_norm = sqrt(T0_1^2 + T0_2^2 + T0_3^2)
    
    l_i_1 = (T0_norm/(E*A) + 1) * Ls
    T0_dir1 = T0_1/T0_norm
    T0_dir2 = T0_2/T0_norm
    T0_dir3 = T0_3/T0_norm

    p0 = [pj[1,1] + l_i_1*T0_dir1, 
          pj[2,1] + l_i_1*T0_dir2,
          pj[3,1] + l_i_1*T0_dir3]

How can I vectorize this again without adding memory allocations?

DNF · March 14, 2025, 10:51am

Here’s a version with zero allocations, that does not need any arrays:

using LinearAlgebra
using StaticArrays

struct Settings 
    rho::Float64
    g_earth::SVector{3, Float64}
    cd_tether::Float64
    d_tether::Float64 
    rho_tether::Float64
    c_spring::Float64   
end

function objFun_new(stateVec, kitePos, kiteVel, windVel, tetherLength, settings)

    # settings follows same syntax as Settings3 in Tether_09.jl
    g  = abs(settings.g_earth[3])  # in this function, g is considered a scalar 

    Ns = length(windVel)           # number of masses - windVel is a 3xNs matrix
    Ls = tetherLength / (Ns+1)        # segment length
    mj = settings.rho_tether*Ls     # segment mass

    p = kitePos                     # input kite position
    p_unit = normalize(p)
    v = kiteVel                     # input kite velocity

    # Extract states
    theta   = stateVec[1]
    phi     = stateVec[2]
    Tn      = stateVec[3]

    # Compute omega_t
    omega_t = cross(p / sum(abs2, p), v);

    # Tether cross section
    A = pi/4*(settings.d_tether/1000)^2 # [m2] 
    # Compute Young's modulus    
    E = settings.c_spring/A

    # # First element from ground station (Ns)
    # TODO: a lot of repeated computation below
    FT = Tn .* SA[sin(theta)* cos(phi), sin(phi), cos(theta)*cos(phi)]
    pj = Ls .* SA[sin(theta)* cos(phi), sin(phi), cos(theta)*cos(phi)]
    pj_unit = normalize(pj)
    vj = dot(v, p_unit) * p_unit + cross(omega_t, pj)
    aj = cross(omega_t, cross(omega_t, pj))

    # Drag calculation first element
    v_a_p = vj - windVel[end]
    if all(abs.(v_a_p) .< 1e-3)
        Fd = SA[0.0, 0.0, 0.0]
    else
        v_a_p_t = dot(pj_unit, v_a_p) * pj_unit
        v_a_p_n = v_a_p - v_a_p_t
        Fd = (-0.5 * settings.rho * Ls * settings.d_tether * settings.cd_tether * 
            norm(v_a_p_n) * v_a_p_n )# particle drag
    end

    # All other segments and masses except for segment connected to the kite
    for ii = Ns:-1:2
        if ii == Ns
            FT = (mj+0.5*mj) * aj + FT - Fd + SA[0, 0, (mj+0.5*mj)*g]
        else
            FT = mj*aj + FT - Fd + SA[0, 0, mj*g]
        end

        FT_norm = norm(FT)
        l_i_1 = (FT_norm / (E*A) + 1) * Ls
        
        # Drag calculation
        v_a_p = vj - windVel[ii] # moving this up here

        Δpj = (l_i_1/FT_norm) * FT
        pj += Δpj
        vj = dot(v, p_unit) * p_unit + cross(omega_t, pj)
        aj = cross(omega_t, cross(omega_t, pj))
        
        if all(abs.(v_a_p) .< 1e-3)
            Fd = SA[0.0, 0.0, 0.0]
        else
            v_a_p_t = (dot(Δpj, v_a_p) / sum(abs2, Δpj)) * Δpj
            v_a_p_n = v_a_p - v_a_p_t;
            Fd = -0.5 * settings.rho * Ls * settings.d_tether * settings.cd_tether * norm(v_a_p_n) * v_a_p_n # particle drag
        end
    end

    T0 = (mj+0.5*mj) .* aj + FT - Fd + SA[0, 0, (mj+0.5*mj)*g]
    l_i_1 = (norm(T0)/(E*A) + 1) * Ls
    p0 = pj + l_i_1 .* (T0 / norm(T0))
    
    Fobj = p - p0

    return Fobj, T0, pj, p0
end

stateVec = rand(SVector{3, Float64})
kitePos = SA_F64[100, 100, 300]
kiteVel = SA_F64[0, 0, 0]
windVel = rand(SVector{3, Float64}, 15)
tetherLength = 500.0
settings = Settings(1.225, SA_F64[0, 0, -9.806], 0.9, 4, 0.85, 500000)

objFun(stateVec, kitePos, kiteVel, windVel, tetherLength, settings)

Note that it only returns the final value of pj, not the entire set of 15 values.

Most of the time is spent running the norm function on SVectors, so I don’t think there is that much more performance to be found, unless one can avoid the square root (which I have in some places.)

ufechner7 · March 14, 2025, 12:05pm

Impressive work! I changed your code into a mutating function:

using LinearAlgebra, StaticArrays, BenchmarkTools

struct Settings 
    rho::Float64
    g_earth::SVector{3, Float64}
    cd_tether::Float64
    d_tether::Float64 
    rho_tether::Float64
    c_spring::Float64   
end

function objFun!(res, stateVec, kitePos, kiteVel, windVel, tetherLength, settings)

    # settings follows same syntax as Settings3 in Tether_09.jl
    g  = abs(settings.g_earth[3])  # in this function, g is considered a scalar 

    Ns = length(windVel)           # number of masses - windVel is a 3xNs matrix
    Ls = tetherLength / (Ns+1)        # segment length
    mj = settings.rho_tether*Ls     # segment mass

    p = kitePos                     # input kite position
    p_unit = normalize(p)
    v = kiteVel                     # input kite velocity

    # Extract states
    theta   = stateVec[1]
    phi     = stateVec[2]
    Tn      = stateVec[3]

    # Compute omega_t
    omega_t = cross(p / sum(abs2, p), v);

    # Tether cross section
    A = pi/4*(settings.d_tether/1000)^2 # [m2] 
    # Compute Young's modulus    
    E = settings.c_spring/A

    # # First element from ground station (Ns)
    # TODO: a lot of repeated computation below
    FT = Tn .* SA[sin(theta)* cos(phi), sin(phi), cos(theta)*cos(phi)]
    pj = Ls .* SA[sin(theta)* cos(phi), sin(phi), cos(theta)*cos(phi)]
    pj_unit = normalize(pj)
    vj = dot(v, p_unit) * p_unit + cross(omega_t, pj)
    aj = cross(omega_t, cross(omega_t, pj))

    # Drag calculation first element
    v_a_p = vj - windVel[end]
    if all(abs.(v_a_p) .< 1e-3)
        Fd = SA[0.0, 0.0, 0.0]
    else
        v_a_p_t = dot(pj_unit, v_a_p) * pj_unit
        v_a_p_n = v_a_p - v_a_p_t
        Fd = (-0.5 * settings.rho * Ls * settings.d_tether * settings.cd_tether * 
            norm(v_a_p_n) * v_a_p_n )# particle drag
    end

    # All other segments and masses except for segment connected to the kite
    for ii = Ns:-1:2
        if ii == Ns
            FT = (mj+0.5*mj) * aj + FT - Fd + SA[0, 0, (mj+0.5*mj)*g]
        else
            FT = mj*aj + FT - Fd + SA[0, 0, mj*g]
        end

        FT_norm = norm(FT)
        l_i_1 = (FT_norm / (E*A) + 1) * Ls
        
        # Drag calculation
        v_a_p = vj - windVel[ii] # moving this up here

        Δpj = (l_i_1/FT_norm) * FT
        pj += Δpj
        vj = dot(v, p_unit) * p_unit + cross(omega_t, pj)
        aj = cross(omega_t, cross(omega_t, pj))
        
        if all(abs.(v_a_p) .< 1e-3)
            Fd = SA[0.0, 0.0, 0.0]
        else
            v_a_p_t = (dot(Δpj, v_a_p) / sum(abs2, Δpj)) * Δpj
            v_a_p_n = v_a_p - v_a_p_t;
            Fd = -0.5 * settings.rho * Ls * settings.d_tether * settings.cd_tether * norm(v_a_p_n) * v_a_p_n # particle drag
        end
    end

    T0 = (mj+0.5*mj) .* aj + FT - Fd + SA[0, 0, (mj+0.5*mj)*g]
    l_i_1 = (norm(T0)/(E*A) + 1) * Ls
    p0 = pj + l_i_1 .* (T0 / norm(T0))
    
    res .= p - p0
    nothing
end

stateVec = rand(SVector{3, Float64})
kitePos = SA_F64[100, 100, 300]
kiteVel = SA_F64[0, 0, 0]
windVel = rand(SVector{3, Float64}, 15)
tetherLength = 500.0
settings = Settings(1.225, SA_F64[0, 0, -9.806], 0.9, 4, 0.85, 500000)
res = zeros(3)

objFun!(res, stateVec, kitePos, kiteVel, windVel, tetherLength, settings)
@benchmark objFun!(res, stateVec, kitePos, kiteVel, windVel, tetherLength, settings)

This gives:

julia> include("src/bench_qsm2.jl")
BenchmarkTools.Trial: 10000 samples with 194 evaluations per sample.
 Range (min … max):  443.706 ns … 561.804 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     446.443 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   451.167 ns ±  13.504 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▆█▆▄▃▄▄▁    ▂▂▃▃▁▁                                     ▁ ▂▃  ▂
  █████████▅▇▇▇████████▇▇▇███▇▆▇▃▅▅▅▆▁▆▆▆▃▆▆▁▆▆▃▆▃▅▆▁▇▄▅▅▅█▆███ █
  444 ns        Histogram: log(frequency) by time        502 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Zero allocations! And 43 times faster than the original version.

And your code is much shorter and better readable than the AI generated version.

ufechner7 · March 14, 2025, 12:30pm

This function is a little bit faster, but not really worth it:

@inline @inbounds function norm(vec::SVector{3, Float64})
    sqrt(vec[1]*vec[1]+vec[2]*vec[2]+vec[3]*vec[3])
end

Topic		Replies	Views
Too many allocations? Performance	23	1676	February 18, 2021
Reducing allocations to optimize (to match Java speed) Performance	22	1341	December 10, 2018
Removing undesirable allocations in some functions Performance performance , memory-allocation	15	422	January 27, 2024
Reduce allocations for 3D Arrays General Usage question	10	401	November 2, 2022
Reducing the memory allocation in ODEs Performance question	11	1334	January 22, 2022

Reducing the number of allocations in a function

Related topics