Allocations and run time when running the same program twice together are orders of magnitude larger than when running it twice separately

I am relatively new to Julia. I am writing code to optimize an objective function that is called many times. The most important intermediate step is computing the function below, called recover_δ!, which is a fixed point iteration implemented using the squarem algorithm.

What I cannot understand is why running the function recover_δ! twice inside one function generates 81 times more allocations (6k vs. 504k) and takes 218 times longer (5.7 ms vs 1.2 s) than a single run! Here is what I mean.

Running the code twice separately

Running the function twice, in two separate benchmarks, looks like this:

δ_test = copy(δ_logit)
β_test1 = [-3.5;  -0.8; zeros(n_β-2)]
Xβ_init1 = X_nl * β_test1
time1 = @benchmark recover_δ!($δ_test, $Xβ_init1)

δ_test = copy(δ_logit)
β_test2 = [-2.0;  -0; zeros(n_β-2)]
Xβ_init2 = X_nl * β_test2
time2 = @benchmark recover_δ!($δ_test, $Xβ_init2)

And gives as output (for time1 and time2, respectively):

BenchmarkTools.Trial: 870 samples with 1 evaluation.
 Range (min … max):  3.544 ms … 14.373 ms  ┊ GC (min … max):  0.00% … 73.78%
 Time  (median):     4.273 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   5.743 ms ±  2.922 ms  ┊ GC (mean ± σ):  29.20% ± 26.51%

  █▇▄▂▃▂                                                      
  ███████▃▂▂▁▁▁▁▁▁▂▂▁▂▂▂▁▂▃▂▂▃▃▂▂▃▃▃▃▃▃▃▂▃▃▃▃▃▃▃▂▃▃▃▃▃▃▂▂▂▂▂ ▃
  3.54 ms        Histogram: frequency by time        13.2 ms <

 Memory estimate: 40.66 MiB, allocs estimate: 6138.


BenchmarkTools.Trial: 883 samples with 1 evaluation.
 Range (min … max):  3.528 ms … 15.513 ms  ┊ GC (min … max):  0.00% … 71.82%
 Time  (median):     4.208 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   5.655 ms ±  2.910 ms  ┊ GC (mean ± σ):  29.08% ± 26.45%

  █▇▅▅▆▆▅                            ▁▁▁    ▁  ▁              
  ███████▇▁▁▁▁▁▁▁▁▆▅▁▄▅▁▆▆▆▆▆▇▆▇█▇▇██████▆▆██▇███▅███▇▇▆▆▅▁▆ █
  3.53 ms      Histogram: log(frequency) by time     13.2 ms <

 Memory estimate: 40.66 MiB, allocs estimate: 6192.

Running the code twice together

On the other hand, if we define a function that runs it twice, from the same starting point, say:

Xβ_init1 = X_nl * β_test1
Xβ_init2 = X_nl * β_test2

function run_twice(Xβ_init1, Xβ_init2)

    δ_test = copy(δ_logit) # added to emphasize that they are starting from the same point
    recover_δ!(δ_test, Xβ_init1)

    δ_test = copy(δ_logit) # added to emphasize that they are starting from the same point
    recover_δ!(δ_test, Xβ_init2)

end

@benchmark run_twice($Xβ_init1, $Xβ_init2)

The above code produces massively different results.

BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.178 s …   1.272 s  ┊ GC (min … max): 18.20% … 22.65%
 Time  (median):     1.239 s              ┊ GC (median):    19.96%
 Time  (mean ± σ):   1.231 s ± 37.123 ms  ┊ GC (mean ± σ):  20.27% ±  2.11%

  █                    █               █        █         █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█ ▁
  1.18 s         Histogram: frequency by time        1.27 s <

 Memory estimate: 7.81 GiB, allocs estimate: 503622.

I am completely lost as to why running the same code twice inside one function can produce such different results from running it twice in two separate benchmarks. Running run_twice without the δ_test = copy(δ_logit) lines produces almost the same results as the above (the mean time and allocation estimates are virtually the same, if not higher).

I don't understand what I am doing wrong, or how to optimize the code further. The code for the function is below:

function recover_δ!(δ, Xβ)

    # Initialize variables for ForwardDiff
    len_δ = length(log_share)

    type_Xβ = eltype(Xβ)

    δk_new = zeros(type_Xβ, len_δ);
    δ1 = zeros(type_Xβ, len_δ);
    δ2 = zeros(type_Xβ, len_δ);
    δk = zeros(type_Xβ, len_δ);

    rk = zeros(type_Xβ, len_δ);
    vk = zeros(type_Xβ, len_δ);

    σk_agg = zeros(type_Xβ, len_δ);
    σ1_agg = zeros(type_Xβ, len_δ);

    σk_ind = similar(Xβ);
    σ1_ind = similar(Xβ);


    Threads.@threads for i in eachindex(mkt_starts_chunks)

        # Select observations of the full data
        idx        = mkt_starts_chunks[i]:mkt_ends_chunks[i]

        # Get IDs of things in the data
        id_alt     = @view id_fd_alt[idx]
        id_ind     = @view id_fd_ind[idx];
        id_dt      = id_alt[1]:id_alt[end];

        # Get slices to work on 
        wt_mkt     = @view wt_per_mkt[idx];
        Xβ_slice   = @view Xβ[idx]
        sh_slice   = @view log_share[id_dt];
        NoOut_slice = @view NoOut[id_dt];
        δ_slice    = @view δ[id_dt];

        # Get slices for δ auxiliary variables
        δk_slice = @view δk[id_dt];
        δk_new_slice = @view δk_new[id_dt];
        δ1_slice = @view δ1[id_dt];
        δ2_slice = @view δ2[id_dt][NoOut_slice];

        # Get slices of rk, vk
        rk_slice = @view rk[id_dt][NoOut_slice]
        vk_slice = @view vk[id_dt][NoOut_slice]

        # Get slices for σk, and σ1
        σk_agg_slice = @view σk_agg[id_dt]
        σ1_agg_slice = @view σ1_agg[id_dt]

        # Get slices of σk_ind and σ1_ind
        σk_ind_slice = @view σk_ind[idx]
        σ1_ind_slice = @view σ1_ind[idx]

        # Renormalize for sum
        id_alt     = id_alt .- id_alt[1] .+ 1;
        id_ind     = id_ind .- id_ind[1] .+ 1;

        # Get new values of delta
        squarem_δ!(δ_slice, sh_slice, Xβ_slice, id_alt, id_ind, wt_mkt, NoOut_slice, i, 
            rk_slice, vk_slice, σk_agg_slice, σ1_agg_slice, σk_ind_slice, σ1_ind_slice, 
            δk_new_slice, δ1_slice, δ2_slice, δk_slice)
    end

    return δ
end


function squarem_δ!(δ, log_share, Xβ, id_fd_alt, id_fd_ind, wt_per_mkt, NoOut, chunk_id,
     rk, vk, σk_agg, σ1_agg, σk_ind, σ1_ind, δk_new, δ1, δ2, δk)

    # Parameters for the squarem
    stepmin_α = 1.0;
    stepmax_α = 1.0;
    step_α = 4;

    # PRECOMPUTE PARAMETERS FOR SPEED
    log_market_share_NoOut = @view log_share[NoOut]
    
    # INITIALIZE VARIABLES
    type_Xβ = eltype(Xβ)
    δk .= type_Xβ.(δ);
    δ  .= δk;

    # Computing the additional views we need    
    δ_view = @view δ[NoOut];
    δ1_view = @view δ1[NoOut];    
    δk_view = @view δk[NoOut];
    δk_new_view = @view δk_new[NoOut]
    σ1_view = @view σ1_agg[NoOut];
    σk_view = @view σk_agg[NoOut];

    for i = 1:max_iter
        
        # Compute first iteration
        compute_model_σ!(σk_ind, δk, Xβ, id_fd_alt, id_fd_ind);
        wt_average!(σk_agg, σk_ind, wt_per_mkt, id_fd_alt); # need to allocate zero vector

        # View and replace elements for faster iteration
        δ1_view .= δk_view .+ log_market_share_NoOut .- log.(σk_view)
        
        # Compute second iteration
        compute_model_σ!(σ1_ind, δ1, Xβ, id_fd_alt, id_fd_ind);
        wt_average!(σ1_agg, σ1_ind, wt_per_mkt, id_fd_alt);

        # View and replace elements for faster iteration
        δ2 .= δ1_view .+ log_market_share_NoOut .- log.(σ1_view)
        
        # Compute step vectors
        rk .= δ1_view .- δk_view
        vk .= δ2 .- 2 .* δ1_view .+ δk_view
        
        # Calculate extrapolation factor
        norm_rk = norm(rk)
        norm_vk = norm(vk)
        
        if norm_vk < tol_norm || norm_rk < tol_norm
            # Markets have converged with one iteration  
            δ_view .= δ2 
            return δ
        else
            
            # Update needed
            α_candidate = norm_rk / norm_vk
            α = max(stepmin_α, min(stepmax_α, α_candidate))
            
            # Update the estimate
            δk_new_view .= δk_view .+ (2*α) .* rk .+ (α^2) .* vk

            # Compute the difference in estimates
            diff = maximum(abs.(δk_new .- δk));
            if diff < tol
                copyto!(δ, δk_new)
                return δ
            end

            if α == stepmax_α
                stepmax_α = stepmax_α * step_α
            end

        copyto!(δk, δk_new)
        fill!(σk_agg, zero(eltype(σk_agg)))
        fill!(σ1_agg, zero(eltype(σk_agg)))
        end
    end


    diff = maximum(abs.(σk_agg .- exp.(log_share)))
    @warn "Some markets did not converge in chunk $chunk_id, max diff in shares is $diff."
    copyto!(δ, δk_new)
    return δ

end

Any guidance would be greatly appreciated!


Hey and welcome!

Just from a quick glance over the code I would guess that you have global variables messing up the timings.

The first indicator is

function run_twice(Xβ_init1, Xβ_init2)
    δ_test = copy(δ_logit) #  <- δ_logit is a global variable here
    recover_δ!(δ_test, Xβ_init1)
    ...
end

Whereas

function run_twice(Xβ_init1, Xβ_init2, δ_logit)
    δ_test = copy(δ_logit) #  <- δ_logit is now a local argument
    recover_δ!(δ_test, Xβ_init1)
    ...
end

should avoid the performance pitfalls of global variables.

The same goes for the recover_δ! function. There, log_share, mkt_starts_chunks, NoOut, and possibly all your data (e.g. id_fd_alt) are taken from the global scope rather than passed into the function, as far as I can see. Right now, I cannot reproduce your code without knowing these variables :person_shrugging:.

I would recommend using a custom struct or NamedTuple to store all your information and access needed fields using Parameters.jl within each function.

my_simulation_parameters = (; 
    log_share = ...,# Your variable
    ... 
    δ = δ, # The other stuff   
)

function recover_δ!(nt::NamedTuple)
    # Unpack all the stuff you need
    @unpack δ, log_share, .... = nt 
    # Do stuff 
end

This is not intended to be running code, just an example :smiley:
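For what it's worth, a runnable toy version of the same pattern (the field names and the computation are placeholders, not your actual model):

using Parameters

params = (;
    log_share = rand(10),   # placeholder data
    δ = zeros(10),
)

function recover_δ_demo!(nt::NamedTuple)
    @unpack δ, log_share = nt   # pull out only what this function needs
    δ .= exp.(log_share)        # stand-in for the real computation
    return δ
end

recover_δ_demo!(params)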

Also, a lot of this is documented in the Performance Tips.
Next to @btime, use @code_warntype to check that all your function calls are type-stable.
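For instance, with the variables from the original post (output omitted here):

@code_warntype recover_δ!(δ_test, Xβ_init1)
# Red `Any` annotations in the output flag type instabilities, for example
# from accesses to the non-const globals log_share and mkt_starts_chunks.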

I am sure I missed some other improvements, but this should be a starting point.


Just to add something about the problem with global variables. Here are four ways to sum a vector:

A = rand(1000);
const constA = A;

function sumA()
    s = 0
    for i in eachindex(A)
        s += A[i]
    end
    return s
end

function sumconstA()
    s = zero(eltype(constA))
    for i in eachindex(constA)
        @inbounds s += constA[i]
    end
    return s
end

function sumarg(v)
    s = zero(eltype(v))
    for e in v
        s += e
    end
    return s
end

function optsumarg(v)
    s = zero(eltype(v))
    @simd for e in v
        s += e
    end
    return s
end

Here are the timings:

julia> @btime sumA()
  102.561 μs (3978 allocations: 77.77 KiB)
503.9558497442067

julia> @btime sumconstA()
  697.727 ns (0 allocations: 0 bytes)
503.9558497442067

julia> @btime sumarg($A)
  696.322 ns (0 allocations: 0 bytes)
503.9558497442067

julia> @btime optsumarg($A)
  69.328 ns (0 allocations: 0 bytes)
503.9558497442068

The fastest is almost 1500 times faster.

The problem with (non-constant) global variables is that the function cannot be compiled properly. When the compiler compiles sumA, it does not know what A is. Or rather, it does know, but sumA is compiled only once and must keep working no matter what A is rebound to later. Since A can change, sumA is compiled to work with any A, which means a lot of administrative work at run time, involving allocations.

On the other hand, sumconstA also uses a global variable, but it is a constant. It can't change; it will always be of type Vector{Float64}. So the compiler will compile the loop more efficiently.

The function sumarg takes an argument. Julia functions will be compiled once for each type of argument. So the type of v will be known at compile time. Just like sumconstA.

The function optsumarg additionally decorates the loop with @simd. It's a message to the compiler: “this loop can be run in any order, and you can use parallel instructions”. Which it does, increasing the speed considerably (the gain isn't always that large, if there is one at all).

We can even look at the actual machine instructions with commands like julia> @code_native sumarg(A). The loop is this:

.LBB0_5:                 
        vaddsd  xmm0, xmm0, qword ptr [rcx + 8*rdx]
        inc     rdx
        cmp     rax, rdx
        jne     .LBB0_5

The loop in optsumarg uses SIMD instructions, each of which adds four elements at a time, and some CPUs can execute more than one such instruction per clock cycle, so the assembly looks like this:

.LBB0_6: 
        vaddpd  ymm1, ymm1, ymmword ptr [rcx + 8*rsi]
        vaddpd  ymm0, ymm0, ymmword ptr [rcx + 8*rsi + 32]
        vaddpd  ymm2, ymm2, ymmword ptr [rcx + 8*rsi + 64]
        vaddpd  ymm3, ymm3, ymmword ptr [rcx + 8*rsi + 96]
        add     rsi, 16
        cmp     rdx, rsi
        jne     .LBB0_6

I don’t have the stomach to look at the instructions generated for sumA.


Thank you for your guidance and the help with this. I have tried to implement the suggestions you made and now:

  • Functions within the code are type-stable (if I understand correctly).
  • All variables are either defined within the functions or passed as function arguments.
  • I also added a few type declarations for things whose type shouldn't change (some variable types do change when used with, e.g., ForwardDiff).

Unfortunately, I am still having the same issue. I made up a mock example using simulated data that makes the whole thing fully reproducible (attached below). I can reproduce the high times and high memory allocations with the simulated data, so the issue seems to be code-related, not data-related. If I run the code twice in a single function, I get about 5x the allocations and 6.3x the memory estimate compared to running each code chunk separately.

Running the code separately

BenchmarkTools.Trial: 1161 samples with 1 evaluation.
 Range (min … max):  2.334 ms … 88.749 ms  ┊ GC (min … max):  0.00% … 96.73%
 Time  (median):     2.442 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   4.313 ms ± 11.421 ms  ┊ GC (mean ± σ):  41.36% ± 14.78%

  █                                                           
  █▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▅▁▅▅▆ ▇
  2.33 ms      Histogram: log(frequency) by time     77.1 ms <

 Memory estimate: 47.69 MiB, allocs estimate: 3627.


BenchmarkTools.Trial: 1123 samples with 1 evaluation.
 Range (min … max):  2.354 ms … 103.414 ms  ┊ GC (min … max):  0.00% … 97.11%
 Time  (median):     2.581 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   4.466 ms ±  11.945 ms  ┊ GC (mean ± σ):  39.25% ± 13.95%

  █▁                                                           
  ██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆▃▅▄ ▇
  2.35 ms      Histogram: log(frequency) by time      84.9 ms <

 Memory estimate: 47.69 MiB, allocs estimate: 3625. 

Running the code together

BenchmarkTools.Trial: 78 samples with 1 evaluation.
 Range (min … max):  36.885 ms … 237.903 ms  ┊ GC (min … max):  0.00% … 83.30%
 Time  (median):     39.616 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   64.304 ms ±  63.364 ms  ┊ GC (mean ± σ):  37.58% ± 27.80%

  █▂ ▁                                                          
  ████▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▄▄▄▁▄▇ ▁
  36.9 ms       Histogram: log(frequency) by time       238 ms <

 Memory estimate: 591.60 MiB, allocs estimate: 36606.

The fully reproducible code is:
problem_example.jl (2.6 KB)
functions_example.jl (9.3 KB)

I tried debugging even more, and I could not figure out why this is happening. Any guidance would be greatly appreciated.

In invert_shares_δ! in functions_example.jl, in the threaded loop, there are two lines that create a type instability. First, in the loop, there are these:

        id_alt     = @view id_fd_alt[idx]
        id_ind     = @view id_fd_ind[idx];

These make id_alt and id_ind views, i.e. of type SubArray.

At the end of the loop there are these:

        id_alt     = id_alt .- id_alt[1] .+ 1;
        id_ind     = id_ind .- id_ind[1] .+ 1;

These are reassignments (and allocations), making id_alt and id_ind of type Vector{Int}. Normally it's a bad thing to let a variable switch between two types. I tried giving the normalized indices new names _id_alt and _id_ind to feed into squarem_δ!, but the weird timing of run_twice remains, also when running single-threaded. I have even tried disabling the GC when benchmarking run_twice, to no avail (I have enough memory). It's weird. I'll get back to this; I have some business to attend to.

Hmm, I think the problem is this: invert_shares_δ! modifies δ_test in place. When you initialize it only once before the benchmark, only the first run of invert_shares_δ! inside the benchmark sees the right δ_test. But @benchmark runs it a thousand times, and every run except the first sees an already-modified δ_test. In the run_twice function, it is re-initialized every time.


Yes, this seems to be it. So @benchmark does not seem to be a good way to evaluate the performance of a fixed point algorithm like mine. Writing a wrapper function of the following form (passing the other parameters):

function evaluate(δ_in, Xβ, params)
    δ_out = copy(δ_in)
    invert_shares_δ!(δ_out, Xβ, params)
    return δ_out
end

Gives the “correct” benchmark performance:

BenchmarkTools.Trial: 151 samples with 1 evaluation.
 Range (min … max):  18.512 ms … 378.933 ms  ┊ GC (min … max):  0.00% … 90.32%
 Time  (median):     20.174 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   34.223 ms ±  68.247 ms  ┊ GC (mean ± σ):  39.90% ± 18.29%

  █▁                                                            
  ██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▄▅ ▄
  18.5 ms       Histogram: log(frequency) by time       372 ms <

 Memory estimate: 307.06 MiB, allocs estimate: 18968.

This is roughly half the time of running the code twice together (see my benchmark estimates above), so it makes sense.

Sad news—my code is slower than I thought! I guess I will have to make it faster. :sweat_smile:


There are a few things you could try that might help. Your allocation counts look really high, which is likely why the code is slower than ideal. If you know you always want views instead of copies, then instead of writing a dozen @view macros you can just use

@views function my_func(x)
   ...
   ...
end

Putting @views in front of the whole function turns every slicing operation in its body into a view: 1) you don't have to type @view a ton, and 2) it might catch copies that you missed.
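A minimal self-contained demo of what @views does (colsum_demo is a made-up function, not from the code above):

# Every slice in the body becomes a view, so the loop does not allocate copies.
@views function colsum_demo(A)
    s = zero(eltype(A))
    for j in axes(A, 2)
        col = A[:, j]   # a view thanks to @views, not a copy
        s += sum(col)
    end
    return s
end

colsum_demo(rand(100, 100))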

Also, fusing dots can sometimes lead to better performance, so try

first_id = id_alt[1]                # hoist: with @. the first element changes mid-broadcast
@. id_alt = id_alt - first_id + 1   # <-- this (in place, no allocation)
id_alt = id_alt .- id_alt[1] .+ 1   # <-- instead of this (allocates a new vector)

Dot fusion, though, will be a minor thing compared to getting rid of allocations. (Edit: just be careful with @., because everything after it becomes element-wise; if you need actual matrix math you can mess up your results.)
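A quick toy illustration of that caveat:

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]

A * B      # matrix product: [19 22; 43 50]
@. A * B   # element-wise product: [5 12; 21 32], a different result!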

Definitely run your compute_model_σ! and wt_average! functions by themselves and try to get them to 0 allocations since they are run repeatedly.

Expanding on this a little: squarem_δ! looks like it gets run length(mkt_starts_chunks) times in the threaded loop within invert_shares_δ!, and compute_model_σ! and wt_average! both get run twice per iteration in squarem_δ!.

So any allocations in compute_model_σ! and wt_average! can happen up to 2e5 * length(mkt_starts_chunks) times, since max_iter = 1e5 in the squarem_δ! loop.

Considering wt_average!:

# This version allocates because var_sum .* wt creates a new array
wt_average!(out, var_sum, wt, id) = accumarray!(out, id, var_sum .* wt)

# This version is allocation-free and runs ~15% faster on a 10-element array
# (note: it mutates var_sum)
function wt_average!(out, var_sum, wt, id)
    var_sum .*= wt
    accumarray!(out, id, var_sum)
end

Making the above change helped but didn't change much. The bigger issue is that compute_model_δ! uses the allocating accumarray instead of the in-place accumarray!, and a new array δ_μ is created, so that's hundreds of thousands of unnecessary array creations. Finding a way to move those out of the inner loops should help a lot.
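For reference, an in-place group sum in the spirit of accumarray! might look like this sketch (the actual accumarray! in the attached file may differ; assumed semantics: out[g] accumulates var[i] over all i with id[i] == g):

function accumarray_sketch!(out, id, var)
    fill!(out, zero(eltype(out)))    # reset the accumulator
    @inbounds for i in eachindex(id, var)
        out[id[i]] += var[i]         # group-wise sum, no allocations
    end
    return out
end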

It's awkward, but the setup and teardown phases can handle mutating methods. That section doesn't mention this directly, but also note that:

Defaults to BenchmarkTools.DEFAULT_PARAMETERS.evals = 1. If the function you study mutates its input, it is probably a good idea to set evals=1 manually.
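Applied to this thread, that might look like the following sketch (reusing names from the posts above; params stands for whatever data the function needs):

# setup re-creates δ before every sample, and evals=1 guarantees each
# evaluation starts from a freshly initialized δ.
@benchmark invert_shares_δ!(δ, $Xβ_init1, $params) setup=(δ = copy($δ_logit)) evals=1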


As remarked above, the first thing to eliminate is allocations. Allocations are particularly bad in threaded code. Sometimes allocations can be removed by using broadcasted assignments, e.g. with @. as suggested above. Sometimes one must write loops instead of “shorter” vector operations.

Sometimes one needs temporary arrays. I'm sure there are some packages for that around; I've just used a simple-minded method of storing vectors in task-local storage, retrieving and resize!-ing them to the needed size, and reshape-ing them if higher dimensions are needed. For this purpose I just use a macro:

"""                                                                               
    @tlscache type [constructor]                                                  
                                                                                  
Allocates or fetches task local memory.                                           
                                                                                  
Loops in parallel tasks requiring temporary                                       
memory can stress the garbage collector. If it's troublesome to pre-allocate, one 
can use a task local vector which is reused and resized.                          
This only allocates when the size is increased beyond earlier sizes.              
An alternative is to use [`Base.ScopedValues`](@ref). (v1.11 or later)            
                                                                                  
# Examples                                                                        
```julia                                                                          
julia> v = @tlscache Vector{Float64}                                              
julia> resize!(v, 16)                                                             
julia> b = @tlscache BitVector falses(12)                                         
```                                                                               
"""                                                                               
macro tlscache(type, makeit)                                                      
    sym = Expr(:quote, gensym("tls"))                                             
    quote                                                                         
        # type annotate return value, otherwise we get Any, and type instability. 
        get!(() -> $(esc(makeit)), task_local_storage(), ($sym,$(esc(type))))::$(esc(type))                                                                        
    end                                                                           
end                                                                               
                                                                                  
                                                                                  
macro tlscache(type)                                                              
    sym = Expr(:quote, gensym("tls"))                                             
    quote                                                                         
        get!(() -> $(esc(type))(), task_local_storage(), ($sym,$(esc(type))))::$(esc(type))                                                                        
    end                                                                           
end                                                                                    
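A sketch of how the macro might be used inside a threaded loop (the computation is a stand-in):

# Each task fetches and reuses its own buffer, so only the first use per
# task (or a later size increase) actually allocates.
function threaded_scaled_sums(chunks)
    out = zeros(length(chunks))
    Threads.@threads for i in eachindex(chunks)
        buf = @tlscache Vector{Float64}
        resize!(buf, length(chunks[i]))
        buf .= chunks[i] .* 2.0   # stand-in for real temporary work
        out[i] = sum(buf)
    end
    return out
end

threaded_scaled_sums([rand(1000) for _ in 1:32])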

Just providing a quick update to close this. It does seem like using @benchmark to evaluate an in-place fixed point iteration function, without restarting the initial values, is what generated the discrepancy in the benchmark results.

In summary, on the first sample the benchmarked code takes δ_test = copy(δ_logit) as the initial value and returns the fixed point in the same array. On subsequent samples, it starts from that final iterate and, since that is already the fixed point, returns it immediately (after verifying it is indeed a fixed point). If you think about it, this is expected behavior.
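A toy illustration of the effect (a made-up example: Newton iteration for sqrt(2)):

# After the first call, x already holds the fixed point, so every later
# benchmark sample converges immediately; that is what @benchmark ends up timing.
function toy_fixedpoint!(x)
    for _ in 1:10_000
        x_new = @. 0.5 * (x + 2 / x)   # Newton step toward sqrt(2)
        maximum(abs.(x_new .- x)) < 1e-12 && return copyto!(x, x_new)
        copyto!(x, x_new)
    end
    return x
end

x = fill(1.0, 10)
@benchmark toy_fixedpoint!($x)   # every sample after the first starts at the answer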

The only way to avoid this is to explicitly restart the function each time, as my wrapper function evaluate in this comment does. I have since modified my code substantially to avoid all the cumbersome views, and can confirm this is the case. I managed to decrease allocations by pre-allocating temporary arrays, making my code type-stable, and passing all variables into the functions.

Thank you all for all your help! Your suggestions improved the code speed by 10x (ish!).