Reducing allocations to optimize (to match Java speed)

I designed and implemented a method whose core task is to evaluate some 2D integrals using Gauss-Legendre quadrature in Julia. For practical reasons, the final code has to be in Java; the Java translation runs about 2.5-3 times faster without any effort at optimization.

I spent some time examining the Julia implementation to find bottlenecks, and the most obvious sign of poor design/performance is that benchmarking shows a large number of allocations and a correspondingly large memory usage (about 1 GB, for a data array that is only 61 by 61 Float64 elements).

I’ve isolated a couple of the core functions to create an MWE; the times and memory usage are small in isolation, but the functions are called many times in the real application, and I think each call incurs repeated allocations.

For this reason, I’m looking for advice on how to address this. Any tips would be greatly appreciated. The key point is that the integrand function f alone incurs 40 allocations, and the integrator 243 - I’d like to reduce those numbers.

In the code examples below, I have a function `integrator` that calls a function `f` that evaluates the integrand. To simplify setting variables, I also have a file `setVals.jl` that sets the parameters needed to benchmark the functions. So, my steps are:

`include("setVals.jl")` to set the variables for benchmarking the functions. Then:

@benchmark  f($Mmatrix,$x,$yVec,$c,$d,$wInt)
BenchmarkTools.Trial: 
  memory estimate:  832 bytes
  allocs estimate:  40
  --------------
  minimum time:     765.275 ns (0.00% GC)
  median time:      802.927 ns (0.00% GC)
  mean time:        942.206 ns (10.27% GC)
  maximum time:     435.468 μs (99.77% GC)
  --------------
  samples:          10000
  evals/sample:     109
@benchmark integrator($a,$b,$y1,$y2,$x,$Mmatrix,$c,$d)
BenchmarkTools.Trial: 
  memory estimate:  5.77 KiB
  allocs estimate:  243
  --------------
  minimum time:     5.366 μs (0.00% GC)
  median time:      5.651 μs (0.00% GC)
  mean time:        7.423 μs (19.29% GC)
  maximum time:     8.354 ms (99.86% GC)
  --------------
  samples:          10000
  evals/sample:     6

Here are the contents of the file defining the two functions:

using StaticArrays


function integrator(a::Float64,b::Float64,y1::Float64,y2::Float64,x::Array{Float64,1},M::SMatrix{3,3,Float64},
                    c::Float64,d::Float64)
    limitDiff = (b-a)/2.0
    limitAvg = (b+a)/2.0
    
    integrand = map(theZ->f(M,x,[y1,y2,limitDiff*theZ+limitAvg],c,d,wInt), zInt)

    limitDiff*(wInt'*integrand)
end	 

function f(M::SMatrix{3,3,Float64},x::Array{Float64,1},y::Array{Float64,1},c::Float64,d::Float64,wInt::SVector{5,Float64})
    if ((x[1]-y[1])^2 + (x[2]-y[2])^2)<Base.eps(x[1])
        θ=0.0
    else
        θ=atan(((x[1]-y[1])^2 + (x[2]-y[2])^2)^0.5,  (x[3]-y[3]))
    end
    if abs( x[2]-y[2] )<Base.eps(x[1])
        ϕ=0.0
    else
        ϕ = atan(x[2]-y[2], x[1]-y[1])
    end

    γ = @SVector [sin(θ)cos(ϕ), sin(θ)sin(ϕ) , cos(θ)]

    theVal = (  (-2 + 15γ[1]^2 - 15γ[1]^4 + (3-4d)*(1-3γ[1]^2))*M[1,1]
                + (6γ[1]γ[2] - 3(3-4b)γ[1]γ[2] - 15γ[1]^3*γ[2])*M[1,2]
                + (6γ[1]γ[3] - 3(3-4b)γ[1]γ[3] - 15γ[1]^3*γ[3])*M[1,3]
                + (9γ[1]γ[2] - 15γ[1]^3*γ[2])*M[2,1]
                + (-1 + 3γ[1]^2 + 3γ[2]^2 - 15γ[1]^2*γ[2]^2)*M[2,2]
                + (3γ[2]γ[3] - 15γ[1]^2*γ[2]*γ[3])*M[2,3]
                + (9γ[1]γ[3] - 15γ[1]^3*γ[3])*M[3,1]
                + (3γ[2]γ[3] - 15γ[1]^2*γ[2]γ[3])*M[3,2]
                + (-1 + 3γ[1]^2 + 3γ[3]^2 - 15γ[1]^2*γ[3]^2)*M[3,3] )

    # avoid singularity where output point is on fracture (or too close).
    rCubed = r(x,y)^3
    if( rCubed < 10^-5 )
        rCubed=10^-5
    end

    return(theVal/(16*π*c*(1-d)*rCubed))
end

function r(x::Array{Float64,1}, ξ::Array{Float64,1})
    ((x-ξ)'*(x-ξ))^0.5
end

Here are the contents of the setVals.jl file:

using StaticArrays
using FastGaussQuadrature

a = -15.0
b = -14.0

x=[1.,1.,0.]

Mmatrix = @SMatrix [10.0e9 0 0; 0 0 0; 0 0 0]

zInt0,wInt0=gausslegendre(5)
zInt=@SVector [zInt0[i] for i in 1:5]
wInt=@SVector [wInt0[i] for i in 1:5]


c=1.3e9
d=0.25

y1=-15.0
y2=-2.

yVec=[y1; y2; -15.0]

The problem with function f is two-fold:

First, it references the global variable b, which causes type instability and allocations; see the Performance Tips section of the manual for more details.
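To see why, here is a minimal sketch (the names q, g_global, and g_arg are just illustrative, not from the code above): a function that reads a non-const global cannot be type-inferred, while passing the value as an argument restores inference:

using BenchmarkTools

q = 0.25              # non-const global: its type could change at any time
g_global() = 3 - 4q   # reads the global, so the result is boxed -> allocations
g_arg(q) = 3 - 4q     # identical computation, but q is now a local argument

@btime g_global()     # expect a few allocations per call
@btime g_arg(0.25)    # expect zero allocations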

Second, it calls the function r, which allocates temporary arrays during its computation. You could just rewrite it as a for loop to avoid the allocations, and you can tack on @simd to greatly speed it up as well (it makes r over 10x faster on my machine):

function r(x::Array{Float64,1}, ξ::Array{Float64,1})
   s = 0.0
   @inbounds @simd for i = 1:length(x)
      s += (x[i] - ξ[i])^2
   end
   sqrt(s)
end

I’ll leave integrator as an exercise for now :slight_smile:


I would suggest reading through Performance Tips · The Julia Language.


Oops - that was a typo - I simplified the original code and variable names to create the MWE, and mistyped that. Replacing it with d still led to an overhead when called in integrator.

I have read the section you link to, but still cannot figure out why.

Fair enough, though unrolling into loops sacrifices the readability of the vector products. Again, it’s not obvious to me why the compiler cannot achieve the same thing (except possibly that writing loops would fix the number of terms). This is what I’m trying to learn.

There are lots of opportunities for SIMD too, but since the Java code (2-3 times faster) doesn’t apply parallelization, that wouldn’t be a good comparison.

I have - as noted in my response to the previous commenter, I still fail to see what might be causing the issues (noting the typo that left an accidental global in f()).

You define zInt and wInt in the global scope, then access them in local scope. That’s a recipe for type instability.

Here’s my take on it (note: not tested for correctness). integrator runs in ~500 ns with no allocations.

using StaticArrays, Parameters, FastGaussQuadrature, LinearAlgebra

function integrator(params, dat, gl)
    integrand = map(z -> f(z, params, dat), gl.z)
    return params.limitDiff*(gl.w⋅integrand)
end

function f(z, params, dat)
    @unpack limitDiff, limitAvg, a, b, c, d = params
    @unpack x, y, M = dat

    y[3] = limitDiff*z + limitAvg
    dxy = x - y

    θ = (dxy[1]^2 + dxy[2]^2) < eps() ? 0.0 : atan(hypot(dxy[1], dxy[2]), dxy[3])
    ϕ = abs(dxy[2]) < eps() ? 0.0 : atan(dxy[2], dxy[1])
    γ = @SVector [sin(θ)cos(ϕ), sin(θ)sin(ϕ) , cos(θ)]

    theVal = (  (-2 + 15γ[1]^2 - 15γ[1]^4 + (3-4d)*(1-3γ[1]^2))*M[1,1]
                + (6γ[1]γ[2] - 3(3-4b)γ[1]γ[2] - 15γ[1]^3*γ[2])*M[1,2]
                + (6γ[1]γ[3] - 3(3-4b)γ[1]γ[3] - 15γ[1]^3*γ[3])*M[1,3]
                + (9γ[1]γ[2] - 15γ[1]^3*γ[2])*M[2,1]
                + (-1 + 3γ[1]^2 + 3γ[2]^2 - 15γ[1]^2*γ[2]^2)*M[2,2]
                + (3γ[2]γ[3] - 15γ[1]^2*γ[2]*γ[3])*M[2,3]
                + (9γ[1]γ[3] - 15γ[1]^3*γ[3])*M[3,1]
                + (3γ[2]γ[3] - 15γ[1]^2*γ[2]γ[3])*M[3,2]
                + (-1 + 3γ[1]^2 + 3γ[3]^2 - 15γ[1]^2*γ[3]^2)*M[3,3] )

    # avoid singularity where output point is on fracture (or too close).
    rCubed = clamp(norm(dxy)^3, 1e-5, Inf)

    return theVal/(16*π*c*(1-d)*rCubed)
end

@with_kw struct QuadratureParams{T}
    a::T
    b::T
    c::T
    d::T
    limitDiff::T = (b - a)/2.0
    limitAvg::T  = (b + a)/2.0
end

@with_kw struct QuadratureData{T}
    x::SArray{Tuple{3},T,1,3}
    y::MArray{Tuple{3},T,1,3}
    M::SArray{Tuple{3,3},T,2,9}
end
QuadratureData(x,y,M) =  QuadratureData(SVector{3}(x),
                                        MVector{3}(y),
                                        SMatrix{3,3}(M))

@with_kw struct GaussLegendre{N}
    z::SArray{Tuple{N},Float64,1,N}
    w::SArray{Tuple{N},Float64,1,N}
end
function GaussLegendre(N)
    z0, w0 = gausslegendre(N)
    return GaussLegendre(SVector{N}(z0), SVector{N}(w0))
end

x = [1., 1., 0.]
y = [-15.0, -2.0, -15.0]
M = [10.0e9 0 0; 0 0 0; 0 0 0]

p = QuadratureParams(a=-15.0, b=-14.0, c=1.3e9, d=0.25)
dat = QuadratureData(x, y, M)
gl = GaussLegendre(5)

using BenchmarkTools
@benchmark integrator($p, $dat, $gl)

Because your temporary arrays are of fixed size, you’re not writing to them, and you’re already using StaticArrays, make them SVectors instead. E.g.:

function integrator(a::Float64,b::Float64,y1::Float64,y2::Float64,x::Array{Float64,1},M::SMatrix{3,3,Float64},
                    c::Float64,d::Float64)
    limitDiff = (b-a)/2.0
    limitAvg = (b+a)/2.0
    
    integrand = map(theZ->f(M,x,SVector(y1,y2,limitDiff*theZ+limitAvg),c,d,wInt), zInt)

    limitDiff*(wInt'*integrand)
end	 

That should also allow Julia to elide bounds checks, because size is known at compile time.
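As a small illustrative check (sum3 is a hypothetical function, not from the code above): with an SVector, the length is part of the type, so literal indices can be proven in bounds at compile time:

using StaticArrays

sum3(v::SVector{3,Float64}) = v[1] + v[2] + v[3]  # indices 1:3 provably in bounds

# Inspecting @code_llvm sum3(SVector(1.0, 2.0, 3.0)) should show no
# bounds-check branches, without needing an explicit @inbounds.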

Making changes like that would be easier if you removed some of those type annotations. Note that they do not increase performance (but they may still help you catch bugs).

There are lots of opportunities for SIMD too, but since the Java code (2-3 times faster) doesn’t apply parallelization, that wouldn’t be a good comparison.

SIMD is within-core parallelization. You aren’t taking any resources away from other programs by not using it.
However, given that your x and y are each of length 3, @simd will actually make your function r run slower. It is too short to be vectorized.

EDIT:
If you do want to mutate y, like in stillyslalom’s code above, I’d recommend using MVectors, because the length is static.
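A minimal sketch of the idea (values are illustrative): an MVector can be written to in place, but it keeps its compile-time length, so it composes with SVector arithmetic without heap allocation:

using StaticArrays

y = MVector(0.0, 0.0, 0.0)   # fixed length 3, but elements are mutable
y[3] = 1.5                   # in-place update
x = SVector(1.0, 1.0, 0.0)
dxy = x - y                  # static-array arithmetic; result is a static vector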

EDIT:
Fix all the "Any"s:

julia> @code_warntype integrator(a,b,y1,y2,x,Mmatrix,c,d)
Body::Any
3 1 ─ %1 = (Base.sub_float)(b, a)::Float64                                                                       │╻ -
  │   %2 = (Base.div_float)(%1, 2.0)::Float64                                                                    │╻ /
4 │   %3 = (Base.add_float)(b, a)::Float64                                                                       │╻ +
  │   %4 = (Base.div_float)(%3, 2.0)::Float64                                                                    │╻ /
6 │   %5 = %new(getfield(Main, Symbol("##3#4")){Float64,Float64,Array{Float64,1},SArray{Tuple{3,3},Float64,2,9},Float64,Float64,Float64,Float64}, y1, y2, x, M, c, d, %2, %4)::getfield(Main, Symbol("##3#4")){Float64,Float64,Array{Float64,1},SArray{Tuple{3,3},Float64,2,9},Float64,Float64,Float64,Float64}
  │   %6 = (Main.map)(%5, Main.zInt)::Any                                                                        │ 
8 │   %7 = (Base.adjoint)(Main.wInt)::Any                                                                        │ 
  │   %8 = (%7 * %6)::Any                                                                                        │ 
  │   %9 = (%2 * %8)::Any                                                                                        │ 
  └──      return %9        

Like stillyslalom said, they’re caused by zInt and wInt. Two solutions:

  1. Make zInt and wInt constant, by defining them as const zInt = ... (see the sketch just below this list).
  2. Pass them in as arguments. This is the recommended approach.
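A minimal sketch of option 1, reusing the gausslegendre setup from setVals.jl (option 2 is what the revised integrator further down does):

using StaticArrays, FastGaussQuadrature

# const promises the compiler that the binding's type will not change,
# so functions that read these globals can still be inferred
zInt0, wInt0 = gausslegendre(5)
const zInt = SVector{5}(zInt0)
const wInt = SVector{5}(wInt0)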

More importantly:

@code_warntype f(Mmatrix,x,yVec,c,d,wInt)

You’ll notice a whole lot of problems in there. Fix them. It seems to start by referencing Main.b, i.e., another variable defined in global scope. Simply adding b as an argument to f goes a long way:

julia> @benchmark  f($Mmatrix,$x,$yVec,$c,$d,$wInt)
BenchmarkTools.Trial: 
  memory estimate:  832 bytes
  allocs estimate:  40
  --------------
  minimum time:     1.279 μs (0.00% GC)
  median time:      1.407 μs (0.00% GC)
  mean time:        2.438 μs (37.02% GC)
  maximum time:     8.723 ms (99.92% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark  f($Mmatrix,$x,$yVec,$c,$d,$wInt,$b)
BenchmarkTools.Trial: 
  memory estimate:  224 bytes
  allocs estimate:  2
  --------------
  minimum time:     486.533 ns (0.00% GC)
  median time:      493.477 ns (0.00% GC)
  mean time:        596.319 ns (13.69% GC)
  maximum time:     443.816 μs (99.79% GC)
  --------------
  samples:          10000
  evals/sample:     195

Hello,

also: @code_warntype flags some Int*Float64 multiplications in theVal, and the return type of f() is apparently Any because of that. This might be harmless in this case, though. Can someone with more knowledge comment?

Best Regards

Edit: Ah, OK, I see

so the Any’s are probably purely because of that, not the Int*Float64.

Yes, it was all because of Main.b.

Adding b to the arguments of f, wInt and zInt to the arguments of integrator, and making x and y SVectors, I now have:

julia> @benchmark integrator($a,$b,$y1,$y2,$svx,$Mmatrix,$c,$d,$wInt,$zInt)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.522 μs (0.00% GC)
  median time:      1.531 μs (0.00% GC)
  mean time:        1.541 μs (0.00% GC)
  maximum time:     4.437 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

This was the original version:

julia> @benchmark integrator($a,$b,$y1,$y2,$x,$Mmatrix,$c,$d)
BenchmarkTools.Trial: 
  memory estimate:  5.22 KiB
  allocs estimate:  228
  --------------
  minimum time:     8.720 μs (0.00% GC)
  median time:      9.271 μs (0.00% GC)
  mean time:        17.103 μs (32.98% GC)
  maximum time:     39.408 ms (99.89% GC)
  --------------
  samples:          10000
  evals/sample:     3

About 6x faster. Of course, this is only a small part of the entire program, so it doesn’t mean this code will be 2x faster than your Java version.
But I would go through all your code with @code_warntype and look for any type instabilities. Look out especially for inference failures that involve Main.VARIABLE_NAME; such a failure means VARIABLE_NAME is a global variable of unknown type. Making VARIABLE_NAME an argument to the function is the easiest fix.
Inference failures propagate. If C = foo(A,B) was not inferred, then anything depending on C, or on one of C’s descendants, will also not be inferred. Therefore, fix the problem at the root.
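A minimal sketch of that propagation (foo, bar, and n are hypothetical names): one untyped global poisons everything downstream of it:

n = 10                     # untyped global
foo(a, b) = a + b * n      # not inferred: return type is Any because of Main.n
bar(a, b) = 2 * foo(a, b)  # depends on foo's result, so bar is not inferred either

# @code_warntype bar(1.0, 2.0) reports Any throughout; making n const,
# or an argument of foo, fixes bar as well - hence: fix the root.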

You may also want to give Traceur.jl a try.
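Usage is a one-liner (a sketch, reusing the benchmark call from above): Traceur re-runs the call and warns about dynamic dispatch, untyped globals, and similar issues, pointing at the offending lines:

using Traceur
@trace integrator(a, b, y1, y2, x, Mmatrix, c, d)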

EDIT, for reference, the changes compared to the original code are minimal:

using StaticArrays
function integrator(a::Float64,b::Float64,y1::Float64,y2::Float64,x,M::SMatrix{3,3,Float64},
                    c::Float64,d::Float64,wInt,zInt)
    limitDiff = (b-a)/2.0
    limitAvg = (b+a)/2.0
           
    integrand = map(theZ->f(M,x,SVector(y1,y2,limitDiff*theZ+limitAvg),c,d,wInt,b), zInt)
   
    limitDiff*(wInt'*integrand)
end
function f(M::SMatrix{3,3,Float64},x,y,c::Float64,d::Float64,wInt::SVector{5,Float64}, b)
    if ((x[1]-y[1])^2 + (x[2]-y[2])^2)<Base.eps(x[1])
        θ=0.0
    else
        θ=atan(((x[1]-y[1])^2 + (x[2]-y[2])^2)^0.5,  (x[3]-y[3]))
    end
    if abs( x[2]-y[2] )<Base.eps(x[1])
        ϕ=0.0
    else
        ϕ = atan(x[2]-y[2], x[1]-y[1])
    end

    γ = @SVector [sin(θ)cos(ϕ), sin(θ)sin(ϕ) , cos(θ)]

    theVal = (  (-2 + 15γ[1]^2 - 15γ[1]^4 + (3-4d)*(1-3γ[1]^2))*M[1,1]
                + (6γ[1]γ[2] - 3(3-4b)γ[1]γ[2] - 15γ[1]^3*γ[2])*M[1,2]
                + (6γ[1]γ[3] - 3(3-4b)γ[1]γ[3] - 15γ[1]^3*γ[3])*M[1,3]
                + (9γ[1]γ[2] - 15γ[1]^3*γ[2])*M[2,1]
                + (-1 + 3γ[1]^2 + 3γ[2]^2 - 15γ[1]^2*γ[2]^2)*M[2,2]
                + (3γ[2]γ[3] - 15γ[1]^2*γ[2]*γ[3])*M[2,3]
                + (9γ[1]γ[3] - 15γ[1]^3*γ[3])*M[3,1]
                + (3γ[2]γ[3] - 15γ[1]^2*γ[2]γ[3])*M[3,2]
                + (-1 + 3γ[1]^2 + 3γ[3]^2 - 15γ[1]^2*γ[3]^2)*M[3,3] )

    # avoid singularity where output point is on fracture (or too close).
    rCubed = r(x,y)^3
    if( rCubed < 10^-5 )
        rCubed=10^-5
    end

    return(theVal/(16*π*c*(1-d)*rCubed))
end

function r(x, ξ)
    ((x-ξ)'*(x-ξ))^0.5
end

Wow what great feedback you got above! Making wInt and zInt non-global was what I referred to as the “exercise” btw :slight_smile:

IMO, the best way to increase readability is to give your function r a proper explanatory name. Personally, I wouldn’t need to look at how it’s implemented then.

Good point! I had assumed that in the real code they’d be larger than 3, but that might not be the case. SVector is a better choice then!

Sigh - I knew I shouldn’t have tried to create the MWE at the end of a long day :tired_face: - really, my purpose was to make sure they were not globals, but I just missed it when extracting the functions out of the “real” code… really… it’s true… :grimacing:

Thanks very much :smiley: - this is exactly what I was trying to figure out. I expected that the array in the map() call was creating issues, but I couldn’t figure out how to achieve what you demonstrated.

As noted in other responses, the globals were errors in extracting the MWE from other code, but this is a big help.

Thanks again.

Cool - thanks (trying, perhaps without success :smiley:, to again attribute the globals to late night absentmindedness as I definitely know that problem).

I am really curious now, though - your version gets 0 allocations, but Elrod shows an improved version of integrator() that still has 228. I’ll compare them.

I do note that you added @unpack on several lines - I will check the documentation to see what it does. Looking through your code, I see you’ve made changes that I assume are more idiomatic, better-performing Julia. This is exactly where I’m trying to ramp up, so having these various implementations to test against is great.

The point regarding the function name being unclear out of context is a good one (in context, with documentation, you’d know it was the distance in 3D Cartesian coordinates :grin:). For the same reason, I’d also never name a function just f in “real code”.

My improved integrator had 0 allocations.
It was the old version, which I benchmarked for comparison, that had 228.

The initial version of f I showed also allocated, because I wasn’t using SVectors yet, and the function r creates a new vector.

Creating SVectors does not allocate, so long as the length is known at compile time.
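A minimal sketch of the distinction (build_array and build_svector are illustrative names):

using StaticArrays, BenchmarkTools

build_array(y1, y2, z)   = [y1, y2, z]          # heap-allocates a Vector on every call
build_svector(y1, y2, z) = SVector(y1, y2, z)   # length 3 known at compile time, no heap

@btime build_array(1.0, 2.0, 3.0)    # expect 1 allocation
@btime build_svector(1.0, 2.0, 3.0)  # expect 0 allocations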

Edit:
@unpack comes from the Parameters.jl library.
It “unpacks” a struct, pulling out the named fields in one line, which is cleaner than foo.A, foo.B, foo.C.
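For example (the struct here is hypothetical):

using Parameters

struct Coeffs
    a::Float64
    b::Float64
    c::Float64
end

p = Coeffs(1.0, 2.0, 3.0)
@unpack a, c = p   # equivalent to: a = p.a; c = p.c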


Thanks for the clarification - I apologize for the misreading.

And thanks again for the clear and detailed help.

Small aside: I am (as you can tell) pretty new to Julia, and I really like the language. A challenge in gaining broader acceptance is illustrated by the comparison to the Java code - the latter “just works” at a higher speed, while I’m having to learn and adopt certain code styles to get the same performance from Julia. Just my opinion, but the documentation for the v1.0.2 language is really good and helpful; still, it would really help to have more tutorials (case histories) so that newer folks like me can more directly learn to write Julia code in a style with good performance.

Note: a point I’m ignoring in the previous paragraph is that I do think it was much faster and easier for me to develop the algorithm in Julia than it would have been in Java, which is also an important consideration…

I’m a big fan of Parameters.jl as a prophylactic against errant global variables. It’s much easier to pass a struct as a function argument and unpack its members as needed than it is to do the mental bookkeeping required by an alphabet-soup call signature. I should’ve added comments explaining the rationale behind the changes I made; apologies, it was late.


Definitely no need to apologize regarding comments - the sample was great because it showed me a much simpler way to accomplish my task. From that brief example, your point regarding the prophylactic is quite clear, and I see it immediately! In my own late-night code translation, I obviously made mistakes because of my alphabet soup. Parameters.jl will be my friend from now on.

Also, this is what I got from @stillyslalom’s version:

julia> @benchmark integrator($p, $dat, $gl)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.552 μs (0.00% GC)
  median time:      1.564 μs (0.00% GC)
  mean time:        1.571 μs (0.00% GC)
  maximum time:     3.060 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

which is the same as your version, once the SVector change has been made and the accidental-globals cleared up:

julia> @benchmark integrator($a,$b,$y1,$y2,$svx,$Mmatrix,$c,$d,$wInt,$zInt)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.522 μs (0.00% GC)
  median time:      1.531 μs (0.00% GC)
  mean time:        1.541 μs (0.00% GC)
  maximum time:     4.437 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

The only changes in the version I posted, versus yours, were adding the globals as arguments, deleting a few of the type annotations, adding SVector around the creation of y, and making x an SVector (which I named svx).
I think it’s useful to look at, because it shows that your own code was already basically there; you just needed the tiniest of changes to get a big performance boost.

stillyslalom’s version is nice to look at because it is more organized and produces a nicer API. Who wants to call a function with this mountain of arguments: (a, b, y1, y2, svx, Mmatrix, c, d, wInt, zInt), when you could instead use (p, dat, gl)?

Others can correct me if I’m wrong, but as I understand it, Julia’s philosophy is about getting the right answer first, while leaving you the option to optimize and reach C/C++ speeds.

This is sort of the opposite of Rust’s approach, where the compiler will do its best to help you write good code.
As someone who likes writing fast code, I would have an easier time with the latter. But I see that for a language aiming at R and Python users, the former is necessary.
Many of my colleagues do not care about performance, but would throw a fit and switch to R as soon as Julia complained about type instabilities, or anything of that sort. They want to spend the least amount of time possible thinking about or writing code. Julia is well positioned for an audience like that in the future.

Julia does have “opt-in” tools to make it easier to diagnose performance problems, specifically things like Traceur, ProfileView, and Base’s @code_warntype. I wouldn’t be surprised if we also get better tools for tracking where allocations come from some day.

But these are opt-in for the sake of the scientists who want to run an analysis and get back to the science ASAP. Opt-in means a slightly higher learning curve to familiarize yourself with these libraries, so perhaps that is where tutorials can come in.
How much comp-sci background could those assume?
I haven’t met many people who know what “SIMD” means.

Thanks for this useful summary. Leaving aside the accidental globals, I had worked to get the code implemented correctly, but the last details you mention are what I did not really understand yet. It points to one of the challenges I have in learning from reading documentation - in my overview of StaticArrays, I missed the MVector type, which the full code will require for the variable x. Yes, I should have read more closely, but since what I learned in my first pass was SVector, I thought I could only use it for vectors with static contents as well as dimensions! Oops. :disappointed_relieved: