Usage of arrays of static arrays

I was tempted to use arrays of static 3-element vectors for x, y, z coordinates that will be used in many geometrical calculations. To test performance, I wrote some code that calculates unit vectors along a curve and between each pair of points on the curve.

using StaticArrays, LinearAlgebra, BenchmarkTools

function unitvec!(uv::AbstractArray, v::AbstractArray)
    uv .= v ./ norm(v)
    return uv
end

function unitvec(v::AbstractVector)
    uv = v ./ norm(v)
    return uv
end

# Build one random curve in four layouts: points in rows, points in columns,
# and as Vectors of SVector/MVector.
function genCur(n)
    r = rand(n, 3)
    r[1, :] .= 0
    cumsum!(r, r, dims=1)

    rrow = copy(r)
    rcol = copy(transpose(r))
    rsv = copy(SVector{3}.(eachrow(r)))
    rmv = copy(MVector{3}.(eachrow(r)))
    return rrow, rcol, rsv, rmv
end

# Unit vectors between consecutive points on the curve.
function nextuv_row(rrow)
    uvrow = diff(rrow, dims=1)
    #unitvec!.(eachrow(uvrow), eachrow(uvrow))
    @inbounds @simd for ii in axes(uvrow, 1)
        @views unitvec!(uvrow[ii, :], uvrow[ii, :])
    end
    return uvrow
end

function nextuv_col(rcol)
    uvcol = diff(rcol, dims=2)
    #unitvec!.(eachcol(uvcol), eachcol(uvcol))
    #unitvec!.((view(uvcol,:,jj) for jj=axes(uvcol,2)),(view(uvcol,:,jj) for jj=axes(uvcol,2)))
    @inbounds @simd for jj in axes(uvcol, 2)
        @views unitvec!(uvcol[:, jj], uvcol[:, jj])
    end
    return uvcol
end

function nextuv_sv(rsv)
    drsv = diff(rsv, dims=1)
    uvsv = map(unitvec, drsv)
    return uvsv
end

function nextuv_mv(rmv)
    uvmv = diff(rmv, dims=1)
    #@views unitvec!.(uvmv, uvmv)
    return uvmv
end

# Unit vectors between every pair of points on the curve.
function pairwiseuv_row(rrow)
    uvprow = rrow .- rrow'[[CartesianIndex()], :, :]
    @inbounds for kk in axes(uvprow, 3), ii in axes(uvprow, 1)
        @views unitvec!(uvprow[ii, :, kk], uvprow[ii, :, kk])
    end
    return uvprow
end

function pairwiseuv_col(rcol)
    uvpcol = rcol .- rcol[:, [CartesianIndex()], :]
    @inbounds for kk in axes(uvpcol, 3), jj in axes(uvpcol, 2)
        @views unitvec!(uvpcol[:, jj, kk], uvpcol[:, jj, kk])
    end
    return uvpcol
end

function pairwiseuv_sv(rsv)
    drsv = rsv .- permutedims(rsv)
    #uvpsv = unitvec.(drsv)
    uvpsv = map(unitvec, drsv)
    return uvpsv
end

function pairwiseuv_mv(rmv)
    uvpmv = rmv .- permutedims(rmv)
    #@views unitvec!.(uvpmv, uvpmv)
    #@inbounds for ii in eachindex(uvpmv)
    #    unitvec!(view(uvpmv, ii), view(uvpmv, ii))
    #end
    return uvpmv
end

rrow, rcol, rsv, rmv = genCur(1000);

uvrow = nextuv_row(rrow);
uvcol = nextuv_col(rcol);
uvsv = nextuv_sv(rsv);
uvmv = nextuv_mv(rmv);

uvprow = pairwiseuv_row(rrow);
uvpcol = pairwiseuv_col(rcol);
uvpsv = pairwiseuv_sv(rsv);
uvpmv = pairwiseuv_mv(rmv);

The result is kind of, ehh, underwhelming?

julia> @benchmark $uvrow = nextuv_row($rrow)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  22.200 ΞΌs …  5.561 ms  β”Š GC (min … max): 0.00% … 99.37%
 Time  (median):     30.300 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   30.967 ΞΌs Β± 92.985 ΞΌs  β”Š GC (mean Β± Οƒ):  5.19% Β±  1.72%

  β–†β–‡β–„β–ƒ β–ƒβ–‚   β–β–β–β–…β–…β–†β–ˆβ–†β–…β–ƒβ–„β–„β–ƒβ–‚β–β–‚β–β– ▁▁                             β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–ƒβ–…β–„β–ƒβ–ƒβ–β–„β–„β–…β–…β–…β–„β–„β–…β–„β–…β–…β–„β–ƒβ–β–„β–†β–‡β–† β–ˆ
  22.2 ΞΌs      Histogram: log(frequency) by time      51.4 ΞΌs <

 Memory estimate: 23.52 KiB, allocs estimate: 2.

julia> @benchmark $uvcol = nextuv_col($rcol)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.200 ΞΌs …  6.258 ms  β”Š GC (min … max): 0.00% … 99.46%
 Time  (median):     31.300 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   31.198 ΞΌs Β± 88.420 ΞΌs  β”Š GC (mean Β± Οƒ):  4.80% Β±  1.72%

  ▂▆▇▄▁▂▂   β–‚β–‚            ▁▄▃ β–†β–ˆβ–‡β–„β–„β–„β–ƒβ–‚β–β–„β–ƒβ–             ▁ ▁    β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–†β–‚β–ƒβ–‡β–†β–…β–†β–ƒβ–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–†β–‡β–†β–†β–†β–ˆβ–ˆβ–‡β–ˆβ–‡β–†β–† β–ˆ
  23.2 ΞΌs      Histogram: log(frequency) by time      39.6 ΞΌs <

 Memory estimate: 23.52 KiB, allocs estimate: 2.

julia> @benchmark $uvsv = nextuv_sv($rsv)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):   5.660 ΞΌs …  1.598 ms  β”Š GC (min … max):  0.00% … 98.71%
 Time  (median):     21.520 ΞΌs              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   19.479 ΞΌs Β± 49.566 ΞΌs  β”Š GC (mean Β± Οƒ):  12.76% Β±  5.09%

  β–ˆβ–‡β–ƒ                      ▂▁▁▂▆▆▆▅▇▇▄▃▂▁▁                  ▁ β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–‡β–‡β–‡β–„β–„β–„β–ƒβ–†β–…β–„β–ƒβ–„β–…β–ƒβ–…β–„β–β–β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–†β–‡β–†β–‡β–‡β–…β–†β–†β–†β–†β–†β–†β–ˆβ–ˆ β–ˆ
  5.66 ΞΌs      Histogram: log(frequency) by time      36.4 ΞΌs <

 Memory estimate: 47.03 KiB, allocs estimate: 4.

julia> @benchmark $uvmv = nextuv_mv($rmv)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.700 ΞΌs …  2.052 ms  β”Š GC (min … max): 0.00% … 97.22%
 Time  (median):     12.400 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   16.273 ΞΌs Β± 44.583 ΞΌs  β”Š GC (mean Β± Οƒ):  6.03% Β±  2.20%

  β–„β–ˆβ–…β–‚β–‚β– β–ƒβ–‚ ▂▁▂▁                β–‚ ▃▅▂▁  ▁▂                    β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–†β–†β–„β–„β–…β–β–β–β–†β–ˆβ–…β–„β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–†β–…β–…β–…β–…β–…β–…β–…β–…β–…β–†β–… β–ˆ
  11.7 ΞΌs      Histogram: log(frequency) by time      32.7 ΞΌs <

 Memory estimate: 47.09 KiB, allocs estimate: 1001.

julia> @benchmark $uvprow = pairwiseuv_row($rrow)
BenchmarkTools.Trial: 181 samples with 1 evaluation.
 Range (min … max):  25.059 ms … 37.212 ms  β”Š GC (min … max): 0.00% … 25.63%
 Time  (median):     26.288 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   27.631 ms Β±  3.108 ms  β”Š GC (mean Β± Οƒ):  5.19% Β±  8.71%

  β–ˆβ–ˆβ–‡β–„β–…β–†β–†β–…β–„β–„β–†β–‡β–„β–ƒβ–ƒβ–‚β–‚β–„β–‚β–β–β–‚β–‚β–‚β–β–β–β–β–β–‚β–β–β–β–‚β–‚β–β–‚β–β–β–β–β–β–‚β–…β–ƒβ–ƒβ–β–ƒβ–ƒβ–ƒβ–ƒβ–„β–‚β–„β–ƒβ–β–β–β–‚ β–‚
  25.1 ms         Histogram: frequency by time        35.1 ms <

 Memory estimate: 22.91 MiB, allocs estimate: 5.

julia> @benchmark $uvpcol = pairwiseuv_col($rcol)
BenchmarkTools.Trial: 171 samples with 1 evaluation.
 Range (min … max):  26.860 ms … 39.529 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     27.741 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   29.267 ms Β±  3.223 ms  β”Š GC (mean Β± Οƒ):  4.95% Β± 8.35%

  β–ˆβ–ˆβ–…β–„β–†β–ƒβ–…β–„β–ƒβ–„β–…β–„β–‚β–„β–‚β–β–‚β–‚β–β–ƒβ–‚β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–…β–„β–ƒβ–ƒβ–‚β–ƒβ–ƒβ–‚β–β–ƒβ–β–‚β–β–β–β–ƒβ–‚β–‚β–β–β–‚ β–‚
  26.9 ms         Histogram: frequency by time          38 ms <

 Memory estimate: 22.91 MiB, allocs estimate: 5.

julia> @benchmark $uvpsv = pairwiseuv_sv($rsv)
BenchmarkTools.Trial: 348 samples with 1 evaluation.
 Range (min … max):   9.165 ms … 27.065 ms  β”Š GC (min … max):  0.00% … 25.63%
 Time  (median):     11.888 ms              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   14.368 ms Β±  3.707 ms  β”Š GC (mean Β± Οƒ):  20.11% Β± 19.27%

    ▁▁       β–ˆβ–‡β–ƒβ–                       β–ƒβ–„           β–‚β–…        
  β–„β–„β–ˆβ–ˆβ–‡β–ˆβ–†β–ˆβ–„β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–†β–β–„β–†β–†β–„β–β–β–„β–β–„β–β–†β–†β–†β–„β–„β–β–„β–‡β–ˆβ–ˆβ–ˆβ–‡β–†β–β–‡β–‡β–‡β–†β–„β–β–„β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–„β–†β–„ β–‡
  9.17 ms      Histogram: log(frequency) by time      21.3 ms <

 Memory estimate: 45.78 MiB, allocs estimate: 6.

julia> @benchmark $uvpmv = pairwiseuv_mv($rmv)
BenchmarkTools.Trial: 191 samples with 1 evaluation.
 Range (min … max):  13.722 ms … 49.025 ms  β”Š GC (min … max):  0.00% … 34.56%
 Time  (median):     21.945 ms              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   26.188 ms Β± 11.302 ms  β”Š GC (mean Β± Οƒ):  26.16% Β± 21.42%

  β–‡β–ˆ         β–ƒ  β–ƒ        
  β–ˆβ–ˆβ–β–…β–„β–†β–β–ƒβ–ƒβ–†β–β–ˆβ–„β–ƒβ–ˆβ–ƒβ–β–ƒβ–ƒβ–ƒβ–ƒβ–β–β–β–…β–„β–„β–ƒβ–β–ƒβ–β–β–β–ƒβ–ƒβ–…β–ƒβ–…β–ƒβ–β–β–ƒβ–ƒβ–ƒβ–„β–β–…β–ƒβ–„β–ƒβ–„β–†β–…β–β–ƒβ–„β–ƒβ–„β–ƒ β–ƒ
  13.7 ms         Histogram: frequency by time        47.4 ms <

 Memory estimate: 45.78 MiB, allocs estimate: 1000006.

Although, looking at the number of allocations, apparently the member S/MVectors were not mutated but rather replaced by new vectors, so this could be the reason for the lousy performance? I have tried to use @views, but this doesn’t seem to change the behavior.
Any pro tips?

Do you have a reference β€œspeed” that you think should be achieved but isn’t?

Part of it could be that some operations don’t really benefit from working on a Vector of SVectors. This seems to be the case with diff(): there is no advantage in terms of avoiding allocations, since you need to read from and write to a regular Vector in both cases. And the plain Matrix diff can be well optimized anyway.
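To illustrate: diff applied to a Vector of SVectors allocates one new output Vector for the result, just as it does for a Matrix (a small sketch; the values here are made up):

```julia
using StaticArrays

r = [SVector(0.0, 0.0, 0.0), SVector(1.0, 2.0, 2.0), SVector(3.0, 2.0, 2.0)]
dr = diff(r)  # one heap allocation for the output Vector, same as for a Matrix
# dr == [SVector(1.0, 2.0, 2.0), SVector(2.0, 0.0, 0.0)]
```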

Another thing is that, after looking more closely at your benchmarks, you are getting decent speedups for SVector.

  • nextuv_* goes from 22us to 5.7us, even though you create an extra unnecessary array in uvsv = map(unitvec,drsv), instead of doing it in-place, which you do for the regular matrix.
  • pairwiseuv_* goes from 25ms to 9ms, where you also don’t work in-place.

I read that small static arrays should be about an order of magnitude faster, so that’s what I hoped to achieve.

And the plain Matrix diff can be well optimized anyway.

Can you elaborate on this? Coming from MATLAB, I have β€œintrinsics are the fastest” printed in my mind.

Also, regarding in-place operations: apparently they’re impossible for SVectors.

julia> map(unitvec!,uvsv,uvsv)
ERROR: setindex!(::SVector{3, Float64}, value, ::Int) is not defined.
 Hint: Use `MArray` or `SizedArray` to create a mutable static array

My efforts to use MVectors, however, caused significant slowdowns; see the benchmark results for $uvpmv = pairwiseuv_mv($rmv) and $uvmv = nextuv_mv($rmv) above. It apparently introduced more allocations instead of fewer.

BTW, although a little off-topic: why is the speed for rrow and rcol essentially identical? The former stores xyz in each row and the latter in each column. I would think accessing by row should be a lot slower than by column.

In Julia, intrinsics (mostly) don’t exist. They’re just Julia functions that someone else wrote.

That’s correct. For an SVector you cannot replace individual entries β€œin place” because there isn’t a valid notion of what β€œin place” means for an SVector. This sounds bad at first but is actually a feature.

You can do β€œin place” operations on a Vector or MVector because it must be kept in memory somewhere (except in some special cases where escape analysis can prove otherwise), which means on the heap. Therefore, they have a place and β€œin place” is possible.

An MVector has to have a place in memory, so a Vector{<:MVector} is often worse than a Matrix: each MVector requires its own place in memory, and the Vector{<:MVector} contains pointers to each MVector.
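This difference is visible via isbitstype: isbits elements are stored inline in a Vector, while non-isbits elements (such as the mutable MVector) are stored as references:

```julia
using StaticArrays

isbitstype(SVector{3,Float64})  # true:  stored inline in a Vector{SVector{3,Float64}}
isbitstype(MVector{3,Float64})  # false: a Vector{MVector{3,Float64}} holds pointers
```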

But an SVector, being immutable, can exist wherever the compiler decides is best. Most of the time, they end up on the stack (much better than the heap) or in registers (even better than the stack). A Vector{<:SVector} will live on the heap, but any time an SVector is taken from it, it can be put wherever the compiler likes.

You can’t mutate an SVector, but you can replace it entirely with a new SVector. The compiler can β€œreplace” it however it wants, either by putting it in a new place or by overwriting the old place if it decides it doesn’t need the old one any more. This freedom makes it strictly more powerful than if it were required to do the operation in place and on the heap.

But SVector’s immutability does make them inconvenient to work with… or at least inconvenient to use interchangeably with Vector. Maybe someday MVector will catch up (this requires better compiler smarts) and you can have the best of both. Until then, SVector is usually much better for performance than MVector.

Note that v = setindex(v, x, i) can be used when v is an SVector, as essentially equivalent to v[i] = x. But in your example, just use v = unitvec(v) rather than unitvec!(v, v).
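A quick sketch of the non-mutating setindex (note that it returns a new SVector; only the binding is updated):

```julia
using StaticArrays

v = SVector(1.0, 2.0, 3.0)
v = setindex(v, 9.0, 2)  # rebinds v to a fresh SVector with entry 2 replaced
# v == SVector(1.0, 9.0, 3.0)
```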

For a benchmark (here I use normalize/normalize! which should behave identically to your unitvec functions except that normalize! only takes 1 argument instead of 2):

julia> using LinearAlgebra, StaticArrays, BenchmarkTools

julia> xm = eachcol(rand(3,100));

julia> xs = rand(SVector{3,Float64},100);

julia> @btime foreach(normalize!,$xm)
  1.640 ΞΌs (0 allocations: 0 bytes)

julia> @btime $xs .= normalize.($xs);
  224.176 ns (0 allocations: 0 bytes)

julia> @btime map!(normalize,$xs,$xs); # equivalent to the previous
  226.923 ns (0 allocations: 0 bytes)

Not in-place in the SVector, but in-place in the Vector of SVectors. You are creating a new Vector{SVector} for the normalization, and that’s not necessary.


So, does that mean replacing an SVector member in a Vector{SVector} is more performant than mutating part of a regular array?
Not coming from a CS background, I am still confused about the stack vs heap thing. However, it occurs to me that one can use SVector to force intermediate arrays into faster memory (the stack) inside a function? I often find r = ... to be slower than the equivalent x = ... y = ... z = ... for storing intermediate results. Is it because the compiler doesn’t know the size of arrays, unlike scalars, and cannot put them on the stack?

More question on SVectors:

  1. How many of them can you have? I read that the stack is very limited; what happens if you have a vector of 10^6 or even 10^7 SVectors?
  2. Is Vector{SVector} organized in a specific layout in memory? Is it faster to access it sequentially? Or for Matrix{SVector}, is it faster to access along a column than a row?
  3. Can one pass a Vector{SVector} to external FORTRAN libraries via ccall? Or does it have to be β€˜collapsed’ into regular arrays?

Maybe we refer to different things by intrinsics. But I thought the functions from base Julia ought to be already optimized, no?

Replacing an SVector member in a Vector{<:SVector} is basically the same cost as replacing a similar number of elements in a normal Vector.

An SVector{N,T} is, under the hood, represented by an NTuple{N,T}. An NTuple{N,T} is basically the same as N individual variables of type T; you can imagine it that way. Like tuples, SArrays with more than a few hundred elements will be increasingly hard on the compiler.
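You can see the tuple backing directly by converting (the internal field layout is an implementation detail of StaticArrays, so this is just for illustration):

```julia
using StaticArrays

v = SVector(1.0, 2.0, 3.0)
Tuple(v)  # the NTuple{3,Float64} behind the SVector: (1.0, 2.0, 3.0)
```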

A Vector{<:SVector} is stored on the heap like any Vector. This means that you can have as many SVectors inside one as your memory allows (just be careful that they aren’t too big, individually, or working with them might choke the compiler).

Any Array{T} stores each element of type T in a contiguous piece of memory, column-major (so that accessing along columns is usually better than along rows, which is better than increasingly higher dimensions).
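In practice, that means putting the first (row) index in the innermost loop when walking a Matrix. A minimal sketch (colmajor_sum is a made-up name for illustration):

```julia
# Column-major layout: make the first index the innermost loop so that
# consecutive iterations touch consecutive memory.
function colmajor_sum(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in axes(A, 2)      # columns outer
        for i in axes(A, 1)  # rows inner: contiguous access
            s += A[i, j]
        end
    end
    return s
end

colmajor_sum([1.0 2.0; 3.0 4.0])  # 10.0
```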

Because of the above fact about layout, if you have an Array{SVector{N,T}}, you can reinterpret it to get an array-like object (something nearly interchangeable with Array{T}). If you can’t pass the Array{<:SVector} to the library, the reinterpret of it should be fine. You can also reinterpret a normal Array{T} to an Array{SVector{N,T}} if you want.

Thank you.
I always thought an array of arrays, Array{AbstractArray}, was an array of pointers, so that the array itself is contiguous memory storing pointers, but each one points to a member that can be anywhere in memory. Good to know that it’s still in a contiguous block.

Can you elaborate more on the reinterpret? I can get it to work for Vector{SVector{N,T}} β†’ T, but not the other way around.

julia> reinterpret(Float64,rsv)
3000-element reinterpret(Float64, ::Vector{SVector{3, Float64}}):
julia> reinterpret(Vector{SVector{3,Float64}},rcol)
ERROR: ArgumentError: cannot reinterpret `Float64` as `Vector{SVector{3, Float64}}`, type `Vector{SVector{3, Float64}}` is not a bits type

I think you got it, but to be clear: each β€œelement” is stored contiguously. For isbitstype elements, this means the values. For most other values, this means a pointer to the element but that pointer could point to anywhere.

The type argument to reinterpret is the element type you’d like to have in the new array:

julia> reinterpret(SVector{3,Float64},rcol)
1Γ—1000 reinterpret(SVector{3, Float64}, ::Matrix{Float64})

julia> reinterpret(reshape,SVector{3,Float64},rcol)
1000-element reinterpret(reshape, SVector{3, Float64}, ::Matrix{Float64}) with eltype SVector{3, Float64}:

The reinterpret(reshape, ...) methods are useful for adding/removing a leading dimension (depending on the old versus new element type). See the help for reinterpret for more info.


For Base, they are functions that are collectively written by competent developers with an in-depth understanding of Julia internals, and thoroughly reviewed and vetted by the community of users.

They are generally well written, but they won’t be faster than well written code written by anyone else.

Yes. But two points may be relevant:

  • They don’t generally have any privileged access to some internal β€œmagic sauce” that makes them faster and better.
  • They are often more general, and include guardrails (e.g. checking properties like NaN/Inf, ensuring correct order of operations, guarding against numerical overflow, etc. etc.)

In a user-written function one can exploit privileged information about the problem or domain to get better performance than Base functions.