Usage of arrays of static arrays

I was tempted to use arrays of static 3-element vectors for x, y, z coordinates that will be used in many geometrical calculations. To test performance, I wrote some code that calculates unit vectors along a curve and between each pair of points on the curve.

using StaticArrays, LinearAlgebra, BenchmarkTools

function unitvec!(uv::AbstractArray, v::AbstractArray)
    uv .= v ./ norm(v)
    return uv
end

function unitvec(v::AbstractVector)
    uv = v ./ norm(v)
    return uv
end

# Build one random curve in four layouts: points in rows, points in columns,
# and as Vectors of SVector/MVector.
function genCur(n)
    r = rand(n, 3)
    r[1, :] .= 0
    cumsum!(r, r, dims=1)

    rrow = copy(r)
    rcol = copy(transpose(r))
    rsv = copy(SVector{3}.(eachrow(r)))
    rmv = copy(MVector{3}.(eachrow(r)))
    return rrow, rcol, rsv, rmv
end

# Unit vectors between consecutive points on the curve.
function nextuv_row(rrow)
    uvrow = diff(rrow, dims=1)
    #unitvec!.(eachrow(uvrow), eachrow(uvrow))
    @inbounds @simd for ii in axes(uvrow, 1)
        @views unitvec!(uvrow[ii, :], uvrow[ii, :])
    end
    return uvrow
end

function nextuv_col(rcol)
    uvcol = diff(rcol, dims=2)
    #unitvec!.(eachcol(uvcol), eachcol(uvcol))
    #unitvec!.((view(uvcol,:,jj) for jj=axes(uvcol,2)),(view(uvcol,:,jj) for jj=axes(uvcol,2)))
    @inbounds @simd for jj in axes(uvcol, 2)
        @views unitvec!(uvcol[:, jj], uvcol[:, jj])
    end
    return uvcol
end

function nextuv_sv(rsv)
    drsv = diff(rsv, dims=1)
    uvsv = map(unitvec, drsv)
    return uvsv
end

function nextuv_mv(rmv)
    uvmv = diff(rmv, dims=1)
    #@views unitvec!.(uvmv, uvmv)
    return uvmv
end

# Unit vectors between every pair of points on the curve.
function pairwiseuv_row(rrow)
    uvprow = rrow .- rrow'[[CartesianIndex()], :, :]
    @inbounds for kk in axes(uvprow, 3), ii in axes(uvprow, 1)
        @views unitvec!(uvprow[ii, :, kk], uvprow[ii, :, kk])
    end
    return uvprow
end

function pairwiseuv_col(rcol)
    uvpcol = rcol .- rcol[:, [CartesianIndex()], :]
    @inbounds for kk in axes(uvpcol, 3), jj in axes(uvpcol, 2)
        @views unitvec!(uvpcol[:, jj, kk], uvpcol[:, jj, kk])
    end
    return uvpcol
end

function pairwiseuv_sv(rsv)
    drsv = rsv .- permutedims(rsv)
    #uvpsv = unitvec.(drsv)
    uvpsv = map(unitvec, drsv)
    return uvpsv
end

function pairwiseuv_mv(rmv)
    uvpmv = rmv .- permutedims(rmv)
    #@views unitvec!.(uvpmv, uvpmv)
    #@inbounds for ii in eachindex(uvpmv)
    #    unitvec!(view(uvpmv, ii), view(uvpmv, ii))
    #end
    return uvpmv
end

rrow, rcol, rsv, rmv = genCur(1000);

uvrow = nextuv_row(rrow);
uvcol = nextuv_col(rcol);
uvsv = nextuv_sv(rsv);
uvmv = nextuv_mv(rmv);

uvprow = pairwiseuv_row(rrow);
uvpcol = pairwiseuv_col(rcol);
uvpsv = pairwiseuv_sv(rsv);
uvpmv = pairwiseuv_mv(rmv);

The result is kind of, ehh, underwhelming?

julia> @benchmark $uvrow = nextuv_row($rrow)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  22.200 ΞΌs …  5.561 ms  β”Š GC (min … max): 0.00% … 99.37%
 Time  (median):     30.300 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   30.967 ΞΌs Β± 92.985 ΞΌs  β”Š GC (mean Β± Οƒ):  5.19% Β±  1.72%

  β–†β–‡β–„β–ƒ β–ƒβ–‚   β–β–β–β–…β–…β–†β–ˆβ–†β–…β–ƒβ–„β–„β–ƒβ–‚β–β–‚β–β– ▁▁                             β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–ƒβ–…β–„β–ƒβ–ƒβ–β–„β–„β–…β–…β–…β–„β–„β–…β–„β–…β–…β–„β–ƒβ–β–„β–†β–‡β–† β–ˆ
  22.2 ΞΌs      Histogram: log(frequency) by time      51.4 ΞΌs <

 Memory estimate: 23.52 KiB, allocs estimate: 2.

julia> @benchmark $uvcol = nextuv_col($rcol)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.200 ΞΌs …  6.258 ms  β”Š GC (min … max): 0.00% … 99.46%
 Time  (median):     31.300 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   31.198 ΞΌs Β± 88.420 ΞΌs  β”Š GC (mean Β± Οƒ):  4.80% Β±  1.72%

  ▂▆▇▄▁▂▂   β–‚β–‚            ▁▄▃ β–†β–ˆβ–‡β–„β–„β–„β–ƒβ–‚β–β–„β–ƒβ–             ▁ ▁    β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–†β–‚β–ƒβ–‡β–†β–…β–†β–ƒβ–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–†β–‡β–†β–†β–†β–ˆβ–ˆβ–‡β–ˆβ–‡β–†β–† β–ˆ
  23.2 ΞΌs      Histogram: log(frequency) by time      39.6 ΞΌs <

 Memory estimate: 23.52 KiB, allocs estimate: 2.

julia> @benchmark $uvsv = nextuv_sv($rsv)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):   5.660 ΞΌs …  1.598 ms  β”Š GC (min … max):  0.00% … 98.71%
 Time  (median):     21.520 ΞΌs              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   19.479 ΞΌs Β± 49.566 ΞΌs  β”Š GC (mean Β± Οƒ):  12.76% Β±  5.09%

  β–ˆβ–‡β–ƒ                      ▂▁▁▂▆▆▆▅▇▇▄▃▂▁▁                  ▁ β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–‡β–‡β–‡β–„β–„β–„β–ƒβ–†β–…β–„β–ƒβ–„β–…β–ƒβ–…β–„β–β–β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–†β–‡β–†β–‡β–‡β–…β–†β–†β–†β–†β–†β–†β–ˆβ–ˆ β–ˆ
  5.66 ΞΌs      Histogram: log(frequency) by time      36.4 ΞΌs <

 Memory estimate: 47.03 KiB, allocs estimate: 4.

julia> @benchmark $uvmv = nextuv_mv($rmv)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.700 ΞΌs …  2.052 ms  β”Š GC (min … max): 0.00% … 97.22%
 Time  (median):     12.400 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   16.273 ΞΌs Β± 44.583 ΞΌs  β”Š GC (mean Β± Οƒ):  6.03% Β±  2.20%

  β–„β–ˆβ–…β–‚β–‚β– β–ƒβ–‚ ▂▁▂▁                β–‚ ▃▅▂▁  ▁▂                    β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–†β–†β–„β–„β–…β–β–β–β–†β–ˆβ–…β–„β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–†β–…β–…β–…β–…β–…β–…β–…β–…β–…β–†β–… β–ˆ
  11.7 ΞΌs      Histogram: log(frequency) by time      32.7 ΞΌs <

 Memory estimate: 47.09 KiB, allocs estimate: 1001.

julia> @benchmark $uvprow = pairwiseuv_row($rrow)
BenchmarkTools.Trial: 181 samples with 1 evaluation.
 Range (min … max):  25.059 ms … 37.212 ms  β”Š GC (min … max): 0.00% … 25.63%
 Time  (median):     26.288 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   27.631 ms Β±  3.108 ms  β”Š GC (mean Β± Οƒ):  5.19% Β±  8.71%

  β–ˆβ–ˆβ–‡β–„β–…β–†β–†β–…β–„β–„β–†β–‡β–„β–ƒβ–ƒβ–‚β–‚β–„β–‚β–β–β–‚β–‚β–‚β–β–β–β–β–β–‚β–β–β–β–‚β–‚β–β–‚β–β–β–β–β–β–‚β–…β–ƒβ–ƒβ–β–ƒβ–ƒβ–ƒβ–ƒβ–„β–‚β–„β–ƒβ–β–β–β–‚ β–‚
  25.1 ms         Histogram: frequency by time        35.1 ms <

 Memory estimate: 22.91 MiB, allocs estimate: 5.

julia> @benchmark $uvpcol = pairwiseuv_col($rcol)
BenchmarkTools.Trial: 171 samples with 1 evaluation.
 Range (min … max):  26.860 ms … 39.529 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     27.741 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   29.267 ms Β±  3.223 ms  β”Š GC (mean Β± Οƒ):  4.95% Β± 8.35%

  β–ˆβ–ˆβ–…β–„β–†β–ƒβ–…β–„β–ƒβ–„β–…β–„β–‚β–„β–‚β–β–‚β–‚β–β–ƒβ–‚β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–…β–„β–ƒβ–ƒβ–‚β–ƒβ–ƒβ–‚β–β–ƒβ–β–‚β–β–β–β–ƒβ–‚β–‚β–β–β–‚ β–‚
  26.9 ms         Histogram: frequency by time          38 ms <

 Memory estimate: 22.91 MiB, allocs estimate: 5.

julia> @benchmark $uvpsv = pairwiseuv_sv($rsv)
BenchmarkTools.Trial: 348 samples with 1 evaluation.
 Range (min … max):   9.165 ms … 27.065 ms  β”Š GC (min … max):  0.00% … 25.63%
 Time  (median):     11.888 ms              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   14.368 ms Β±  3.707 ms  β”Š GC (mean Β± Οƒ):  20.11% Β± 19.27%

    ▁▁       β–ˆβ–‡β–ƒβ–                       β–ƒβ–„           β–‚β–…        
  β–„β–„β–ˆβ–ˆβ–‡β–ˆβ–†β–ˆβ–„β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–†β–β–„β–†β–†β–„β–β–β–„β–β–„β–β–†β–†β–†β–„β–„β–β–„β–‡β–ˆβ–ˆβ–ˆβ–‡β–†β–β–‡β–‡β–‡β–†β–„β–β–„β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–„β–†β–„ β–‡
  9.17 ms      Histogram: log(frequency) by time      21.3 ms <

 Memory estimate: 45.78 MiB, allocs estimate: 6.

julia> @benchmark $uvpmv = pairwiseuv_mv($rmv)
BenchmarkTools.Trial: 191 samples with 1 evaluation.
 Range (min … max):  13.722 ms … 49.025 ms  β”Š GC (min … max):  0.00% … 34.56%
 Time  (median):     21.945 ms              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   26.188 ms Β± 11.302 ms  β”Š GC (mean Β± Οƒ):  26.16% Β± 21.42%

  β–‡β–ˆ         β–ƒ  β–ƒ        
  β–ˆβ–ˆβ–β–…β–„β–†β–β–ƒβ–ƒβ–†β–β–ˆβ–„β–ƒβ–ˆβ–ƒβ–β–ƒβ–ƒβ–ƒβ–ƒβ–β–β–β–…β–„β–„β–ƒβ–β–ƒβ–β–β–β–ƒβ–ƒβ–…β–ƒβ–…β–ƒβ–β–β–ƒβ–ƒβ–ƒβ–„β–β–…β–ƒβ–„β–ƒβ–„β–†β–…β–β–ƒβ–„β–ƒβ–„β–ƒ β–ƒ
  13.7 ms         Histogram: frequency by time        47.4 ms <

 Memory estimate: 45.78 MiB, allocs estimate: 1000006.

Although, looking at the number of allocations, apparently the member S/MVectors were not mutated but rather replaced by new vectors, so this could be the reason for the lousy performance? I have tried to use @views, but this doesn’t seem to change the behavior.
Any pro tips?

Do you have a reference β€œspeed” that you think should be achieved but isn’t?

Part of it could be that some operations don’t really benefit from working on a Vector of SVectors. This seems to be the case with diff(): there is no advantage in terms of avoiding allocations, since you need to read from and write to a regular Vector in both cases. And the plain Matrix diff can be well optimized anyway.
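To illustrate: diff applied to a Vector of SVectors allocates one new output Vector for the result, just as it does for a Matrix (a small sketch; the values here are made up):

```julia
using StaticArrays

r = [SVector(0.0, 0.0, 0.0), SVector(1.0, 2.0, 2.0), SVector(3.0, 2.0, 2.0)]
dr = diff(r)  # one heap allocation for the output Vector, same as for a Matrix
# dr == [SVector(1.0, 2.0, 2.0), SVector(2.0, 0.0, 0.0)]
```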

Another thing is that, after looking more closely at your benchmarks, you are getting decent speedups for SVector.

  • nextuv_* goes from 22us to 5.7us, even though you create an extra unnecessary array in uvsv = map(unitvec,drsv), instead of doing it in-place, which you do for the regular matrix.
  • pairwiseuv_* goes from 25ms to 9ms, where you also don’t work in-place.

I read that small static arrays should be about an order of magnitude faster, so that’s what I hoped to achieve.

And the plain Matrix diff can be well optimized anyway.

Can you elaborate on this? Coming from MATLAB, I have β€œintrinsics are the fastest” printed in my mind.

Also, regarding in-place operations: apparently they’re impossible for SVectors.

julia> map(unitvec!,uvsv,uvsv)
ERROR: setindex!(::SVector{3, Float64}, value, ::Int) is not defined.
 Hint: Use `MArray` or `SizedArray` to create a mutable static array

My efforts to use MVectors, however, caused significant slowdowns; see the benchmark results for $uvpmv = pairwiseuv_mv($rmv) and $uvmv = nextuv_mv($rmv) above. It apparently introduced more allocations instead of fewer.

BTW, although a little off-topic: why is the speed for rrow and rcol essentially identical? The former stores xyz in each row and the latter in each column. I would think accessing by row should be a lot slower than by column.

In Julia, intrinsics (mostly) don’t exist. They’re just Julia functions that someone else wrote.

That’s correct. For an SVector you cannot replace individual entries β€œin place” because there isn’t a valid notion of what β€œin place” means for an SVector. This sounds bad at first but is actually a feature.

You can do β€œin place” operations on a Vector or MVector because it must be kept in memory somewhere (except in some special cases where escape analysis can prove otherwise), which means on the heap. Therefore, they have a place and β€œin place” is possible.

An MVector has to have a place in memory, so a Vector{<:MVector} is often worse than a Matrix: each MVector requires its own place in memory, and the Vector{<:MVector} contains pointers to each MVector.
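This difference is visible via isbitstype: isbits elements are stored inline in a Vector, while non-isbits elements (such as the mutable MVector) are stored as references:

```julia
using StaticArrays

isbitstype(SVector{3,Float64})  # true:  stored inline in a Vector{SVector{3,Float64}}
isbitstype(MVector{3,Float64})  # false: a Vector{MVector{3,Float64}} holds pointers
```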

But an SVector, being immutable, can exist wherever the compiler decides is best. Most of the time, they end up on the stack (much better than the heap) or in registers (even better than the stack). A Vector{<:SVector} will live on the heap, but any time an SVector is taken from it, it can be put wherever the compiler likes.

You can’t mutate an SVector, but you can replace it entirely with a new SVector. The compiler can β€œreplace” it however it wants, either by putting it in a new place or by overwriting the old place if it decides it doesn’t need the old one any more. This freedom makes it strictly more powerful than if it were required to do the operation in place and on the heap.

But SVector’s immutability does make them inconvenient to work with… or at least inconvenient to use interchangeably with Vector. Maybe someday MVector will catch up (this requires better compiler smarts) and you can have the best of both. Until then, SVector is usually much better for performance than MVector.

Note that v = setindex(v, x, i) can be used when v is an SVector, as essentially equivalent to v[i] = x. But in your example, just use v = unitvec(v) rather than unitvec!(v, v).
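A quick sketch of the non-mutating setindex (note that it returns a new SVector; only the binding is updated):

```julia
using StaticArrays

v = SVector(1.0, 2.0, 3.0)
v = setindex(v, 9.0, 2)  # rebinds v to a fresh SVector with entry 2 replaced
# v == SVector(1.0, 9.0, 3.0)
```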

For a benchmark (here I use normalize/normalize! which should behave identically to your unitvec functions except that normalize! only takes 1 argument instead of 2):

julia> using LinearAlgebra, StaticArrays, BenchmarkTools

julia> xm = eachcol(rand(3,100));

julia> xs = rand(SVector{3,Float64},100);

julia> @btime foreach(normalize!,$xm)
  1.640 ΞΌs (0 allocations: 0 bytes)

julia> @btime $xs .= normalize.($xs);
  224.176 ns (0 allocations: 0 bytes)

julia> @btime map!(normalize,$xs,$xs); # equivalent to the previous
  226.923 ns (0 allocations: 0 bytes)

Not in-place in the SVector, but in-place in the Vector of SVectors. You are creating a new Vector{SVector} for the normalization, and that’s not necessary.


So, does that mean replacing an SVector member in a Vector{SVector} is more performant than mutating part of a regular array?
Not coming from a CS background, I am still confused about the stack vs heap thing. However, it occurs to me that one can use SVector to force intermediate arrays into faster memory (the stack) inside a function? I often find r = ... to be slower than the equivalent x = ... y = ... z = ... for storing intermediate results. Is it because the compiler doesn’t know the size of arrays, unlike scalars, and cannot put them on the stack?

More question on SVectors:

  1. How many of them can you have? I read that the stack is very limited; what happens if you have a vector of 10^6 or even 10^7 SVectors?
  2. Is Vector{SVector} organized in a specific layout in memory? Is it faster to access it sequentially? Or for Matrix{SVector}, is it faster to access along a column than a row?
  3. Can one pass a Vector{SVector} to external FORTRAN libraries via ccall? Or does it have to be β€˜collapsed’ into regular arrays?

Maybe we refer to different things by intrinsics. But I thought the functions from base Julia ought to be already optimized, no?

Replacing an SVector member in a Vector{<:SVector} is basically the same cost as replacing a similar number of elements in a normal Vector.

An SVector{N,T} is, under the hood, represented by an NTuple{N,T}. An NTuple{N,T} is basically the same as N individual variables of type T; you can imagine it that way. Like tuples, SArrays with more than a few hundred elements will be increasingly hard on the compiler.
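You can see the tuple backing directly by converting (the internal field layout is an implementation detail of StaticArrays, so this is just for illustration):

```julia
using StaticArrays

v = SVector(1.0, 2.0, 3.0)
Tuple(v)  # the NTuple{3,Float64} behind the SVector: (1.0, 2.0, 3.0)
```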

A Vector{<:SVector} is stored on the heap like any Vector. This means that you can have as many SVectors inside one as your memory allows (just be careful that they aren’t too big, individually, or working with them might choke the compiler).

Any Array{T} stores each element of type T in a contiguous piece of memory, column-major (so that accessing along columns is usually better than along rows, which is better than increasingly higher dimensions).
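In practice, that means putting the first (row) index in the innermost loop when walking a Matrix. A minimal sketch (colmajor_sum is a made-up name for illustration):

```julia
# Column-major layout: make the first index the innermost loop so that
# consecutive iterations touch consecutive memory.
function colmajor_sum(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in axes(A, 2)      # columns outer
        for i in axes(A, 1)  # rows inner: contiguous access
            s += A[i, j]
        end
    end
    return s
end

colmajor_sum([1.0 2.0; 3.0 4.0])  # 10.0
```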

Because of the above fact about layout, if you have an Array{SVector{N,T}}, you can reinterpret it to get an array-like object (something nearly interchangeable with Array{T}). If you can’t pass the Array{<:SVector} to the library, the reinterpret of it should be fine. You can also reinterpret a normal Array{T} to an Array{SVector{N,T}} if you want.

Thank you.
I always thought an array of arrays, Array{AbstractArray}, was an array of pointers, so that the array itself is contiguous memory storing pointers, but each one points to a member that can be anywhere in memory. Good to know that it’s still in a contiguous block.

Can you elaborate more on the reinterpret? I can get it to work for Vector{SVector{N,T}} β†’ T, but not the other way around.

julia> reinterpret(Float64,rsv)
3000-element reinterpret(Float64, ::Vector{SVector{3, Float64}}):
julia> reinterpret(Vector{SVector{3,Float64}},rcol)
ERROR: ArgumentError: cannot reinterpret `Float64` as `Vector{SVector{3, Float64}}`, type `Vector{SVector{3, Float64}}` is not a bits type

I think you got it, but to be clear: each β€œelement” is stored contiguously. For isbitstype elements, this means the values. For most other values, this means a pointer to the element but that pointer could point to anywhere.

The type argument to reinterpret is the element type you’d like to have in the new array:

julia> reinterpret(SVector{3,Float64},rcol)
1Γ—1000 reinterpret(SVector{3, Float64}, ::Matrix{Float64})

julia> reinterpret(reshape,SVector{3,Float64},rcol)
1000-element reinterpret(reshape, SVector{3, Float64}, ::Matrix{Float64}) with eltype SVector{3, Float64}:

The reinterpret(reshape, ...) methods are useful for adding/removing a leading dimension (depending on the old versus new element type). See the help for reinterpret for more info.


For Base, they are functions that are collectively written by competent developers with an in-depth understanding of Julia internals, and thoroughly reviewed and vetted by the community of users.

They are generally well written, but they won’t be faster than well written code written by anyone else.

Yes. But two points may be relevant:

  • They don’t generally have any privileged access to some internal β€œmagic sauce” that makes them faster and better.
  • They are often more general, and include guardrails (e.g. checking properties like NaN/Inf, ensuring correct order of operations, guarding against numerical overflow, etc. etc.)

In a user-written function one can exploit privileged information about the problem or domain to get better performance than Base functions.