# Usage of arrays of static arrays

I was tempted to use arrays of static 3-element vectors for x, y, z coordinates that will be used for many geometrical calculations. To test performance, I wrote some code that calculates unit vectors along a curve and between each pair of points on the curve.

```julia
using StaticArrays, LinearAlgebra, BenchmarkTools

function unitvec!(uv::AbstractArray, v::AbstractArray)
    uv .= v ./ norm(v)
    return uv
end

function unitvec(v::AbstractVector)
    uv = v ./ norm(v)
    return uv
end

function genCur(n)
    r = rand(n, 3)
    r[1, :] .= 0
    cumsum!(r, r, dims=1)

    rrow = copy(r)
    rcol = copy(transpose(r))
    rsv = copy(SVector{3}.(eachrow(r)))
    rmv = copy(MVector{3}.(eachrow(r)))
    return rrow, rcol, rsv, rmv
end

function nextuv_row(rrow)
    uvrow = diff(rrow, dims=1)
    #unitvec!.(eachrow(uvrow), eachrow(uvrow))
    @inbounds @simd for ii in axes(uvrow, 1)
        @views unitvec!(uvrow[ii, :], uvrow[ii, :])
    end
    return uvrow
end

function nextuv_col(rcol)
    uvcol = diff(rcol, dims=2)
    #unitvec!.(eachcol(uvcol), eachcol(uvcol))
    #unitvec!.((view(uvcol,:,jj) for jj=axes(uvcol,2)), (view(uvcol,:,jj) for jj=axes(uvcol,2)))
    @inbounds @simd for jj in axes(uvcol, 2)
        @views unitvec!(uvcol[:, jj], uvcol[:, jj])
    end
    return uvcol
end

function nextuv_sv(rsv)
    drsv = diff(rsv, dims=1)
    uvsv = map(unitvec, drsv)
    return uvsv
end

function nextuv_mv(rmv)
    uvmv = diff(rmv, dims=1)
    #@views unitvec!.(uvmv, uvmv)
    map(unitvec!, uvmv, uvmv)
    return uvmv
end

function pairwiseuv_row(rrow)
    uvprow = rrow .- rrow'[[CartesianIndex()], :, :]
    @inbounds for kk in axes(uvprow, 3), ii in axes(uvprow, 1)
        @views unitvec!(uvprow[ii, :, kk], uvprow[ii, :, kk])
    end
    return uvprow
end

function pairwiseuv_col(rcol)
    uvpcol = rcol .- rcol[:, [CartesianIndex()], :]
    @inbounds for kk in axes(uvpcol, 3), jj in axes(uvpcol, 2)
        @views unitvec!(uvpcol[:, jj, kk], uvpcol[:, jj, kk])
    end
    return uvpcol
end

function pairwiseuv_sv(rsv)
    drsv = rsv .- permutedims(rsv)
    #uvpsv = unitvec.(drsv)
    uvpsv = map(unitvec, drsv)
    return uvpsv
end

function pairwiseuv_mv(rmv)
    uvpmv = rmv .- permutedims(rmv)
    #@views unitvec!.(uvpmv, uvpmv)
    #@inbounds for ii in eachindex(uvpmv)
    #    unitvec!(view(uvpmv, ii), view(uvpmv, ii))
    #end
    map(unitvec!, uvpmv, uvpmv)
    return uvpmv
end

#----------------------------------------------------------------
rrow, rcol, rsv, rmv = genCur(1000);

uvrow = nextuv_row(rrow);
uvcol = nextuv_col(rcol);
uvsv = nextuv_sv(rsv);
uvmv = nextuv_mv(rmv);

uvprow = pairwiseuv_row(rrow);
uvpcol = pairwiseuv_col(rcol);
uvpsv = pairwiseuv_sv(rsv);
uvpmv = pairwiseuv_mv(rmv);
```

The result is kind of, ehh, underwhelming?

```
julia> @benchmark $uvrow = nextuv_row($rrow)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  22.200 μs …  5.561 ms  ┊ GC (min … max): 0.00% … 99.37%
 Time  (median):     30.300 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   30.967 μs ± 92.985 μs  ┊ GC (mean ± σ):  5.19% ±  1.72%

  22.2 μs      Histogram: log(frequency) by time      51.4 μs <

 Memory estimate: 23.52 KiB, allocs estimate: 2.

julia> @benchmark $uvcol = nextuv_col($rcol)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.200 μs …  6.258 ms  ┊ GC (min … max): 0.00% … 99.46%
 Time  (median):     31.300 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.198 μs ± 88.420 μs  ┊ GC (mean ± σ):  4.80% ±  1.72%

  23.2 μs      Histogram: log(frequency) by time      39.6 μs <

 Memory estimate: 23.52 KiB, allocs estimate: 2.

julia> @benchmark $uvsv = nextuv_sv($rsv)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):   5.660 μs …  1.598 ms  ┊ GC (min … max):  0.00% … 98.71%
 Time  (median):     21.520 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   19.479 μs ± 49.566 μs  ┊ GC (mean ± σ):  12.76% ±  5.09%

  5.66 μs      Histogram: log(frequency) by time      36.4 μs <

 Memory estimate: 47.03 KiB, allocs estimate: 4.

julia> @benchmark $uvmv = nextuv_mv($rmv)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.700 μs …  2.052 ms  ┊ GC (min … max): 0.00% … 97.22%
 Time  (median):     12.400 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.273 μs ± 44.583 μs  ┊ GC (mean ± σ):  6.03% ±  2.20%

  11.7 μs      Histogram: log(frequency) by time      32.7 μs <

 Memory estimate: 47.09 KiB, allocs estimate: 1001.

julia> @benchmark $uvprow = pairwiseuv_row($rrow)
BenchmarkTools.Trial: 181 samples with 1 evaluation.
 Range (min … max):  25.059 ms … 37.212 ms  ┊ GC (min … max): 0.00% … 25.63%
 Time  (median):     26.288 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.631 ms ±  3.108 ms  ┊ GC (mean ± σ):  5.19% ±  8.71%

  25.1 ms         Histogram: frequency by time        35.1 ms <

 Memory estimate: 22.91 MiB, allocs estimate: 5.

julia> @benchmark $uvpcol = pairwiseuv_col($rcol)
BenchmarkTools.Trial: 171 samples with 1 evaluation.
 Range (min … max):  26.860 ms … 39.529 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     27.741 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   29.267 ms ±  3.223 ms  ┊ GC (mean ± σ):  4.95% ± 8.35%

  26.9 ms         Histogram: frequency by time          38 ms <

 Memory estimate: 22.91 MiB, allocs estimate: 5.

julia> @benchmark $uvpsv = pairwiseuv_sv($rsv)
BenchmarkTools.Trial: 348 samples with 1 evaluation.
 Range (min … max):   9.165 ms … 27.065 ms  ┊ GC (min … max):  0.00% … 25.63%
 Time  (median):     11.888 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   14.368 ms ±  3.707 ms  ┊ GC (mean ± σ):  20.11% ± 19.27%

  9.17 ms      Histogram: log(frequency) by time      21.3 ms <

 Memory estimate: 45.78 MiB, allocs estimate: 6.

julia> @benchmark $uvpmv = pairwiseuv_mv($rmv)
BenchmarkTools.Trial: 191 samples with 1 evaluation.
 Range (min … max):  13.722 ms … 49.025 ms  ┊ GC (min … max):  0.00% … 34.56%
 Time  (median):     21.945 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   26.188 ms ± 11.302 ms  ┊ GC (mean ± σ):  26.16% ± 21.42%

  13.7 ms         Histogram: frequency by time        47.4 ms <

 Memory estimate: 45.78 MiB, allocs estimate: 1000006.
```

Judging from the number of allocations, the member S/MVectors were apparently not mutated but rather replaced by new vectors, so this could be the reason for the lousy performance? I have tried to use `@views` but this doesn't seem to change the behavior.
Any pro tips?

Do you have a reference "speed" that you think should be achieved but isn't?

Part of it could be that some operations don't really benefit from working on a Vector of SVectors. This seems to be the case with `diff()`. There are no advantages in terms of avoiding allocations: you need to read from and write to a regular `Vector` in both cases. And the plain `Matrix` `diff` can be well optimized anyway.

Another thing is that, after looking more closely at your benchmarks, you are getting decent speedups for `SVector`.

• `nextuv_*` goes from 22 μs to 5.7 μs, even though you create an extra unnecessary array in `uvsv = map(unitvec, drsv)` instead of doing it in-place, which you do for the regular matrix.
• `pairwiseuv_*` goes from 25 ms to 9 ms, where you also don't work in-place.

I read that small static arrays should be about an order of magnitude faster, so that's what I hope to achieve.

> And the plain `Matrix` `diff` can be well optimized anyway.

Can you elaborate on this? Coming from MATLAB, I have "intrinsics are the fastest" printed in my mind.

Also, regarding in-place operation, apparently it's impossible for `SVector`s:

```
julia> map(unitvec!, uvsv, uvsv)
ERROR: setindex!(::SVector{3, Float64}, value, ::Int) is not defined.
Hint: Use `MArray` or `SizedArray` to create a mutable static array
```

My efforts to use MVectors, however, caused significant slowdowns. See the benchmark results for `$uvpmv = pairwiseuv_mv($rmv)` and `$uvmv = nextuv_mv($rmv)` above. It apparently introduced more allocations instead of fewer.

BTW, although a little off-topic, why is the speed for `rrow` and `rcol` essentially identical? The former stores xyz in each row and the latter in each column. I would think accessing by row should be a lot slower than by column.

In Julia, intrinsics (mostly) don't exist. They're just Julia functions someone else wrote.

Thatβs correct. For an `SVector` you cannot replace individual entries βin placeβ because there isnβt a valid notion of what βin placeβ means for an `SVector`. This sounds bad at first but is actually a feature.

You can do βin placeβ operations on a `Vector` or `MVector` because it must be kept in memory somewhere (except in some special cases where escape analysis can prove otherwise), which means on the heap. Therefore, they have a place and βin placeβ is possible.

An `MVector` has to have a place in memory, so a `Vector{<:MVector}` is often worse than a `Matrix` because those `MVector`s each require their own place in memory and the `Vector{<:MVector}` contains pointers to each `MVector`.

But an `SVector`, being immutable, can exist wherever the compiler decides is best. Most of the time, they end up on the stack (much better than the heap) or in registers (even better than the stack). A `Vector{<:SVector}` will live on the heap, but any time an `SVector` is taken from it, it can be put wherever the compiler likes.
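
The storage difference can be checked directly with `isbitstype` (a quick check, not from the thread):

```julia
using StaticArrays

# An SVector is an immutable isbits type: a Vector{SVector{3,Float64}}
# stores all coordinates inline, in one contiguous block.
sv_inline = isbitstype(SVector{3,Float64})   # true

# An MVector is a mutable struct, so it is not isbits: a Vector{MVector}
# stores a pointer to a separately heap-allocated object per element.
mv_inline = isbitstype(MVector{3,Float64})   # false
```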

You canβt mutate a `SVector`, but you can replace it entirely with a new `SVector`. The compiler can βreplaceβ it however it wants, either by putting it in a new place or by overwriting the old place if it decides it doesnβt need the old one any more. This freedom makes it strictly more powerful than if it was required to do the operation in place and on the heap.

But `SVector`βs immutability does make them inconvenient to work withβ¦ or at least inconvenient to use interchangeably with `Vector`. Maybe someday `MVector` will catch up (this requires better compiler smarts) and you can have the best of both. Until then, `SVector` is usually much better for performance than `MVector`.

Note that `v = setindex(v, x, i)` can be used when `v` is an `SVector` as essentially equivalent to `v[i] = x`. But in your example, just use `v = unitvec(v)` rather than `unitvec!(v, v)`.
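
For instance, a minimal sketch of the non-mutating `setindex`:

```julia
using StaticArrays

v = SVector(1.0, 2.0, 3.0)
v = setindex(v, 9.0, 2)   # returns a brand-new SVector and rebinds the name v
# nothing was mutated, yet v now reads SVector(1.0, 9.0, 3.0)
```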

For a benchmark (here I use `normalize`/`normalize!` which should behave identically to your `unitvec` functions except that `normalize!` only takes 1 argument instead of 2):

```julia
julia> using LinearAlgebra, StaticArrays, BenchmarkTools

julia> xm = eachcol(rand(3,100));

julia> xs = rand(SVector{3,Float64},100);

julia> @btime foreach(normalize!, $xm)
  1.640 μs (0 allocations: 0 bytes)

julia> @btime $xs .= normalize.($xs);
  224.176 ns (0 allocations: 0 bytes)

julia> @btime map!(normalize, $xs, $xs);  # equivalent to the previous
  226.923 ns (0 allocations: 0 bytes)
```

Not in-place in the `SVector`, but in-place in the Vector of SVectors. You are creating a new `Vector{SVector}` for the normalization, and that's not necessary.
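
In other words, something along these lines (a sketch using `normalize` in place of `unitvec`):

```julia
using StaticArrays, LinearAlgebra

rsv = rand(SVector{3,Float64}, 1000)

uvsv = diff(rsv)             # one allocation: the result Vector{SVector}
map!(normalize, uvsv, uvsv)  # overwrite its slots in place; no second Vector
```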


So, does it mean that replacing an `SVector` element of a `Vector{SVector}` is more performant than mutating part of a regular array?
Not coming from a CS background, I am still confused about the stack vs heap thing. However, it occurs to me that one can use `SVector` to force intermediate arrays into faster memory (stack) inside a function? I often find `r = ...` to be slower than the equivalent `x = ... y = ... z = ...` for storing intermediate results. Is it because the compiler doesn't know the size of arrays, unlike scalars, and cannot put them on the stack?

More question on `SVector`s:

1. How many of them can you have? I read that the stack is very limited; what happens if you have a vector of 10^6 or even 10^7 `SVector`s?
2. Is `Vector{SVector}` organized in a specific layout in memory? Is it faster to access it sequentially? Or for `Matrix{SVector}`, is it faster to access along a column than a row?
3. Can one pass `Vector{SVector}` to external FORTRAN libraries via `ccall`? Or does it have to be "collapsed" into regular arrays?

Maybe we refer to different things by intrinsics. But I thought the functions from Base Julia ought to be already optimized, no?

Replacing an `SVector` element in a `Vector{<:SVector}` is basically the same cost as replacing a similar number of elements in a normal `Vector`.

An `SVector{N,T}` is, under the hood, represented by an `NTuple{N,T}`. An `NTuple{N,T}` is basically the same as `N` individual variables of type `T`; you can imagine it that way. Like tuples, `SArray`s with more than a few hundred elements will be increasingly hard on the compiler.

A `Vector{<:SVector}` is stored on the heap like any `Vector`. This means that you can have as many `SVector`s inside one as your memory allows (just be careful that they aren't too big individually, or working with them might choke the compiler).

Any `Array{T}` stores each element of type `T` in a contiguous piece of memory, column-major (so that accessing along columns is usually better than along rows, which is better than increasingly higher dimensions).
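
The column-major layout can be seen directly (a quick illustration):

```julia
A = [1 3 5;
     2 4 6]          # a 2×3 Matrix

# vec(A) exposes the memory order: columns are contiguous
col_major = vec(A)   # [1, 2, 3, 4, 5, 6]
```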

Because of the above fact about layout, if you have an `Array{SVector{N,T}}`, you can `reinterpret` it to get an array-like object (something nearly interchangeable with `Array{T}`). If you can't pass the `Array{<:SVector}` to the library, the `reinterpret` of it should be fine. You can also `reinterpret` a normal `Array{T}` to an `Array{SVector{N,T}}` if you want.

Thank you.
I always thought an array of arrays `Array{AbstractArray}` is an array of pointers, so that the array itself is in contiguous memory storing pointers, but each points to an element that can be anywhere in memory. Good to know that it's still one contiguous block.

Can you elaborate more on the `reinterpret`? I can get it to work for `Vector{SVector{N,T}}` → `T`, but not the other way around.

```julia
julia> reinterpret(Float64, rsv)
3000-element reinterpret(Float64, ::Vector{SVector{3, Float64}}):
...

julia> reinterpret(Vector{SVector{3,Float64}}, rcol)
ERROR: ArgumentError: cannot reinterpret `Float64` as `Vector{SVector{3, Float64}}`, type `Vector{SVector{3, Float64}}` is not a bits type
...
```

I think you got it, but to be clear: each "element" is stored contiguously. For `isbitstype` elements, this means the values themselves. For most other values, this means a pointer to the element, and that pointer could point anywhere.

The type argument to `reinterpret` is the element type you'd like to have in the new array:

```julia
julia> reinterpret(SVector{3,Float64}, rcol)
1×1000 reinterpret(SVector{3, Float64}, ::Matrix{Float64})

julia> reinterpret(reshape, SVector{3,Float64}, rcol)
1000-element reinterpret(reshape, SVector{3, Float64}, ::Matrix{Float64}) with eltype SVector{3, Float64}:
```

The `reinterpret(reshape, ...)` methods are useful for adding or removing a leading dimension (depending on the old versus new element type). See the help for `reinterpret` for more info.


For `Base`, they are functions collectively written by competent developers with an in-depth understanding of Julia internals, and thoroughly reviewed and vetted by the community of users.

They are generally well written, but they won't be faster than well-written code written by anyone else.

Yes. But two points may be relevant:

• They donβt generally have any privileged access to some internal βmagic sauceβ that makes them faster and better.
• They are often more general, and include guardrails (e.g. checking properties like NaN/Inf, ensuring correct order of operations, guarding against numerical overflow, etc. etc.)

In a user-written function one can exploit privileged information about the problem or domain to get better performance than Base functions.
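
As a hypothetical illustration: `hypot` from Base guards against overflow and underflow by rescaling its arguments, so if you know your inputs are moderate in magnitude (privileged information about your problem), a naive formula suffices and skips that work:

```julia
# hypot(x, y) rescales to avoid overflow/underflow in x^2 + y^2.
# If inputs are known to be well-scaled, the naive version is enough:
naive_hypot(x, y) = sqrt(x^2 + y^2)

naive_hypot(3.0, 4.0)      # 5.0, same as hypot(3.0, 4.0)
naive_hypot(1e300, 1e300)  # Inf -- overflows, while hypot stays finite here
```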

4 Likes