Flattening view of `Vector{SVector{N}}`

Suppose that v is a Vector of SVector (a concrete type), e.g.

using StaticArrays
v = [rand(SVector{3}) for _ in 1:10]

Is there a package that implements a non-allocating view equivalent to reduce(vcat, v)::Vector?

Note that

  1. Iterators.flatten kind of does this, but it is not special-cased for SVectors, so the size calculation walks through all the elements.
  2. I can use vec(SplitApplyCombine.combinedimsview(v)), but again that does not specialize on SVectors. (A quick sketch of both is below.)
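For concreteness, a minimal sketch of both workarounds (my own illustration, not from any package):

using StaticArrays, SplitApplyCombine

v = [rand(SVector{3}) for _ in 1:10]

f = Iterators.flatten(v)                       # lazy, but generic over any iterable
w = vec(SplitApplyCombine.combinedimsview(v))  # view-like, also not SVector-aware

collect(f) == w == reduce(vcat, v)             # all three agree on the flattened values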

In my actual application length(v) is large. I can implement this easily; I am just asking whether a package already exists.

Does reinterpret(Float64, v) not do what you want?
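For example (a quick sanity check, continuing with the v from the question):

julia> r = reinterpret(Float64, v);  # zero-copy ReinterpretArray over v's buffer

julia> r == reduce(vcat, v)          # same values as the allocating version
true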

5 Likes

Yes, it does, thank you!

Also consider b = ccall(:jl_reshape_array, Array{Float64,1}, (Any, Any, Any), Array{Float64,1}, v, (sizeof(v)>>3,)), if the reinterpret array gives you performance trouble on read/write access.

At the very least, benchmark jl_reshape against reinterpret.

If the reinterpret array gives you perf trouble, then it is best to ask around here for help.

The jl_reshape call is really what you want, except that it lies to the compiler about TBAA, in a way that may or may not lead to UB / miscompilation in very specific circumstances in current, past, or future versions of Julia. (I'm insufficiently up-to-date about the details.)

The code for ReinterpretArray is, on a naive reading, a complete performance-killing abomination. In many cases, LLVM manages to remove the extraneous frills and the end result is OK; in some cases, LLVM fails to optimize it away, and you get a big slowdown.

What do you mean? AFAIU it just creates a new container for the same data, and is thus usually very fast:

julia> x = rand(SVector{3,Float64}, 10^6);

julia> @btime reinterpret(Float64, $x);
  2.534 ns (0 allocations: 0 bytes)

Yes, creation of the reinterpret array is fast, but accessing the same data through the new ReinterpretArray is sometimes very slow.

It is unfortunate that, AFAIU, Base does not supply a safe and reliably semi-performant way of doing that (the current ReinterpretArray is very hit-and-miss performance-wise, because it tries to work at the level of AbstractArray instead of at the level of raw memory).

julia> using StaticArrays, BenchmarkTools

julia> v = [rand(SVector{8}) for _ in 1:1024];

julia> r = reinterpret(Float64, v);

julia> b = ccall(:jl_reshape_array, Array{Float64,1}, (Any, Any, Any), Array{Float64,1}, v, (sizeof(v)>>3,));

julia> @benchmark sum(b)
BenchmarkTools.Trial: 10000 samples with 181 evaluations.
 Range (min … max):  592.331 ns …  1.384 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     615.978 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   624.047 ns ± 42.785 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▄▆▇██▇▆▅▃▂▂▁▁                                               ▂
  ████████████████▆▅▁▅▄▅▅▅▆▆▆▇▆▆▅▆▇██▇▇▇▆▆▇▆▅▆▆▃▄▃▆▅▆▆▅▅▆▅▆▇▇▇ █
  592 ns        Histogram: log(frequency) by time       851 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(r)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.024 μs …  18.987 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.041 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.117 μs ± 651.430 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂▁                                                         ▁
  ███▇█▇▇▇▇▇▆▅▄▃▃▃▁▁▁▄▁▁▁▃▃▁▃▁▃▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▃▃▅ █
  7.02 μs      Histogram: log(frequency) by time      10.5 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.
3 Likes

Are you aware of any open issue regarding that? If there isn't, it is probably a good idea to file one.

FWIW, the TensorCast.jl flattening view has the same performance on access as the ccall():

using StaticArrays
v = [rand(SVector{8}) for _ in 1:1024]
r = reinterpret(Float64, v)
b = ccall(:jl_reshape_array, Array{Float64,1}, (Any, Any, Any), Array{Float64,1}, v, (sizeof(v)>>3,))

using TensorCast
@cast c[j⊗i] := v[i][j]

@btime sum($r)      # 5.367 μs (0 allocs: 0 bytes)
@btime sum($b)      # 730 ns (0 allocs: 0 bytes)
@btime sum($c)      # 729 ns (0 allocs: 0 bytes)
1 Like

Lots of them, cf. Issues · JuliaLang/julia · GitHub

Keno occasionally tries to fix ReinterpretArray performance by coaxing LLVM to properly optimize yet another case, but that's whack-a-mole.

The only reliably fast way is "interpret this chunk of memory in the following way" (and yes, this means that strided views or transposed matrices fundamentally should not be reinterpreted! You don't reinterpret arrays / sequences of numbers, you reinterpret ranges of raw memory. And sure, this will expose rather sharp edges to users with respect to structure padding).
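To illustrate the "chunk of memory" framing (my own sketch, not an API anyone ships): you can wrap the raw buffer with unsafe_wrap, at the price of managing lifetimes yourself:

using StaticArrays

v = [rand(SVector{8}) for _ in 1:1024]
p = Ptr{Float64}(pointer(v))                            # raw pointer to v's buffer
b2 = unsafe_wrap(Array, p, 8 * length(v); own = false)  # Vector{Float64} over the same memory
# Caution: b2 does not keep v alive; root v (e.g. with GC.@preserve) for as
# long as b2 is in use, and never resize v while b2 exists.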

But that approach sits in the unfortunate position of having to wait, for several years now, for a proper formalization of Julia's aliasing model / TBAA.

If you want to help by either writing a fix or increasing awareness / pestering people, godspeed!

(I personally use jl_reshape, sprinkle @noinline, and pray that the compiler doesn't get smart enough to call me out)
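Concretely, something like this hypothetical flatview helper (the name is made up), combining the jl_reshape_array ccall from earlier in the thread with @noinline, with all the TBAA caveats above still applying:

using StaticArrays

@noinline function flatview(v::Vector{SVector{N,T}}) where {N,T}
    # Same jl_reshape_array trick as above, wrapped for convenience.
    ccall(:jl_reshape_array, Vector{T}, (Any, Any, Any),
          Vector{T}, v, (N * length(v),))
end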

1 Like

It doesn't. It creates a copy:

julia> v = [rand(SVector{8}) for _ in 1:1024]; @cast c[j⊗i] := v[i][j];

julia> pointer(v), pointer(c)
(Ptr{SVector{8, Float64}} @0x00000000043ab280, Ptr{Float64} @0x0000000004907a00)

In many cases, the entire point of the flattened view is that it becomes possible to mutate single entries (i.e. you reinterpret Vector{SVector} as Matrix).
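For instance (my own illustration), reshaping the reinterpreted vector gives a writable matrix view of the same memory:

julia> v = [rand(SVector{3,Float64}) for _ in 1:4];

julia> m = reshape(reinterpret(Float64, v), 3, :);  # 3×4 view; column i aliases v[i]

julia> m[1, 2] = 0.0;  # writes through to the underlying buffer

julia> v[2][1]
0.0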

3 Likes

@Elrod has a package that handles the strided arrays case:

Thanks @fooobar_lv2. And sorry for the bad TensorCast.jl pointer; the expression with := does not produce a view in this case.