Also consider b = ccall(:jl_reshape_array, Array{Float64,1}, (Any, Any, Any), Array{Float64,1}, v, (sizeof(v)>>3,)) if the ReinterpretArray gives you performance trouble on read/write access.
At the very least, benchmark jl_reshape against reinterpret.
If the reinterpret-array approach gives you perf trouble, it's best to ask around here for help.
The jl_reshape approach is really what you want, except that it lies to the compiler about TBAA (type-based alias analysis), in a way that may or may not lead to UB / miscompiles in very specific circumstances in current, past, or future versions of julia. (I'm insufficiently up-to-date on the details.)
The code for ReinterpretArray is, on a naive reading, a complete performance-killing abomination. In many cases, llvm manages to remove the extraneous frills and the end result is OK; in some cases, llvm fails to optimize them away, and you get a big slowdown.
Yes, creation of the reinterpret array is fast, but accessing the same data through the new ReinterpretArray is sometimes very slow.
It is unfortunate that afaiu Base does not supply a safe and reliably semi-performant way of doing that (the current ReinterpretArray is very hit-and-miss performance-wise, because it tries to work on the level of AbstractArray instead of the level of raw memory).
julia> using StaticArrays, BenchmarkTools
julia> v = [rand(SVector{8}) for _ in 1:1024]; r = reinterpret(Float64, v); b = ccall(:jl_reshape_array, Array{Float64,1}, (Any, Any, Any), Array{Float64,1}, v, (sizeof(v)>>3,));
julia> @benchmark sum(b)
BenchmarkTools.Trial: 10000 samples with 181 evaluations.
 Range (min … max):  592.331 ns … 1.384 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     615.978 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   624.047 ns ±  42.785 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram bars lost in transcription]
  592 ns          Histogram: log(frequency) by time           851 ns <
Memory estimate: 16 bytes, allocs estimate: 1.
julia> @benchmark sum(r)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.024 μs … 18.987 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.041 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.117 μs ± 651.430 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram bars lost in transcription]
  7.02 μs         Histogram: log(frequency) by time          10.5 μs <
Memory estimate: 16 bytes, allocs estimate: 1.
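Same memory, same reduction: the only difference is whether indexing goes through the plain Array b or the ReinterpretArray r, and here that costs roughly 11x at the median (615.978 ns vs 7.041 μs).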
Keno occasionally tries to fix ReinterpretArray performance by coaxing llvm to properly optimize yet another case, but that's whack-a-mole.
The only reliably fast way is “interpret this chunk of memory in the following way” (and yes, this means that strided views or transposed matrices fundamentally should not be reinterpreted! You don't reinterpret arrays / sequences of numbers, you reinterpret ranges of raw memory. And sure, this will expose rather sharp edges to users with respect to structure padding).
But that approach sits in the unfortunate position of having to wait on a proper formalization of julia's aliasing model / TBAA, which has been pending for several years.
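One raw-memory escape hatch that does exist today is unsafe_wrap (a sketch I'm adding here, not something from above): the wrapped Array aliases v's buffer but does not root v, so you must keep v alive yourself around every access.

julia> w = unsafe_wrap(Array, Ptr{Float64}(pointer(v)), 8 * length(v));  # plain Vector{Float64} over v's buffer

julia> GC.@preserve v sum(w);  # v must stay rooted while w is in use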
If you want to help by either writing a fix or increasing awareness / pestering people, godspeed!
(I personally use jl_reshape, sprinkle @noinline, and pray that the compiler doesn't get smart enough to call me out.)
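For concreteness, a minimal sketch of such a helper (flatview is a name I made up; it's the same jl_reshape_array call as above, kept behind a @noinline function barrier so the optimizer sees the result as an opaque Array):

julia> @noinline flatview(v::Vector{SVector{N,T}}) where {N,T} =
           ccall(:jl_reshape_array, Vector{T}, (Any, Any, Any),
                 Vector{T}, v, (N * length(v),));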
julia> using TensorCast

julia> v = [rand(SVector{8}) for _ in 1:1024]; @cast c[j⊗i] := v[i][j];
julia> pointer(v), pointer(c)
(Ptr{SVector{8, Float64}} @0x00000000043ab280, Ptr{Float64} @0x0000000004907a00)
In many cases, the entire point of the flattened view is that it becomes possible to mutate single entries (i.e. you reinterpret Vector{SVector} as Matrix).
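For that use case, Base's reinterpret(reshape, ...) (Julia 1.6+) does give a no-copy, mutable view that writes through to v; it just carries all the ReinterpretArray performance caveats from above:

julia> m = reinterpret(reshape, Float64, v);  # 8×1024 view of v's memory, no copy

julia> m[1, 1] = 0.0; v[1][1]  # the write is visible through v
0.0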