ReinterpretedArray Performance (even worse on 1.8)

I’d like to use reinterpreted arrays to write into and read from an array. Here is a small script that is a bit oversimplified but not too far from my real use case.

using BenchmarkTools

function cheb!(A, x)
   A[1] = 1 
   A[2] = x 
   for n = 3:length(A) 
      A[n] = 2 * x * A[n-1] - A[n-2]
   end
end

# Standard Array 
A = zeros(100)
# Reinterpreted Array 
B = reinterpret(Float64, zeros(UInt8, 100 * sizeof(Float64)))

# simple benchmark
x = rand()
print("           Array: "); @btime cheb!($A, $x)
print("ReinterpretArray: "); @btime cheb!($B, $x)
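As a quick sanity check (my addition, not part of the original post): cheb! fills A with the Chebyshev polynomials, i.e. A[k+1] == T_k(x), which can be compared against the closed form T_k(x) = cos(k * acos(x)) for |x| <= 1:

```julia
# same recurrence as in the post
function cheb!(A, x)
   A[1] = 1
   A[2] = x
   for n = 3:length(A)
      A[n] = 2 * x * A[n-1] - A[n-2]
   end
end

x = 0.3
A = zeros(10)
cheb!(A, x)
# closed form: T_k(x) = cos(k * acos(x)); A[k+1] should equal T_k(x)
@assert all(isapprox(A[k+1], cos(k * acos(x)); atol = 1e-12) for k in 0:9)
```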

I had always assumed that this abstraction would be free, but apparently not. It is OK (if not great) on Julia 1.7, but terrible on Julia 1.8:

Output:

> j17 chebtest.jl
           Array:    454.965 ns (0 allocations: 0 bytes)
ReinterpretArray:   690.476 ns (0 allocations: 0 bytes)
> j18 chebtest.jl
           Array:   198.605 ns (0 allocations: 0 bytes)
ReinterpretArray:   965.647 ns (0 allocations: 0 bytes)

Julia Versions:

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635* (2022-02-06 15:21 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.2.0)
  CPU: Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)

julia> versioninfo()
Julia Version 1.8.0-beta3
Commit 3e092a2521 (2022-03-29 15:42 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.2.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores

Updated script, adding @inbounds and bringing two more variants, unsafe_wrap and UnsafeArrays, into the mix. This largely seems to resolve the problem?

using BenchmarkTools, UnsafeArrays

function cheb!(A, x)
   A[1] = 1 
   A[2] = x 
   for n = 3:length(A) 
      @inbounds A[n] = 2 * x * A[n-1] - A[n-2]
   end
end

# Standard Array 
A = zeros(100)
# Reinterpreted Array 
B = reinterpret(Float64, zeros(UInt8, 100 * sizeof(Float64)))
# unsafe_wrap 
_C = zeros(UInt8, 100 * sizeof(Float64))
ptr = Base.unsafe_convert(Ptr{Float64}, _C)
C = Base.unsafe_wrap(Array, ptr, 100)
# UnsafeArrays
D = UnsafeArray(ptr, (100,))

# simple benchmark
x = rand()
print("           Array: "); @btime cheb!($A, $x)
print("ReinterpretArray: "); @btime cheb!($B, $x)
print("          unsafe: "); @btime cheb!($C, $x)
print("     UnsafeArray: "); @btime cheb!($D, $x)

Results:

> j17 chebtest.jl                                    7s
           Array:   169.928 ns (0 allocations: 0 bytes)
ReinterpretArray:   426.822 ns (0 allocations: 0 bytes)
          unsafe:   170.455 ns (0 allocations: 0 bytes)
     UnsafeArray:   187.381 ns (0 allocations: 0 bytes)

> j18 chebtest.jl                                    8s
           Array:   185.108 ns (0 allocations: 0 bytes)
ReinterpretArray:   230.561 ns (0 allocations: 0 bytes)
          unsafe:   183.610 ns (0 allocations: 0 bytes)
     UnsafeArray:   189.774 ns (0 allocations: 0 bytes)

I thought it might be something weird about the M1, but I see the same on an AMD EPYC Rome:

j17 test_cheb.jl 
           Array:   373.473 ns (0 allocations: 0 bytes)
ReinterpretArray:   962.550 ns (0 allocations: 0 bytes)
          unsafe:   373.473 ns (0 allocations: 0 bytes)
     UnsafeArray:   161.749 ns (0 allocations: 0 bytes)

EDIT: I should add that this run is without @inbounds, so it seems that UnsafeArray doesn’t do bounds checks, which explains this behaviour…

I once again ask for Base to have this:

function unsafe_arraycast(::Type{D}, ary::Vector{S}) where {S, D}
    l = sizeof(S)*length(ary)÷sizeof(D)
    res = ccall(:jl_reshape_array, Vector{D}, (Any, Any, Any), Vector{D}, ary, (l,))
    return res
end

How is this related to Base.unsafe_wrap and to UnsafeArrays?

I think this is basically what you did with unsafe_wrap + unsafe_convert.

And maybe my most important question: is there a reason for me not to use unsafe_wrap or UnsafeArrays, as long as I always keep around the original reference? E.g. like this:

struct MyVector{T} <: AbstractVector{T}  # Vector is concrete, so subtype AbstractVector
   _A::Vector{UInt8}  # keep the original buffer alive
   A::Vector{T}
end
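Since Vector is a concrete type and cannot be subtyped, such a wrapper has to be an AbstractVector implementing the array interface. A minimal working sketch of the idea (hypothetical name BufferVector, my addition, not from the thread), keeping the byte buffer alive alongside an unsafe_wrap'ed view:

```julia
# Hypothetical sketch: pair the original UInt8 buffer with an
# unsafe_wrap'ed Float64 (etc.) view of the same memory. Keeping the
# buffer in the struct protects it from the GC.
struct BufferVector{T} <: AbstractVector{T}
    _A::Vector{UInt8}  # original owner of the memory
    A::Vector{T}       # unsafe_wrap'ed view into the same memory
end

function BufferVector{T}(n::Integer) where {T}
    buf = zeros(UInt8, n * sizeof(T))
    ptr = Base.unsafe_convert(Ptr{T}, buf)
    BufferVector{T}(buf, Base.unsafe_wrap(Array, ptr, n))
end

# minimal AbstractVector interface, forwarding to the view
Base.size(v::BufferVector) = size(v.A)
Base.@propagate_inbounds Base.getindex(v::BufferVector, i::Int) = v.A[i]
Base.@propagate_inbounds Base.setindex!(v::BufferVector, x, i::Int) = (v.A[i] = x)

v = BufferVector{Float64}(100)
v[1] = 1.5
```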

And then, still in the end: isn’t the incredibly poor performance of reinterpreted arrays on Julia 1.8 strange when bounds checking is enabled?

I think as long as you keep the reference to the original Vector you are safe to use both; the GC will not free the memory precisely because you keep the original reference.
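For completeness (my addition): if you only need the wrapped array temporarily and don’t want to carry the original reference around, GC.@preserve can pin the buffer for the duration of the use. A sketch:

```julia
buf = zeros(UInt8, 100 * sizeof(Float64))   # original owner of the memory
GC.@preserve buf begin
    ptr = Base.unsafe_convert(Ptr{Float64}, buf)
    C = Base.unsafe_wrap(Array, ptr, 100)   # valid only while buf is preserved
    C[1] = 1.0
    # ... use C here; do not let C escape this block
end
# the write landed in buf's memory, as reinterpret confirms
@assert reinterpret(Float64, buf)[1] == 1.0
```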

This is truly unsafe, but gives the best performance because it gives you a native Array:

julia> unsafe_arraycast(Float64, rand(UInt8, 64))
8-element Vector{Float64}:
 -6.079564859434036e242
  2.8652427427119243e252
 -7.940145865008032e-108
  8.320185091615792e-8
  5.427223515701773e188
  9.184586067914578e-204
  2.342950753478369e-31
  2.653381005394009e266

@jling - thank you. And the same principle applies here - I need to keep the reference to the original array?

Nope, jl_reshape_array handles that.

I think no, as the underlying memory is the same. You can check it with:

julia> function unsafe_arraycast(::Type{D}, ary::Vector{S}) where {S, D}
           l = sizeof(S)*length(ary)÷sizeof(D)
           res = ccall(:jl_reshape_array, Vector{D}, (Any, Any, Any), Vector{D}, ary, (l,))
           return res
       end
unsafe_arraycast (generic function with 1 method)

julia> A = zeros(UInt8, 10 * sizeof(Float64));

julia> B = unsafe_arraycast(Float64, A);

julia> pointer(A)
Ptr{UInt8} @0x00007f744d5eba28

julia> pointer(B)
Ptr{Float64} @0x00007f744d5eba28

That’s really nice - thanks for the suggestion

Why do you label it unsafe then?

Because it is super unsafe, and the Julia devs are strongly against even having this as an unsafe_* function in Base.

Notice that this doesn’t work before 1.7 and is likely to break again in the future when jl_reshape_array changes.

In what sense is it “super unsafe”? Is it because of some Julia internals? Memory aliasing with different types? :thinking: :confused:

Weird, I’m testing on v1.7 and it seems to work just fine. :sweat_smile:

“Before 1.7” means it doesn’t work on 1.6.

We should ask @jameson, I guess.

Sorry, misread. :sweat_smile:

Okey dokey. :eyes: