Big overhead with the new lazy reshape/reinterpret

I’m finding important performance regressions when indexing into the new reshape/reinterpret wrappers. (This is causing problems in trying to update NearestNeighbors.jl)

Current v0.7 master

julia> using StaticArrays
julia> a = rand(3, 10^4); b = reshape(reinterpret(SVector{3, Float64}, a), (size(a, 2),))
julia> @btime $b[5][2]
  18.164 ns (0 allocations: 0 bytes)

v0.6.1

julia> using StaticArrays
julia> a = rand(3, 10^4); b = reinterpret(SVector{3, Float64}, a, (size(a, 2),));
julia> @btime $b[5][2]
  1.346 ns (0 allocations: 0 bytes)

What am I doing wrong? Is this temporary, perhaps, waiting for some other PR?


This will hurt me too very soon.


I confess I gave up and resorted to copying instead of reinterpreting. Attempting to improve this issue is still way above my head.
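For reference, the copying fallback looks roughly like this (a sketch only; the exact construction used in the package may differ). It trades O(n) extra memory for plain Vector indexing with no reinterpret wrapper involved:

```julia
using StaticArrays

a = rand(3, 10^4)
# Copy each column into a fresh SVector; `b` is then an ordinary
# Vector{SVector{3,Float64}} with fast indexing, but the data is duplicated.
b = [SVector{3, Float64}(a[1, i], a[2, i], a[3, i]) for i in 1:size(a, 2)]
```

After this, b[5][2] == a[2, 5], and mutating a no longer affects b.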

Maybe it is worth opening an issue on GitHub.

Tim Holy already did so: https://github.com/JuliaLang/julia/issues/25014

Cheers,
Kevin


Have you tried the following:

@assert typeof(a) == Array{Float64, 2}
@assert size(a, 1) == 3
b = unsafe_wrap(Array, reinterpret(Ptr{SVector{3, Float64}}, pointer(a)), (size(a, 2),), false)

Of course this is a dirty workaround, and it possibly requires you to @gc_preserve a.

Thanks for the tip @foobar_lv2, this indeed works as fast as before! Is this equivalent to using the old version of reinterpret? If I do a @gc_preserve a, will a never be garbage-collected even if we exit the scope of b?

I will mark this as the solution to the question, although, being somewhat of a hack, I’m not sure it is really acceptable as a long-term solution for the NearestNeighbors.jl package. I’ll ping @kristoffer.carlsson to see what he thinks…

Well, it circumvents the reason ReinterpretArray was introduced at all (TBAA), so we are effectively lying to the compiler. I don’t know the internals well enough to predict what effect this might have.

If I do a @gc_preserve a, will a never be garbage-collected even if we exit the scope of b?

b does not prevent a from being garbage collected. If you use b after a gets garbage collected, too bad.

@gc_preserve a begin ... end prevents a from being collected as long as the begin… end block runs.

Realistically, you probably can store a long-time reference to a somewhere (in the root-node of your tree?) and never expose the unsafe_wrapped b to users. Then you are totally fine.

Example (as far as I understood, anyone please correct me if I’m wrong here):

function foo_wrong(n)
    a = rand(3, n)
    # Nothing roots `a` here: the compiler may free it before (or during) sum(b).
    b = unsafe_wrap(Array, reinterpret(Ptr{SVector{3, Float64}}, pointer(a)), (size(a, 2),), false)
    return sum(b)
end

function foo_notwrong(n)
    a = rand(3, n)
    b = unsafe_wrap(Array, reinterpret(Ptr{SVector{3, Float64}}, pointer(a)), (size(a, 2),), false)
    # `a` is guaranteed to stay alive for the duration of the begin…end block.
    Base.@gc_preserve a begin
        s = sum(b)
    end
    return s
end

Yes, that was my impression. I think Keno had good technical reasons (requirements for future progress with the compiler) to do the ReinterpretArray changes.

Thanks, understood!

To add even more details:

Garbage collection does not respect scopes. “Going out of scope” is a notion of visibility described in the docs; in reality, an object can be collected as soon as the compiler believes it will not be accessed anymore (and the compiler does reorder instructions!).

The compiler gets more and more clever, so all these pointer tricks are somewhat dangerous. I personally tend to stash away references in some very gc visible mutable place that is reachable from the user, in order to not have to think about what exactly the compiler will infer (or for very short-lived objects, @gc_preserve).
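A sketch of that stashing pattern, using the same 0.6-era unsafe_wrap call as the posts above (the type and field names here are hypothetical, not from any package): the wrapper holds a reference to the backing Matrix, so as long as the wrapper itself is reachable, the memory behind the view cannot be collected.

```julia
using StaticArrays

# Hypothetical wrapper: `parent` is a GC-visible rooted reference to the backing
# storage, `points` is the unsafe_wrap'ed view over the same memory.
struct PointData
    parent::Matrix{Float64}
    points::Vector{SVector{3, Float64}}
end

function PointData(a::Matrix{Float64})
    @assert size(a, 1) == 3
    b = unsafe_wrap(Array, reinterpret(Ptr{SVector{3, Float64}}, pointer(a)),
                    (size(a, 2),), false)
    PointData(a, b)
end
```

Only the wrapper (or the fast view internally, behind the @gc_preserve pattern) would ever be handed to users, never the raw unsafe_wrapped array on its own.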

Also, the “false” for the unsafe_wrap is essential (you don’t want a double free).


Still seeing this problem, although not quite as badly as in @pablosanjose’s example.

I’m getting a median time of 2.6 ns for accessing an array created with unsafe_wrap, and a median time of 10.0 ns for a reinterpreted array.

I was wondering if there has been any further progress or investigation on this?

I’m getting some segfaults I can’t figure out, so more than ever I’m itching to be able to use reinterpret. I’ve had a very hard time figuring out what about reinterpret is even slow. Sometimes accessing individual elements seems perfectly fine, but then I’ll run it through a function or something and it’s inexplicably slow.

I’m also interested to know. Just two days ago I checked out master to try whether there was any progress using the original example I posted, and I didn’t see any improvement yet.

+1

Still no improvements on master:

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.7.0-DEV.5152 (2018-05-21 21:19 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit dc30e38 (0 days old master)
|__/                   |  x86_64-linux-gnu

julia> using StaticArrays

julia> m = rand(3, 1_000_000);

julia> v = reinterpret(SVector{3,Float64}, m, (1_000_000,));
┌ Warning: `reinterpret(::Type{T}, a::Array{S}, dims::NTuple{N, Int}) where {T, S, N}` is deprecated, use `reshape(reinterpret(T, vec(a)), dims)` instead.
│   caller = top-level scope
└ @ Core :0

julia> sum(v);

julia> @time sum(v);
  0.050470 seconds (9 allocations: 320 bytes)

julia> typeof(v)
Base.ReinterpretArray{SArray{Tuple{3},Float64,1,3},1,Float64,Array{Float64,1}}

versus v0.6.2

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-linux-gnu

julia> using StaticArrays

julia> m = rand(3, 1_000_000);

julia> v = reinterpret(SVector{3,Float64}, m, (1_000_000,));

julia> sum(v);

julia> @time sum(v);
  0.004232 seconds (84 allocations: 6.123 KiB)

julia> typeof(v)
Array{StaticArrays.SArray{Tuple{3},Float64,1,3},1}

One interesting thing: the allocations have gone way down (84 → 9, 6.123 KiB → 320 bytes), even though it’s still about 12x slower.
Have you been profiling this?

The allocations are spurious (due to partial compilation stuff?)

julia> @btime sum($v); # on 0.6.2
  2.106 ms (0 allocations: 0 bytes)

I think the proper solution is a Base.unsafe_reinterpret that implements the old reinterpret (i.e. the return type does not know that the array is reinterpreted).

I think the reason for the new reinterpret is that some aliasing assumptions changed, which now makes it unsafe to use both A and unsafe_reinterpret(T, A, ...) = unsafe_wrap(Array, convert(Ptr{T}, pointer(A)), ...) in the same loop. In 99% of cases you don’t do that, and you can impose a @noinline function boundary between the loops (hence the fact that we lie to LLVM about aliasing does not matter).
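That boundary pattern might look like the following (a sketch under the assumptions above, reusing the 0.6-era unsafe_wrap call from earlier in the thread; process and total are hypothetical names):

```julia
using StaticArrays

# The hot loop lives behind a @noinline boundary, so code that "lies" to LLVM
# about aliasing is never inlined into a loop that also touches `a` directly.
@noinline process(b) = sum(b)

function total(a::Matrix{Float64})
    b = unsafe_wrap(Array, reinterpret(Ptr{SVector{3, Float64}}, pointer(a)),
                    (size(a, 2),), false)
    # Keep `a` rooted while the wrapped view is in use.
    Base.@gc_preserve a process(b)
end
```

Inside process, only b is visible, so the compiler never has to reason about a and b aliasing in the same loop body.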

At some point I looked carefully at the ReinterpretArray code, there is a lot going on during getindex calls. I suspect some of this will need to be rethought. getindex really needs to be a no-op.

I suspect unless someone wants to take this on in earnest now we will be stuck with unsafe_wrap for 1.0.