Big overhead with the new lazy reshape/reinterpret

If I do a @gc_preserve a, will a never be garbage-collected even if we exit the scope of b?

b does not prevent a from being garbage collected. If you use b after a gets garbage collected, too bad.

@gc_preserve a begin ... end prevents a from being collected as long as the begin… end block runs.

Realistically, you probably can store a long-time reference to a somewhere (in the root-node of your tree?) and never expose the unsafe_wrapped b to users. Then you are totally fine.

Example (as far as I understood, anyone please correct me if I’m wrong here):

function foo_wrong(n)
    a = rand(3, n)
    b = unsafe_wrap(Array, reinterpret(Ptr{SVector{3,Float64}}, pointer(a)), (size(a, 2),), false)
    return sum(b)  # unsafe: nothing keeps `a` alive, so it may be collected before (or during) the sum
end

function foo_notwrong(n)
    a = rand(3, n)
    b = unsafe_wrap(Array, reinterpret(Ptr{SVector{3,Float64}}, pointer(a)), (size(a, 2),), false)
    Base.@gc_preserve a begin
        s = sum(b)  # `a` is guaranteed to stay alive for the duration of this block
    end
    return s
end

Yes, that was my impression. I think Keno had good technical reasons (requirements for future progress with the compiler) to do the ReinterpretArray changes.

Thanks, understood!

To add even more details:

Garbage collection does not respect scopes. “Going out of scope” is the notion of visibility described in the docs; in reality, an object can be collected once the compiler believes that it will not be accessed anymore (and the compiler does reorder instructions!).

The compiler gets more and more clever, so all these pointer tricks are somewhat dangerous. I personally tend to stash references away in some very GC-visible mutable place that is reachable from the user, so that I don’t have to think about what exactly the compiler will infer (or, for very short-lived objects, I use @gc_preserve).
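As a concrete sketch of that “stash a reference in a GC-visible place” idea: a wrapper type can hold the backing array alongside the unsafe view. All names here are hypothetical, `NTuple{3,Float64}` stands in for `SVector{3,Float64}` to stay dependency-free, and the keyword form `own = false` is the Julia 1.x spelling of the flag discussed below.

```julia
# Hypothetical wrapper: the `backing` field roots the memory for the GC,
# so the unsafe view in `cols` can never outlive its data.
struct WrappedColumns
    backing::Matrix{Float64}          # keeps the original array alive
    cols::Vector{NTuple{3,Float64}}   # unsafe reinterpreted view into `backing`
end

function WrappedColumns(a::Matrix{Float64})
    @assert size(a, 1) == 3
    p = convert(Ptr{NTuple{3,Float64}}, pointer(a))
    # own = false: the wrapper must not free memory it does not own.
    return WrappedColumns(a, unsafe_wrap(Array, p, size(a, 2); own = false))
end

w = WrappedColumns(ones(3, 2))
w.cols[1]  # (1.0, 1.0, 1.0); reachable `w.backing` keeps the data alive
```

As long as users only ever see `w` (and not the bare `cols` vector), the backing array stays reachable and the pointer can never dangle.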

Also, the “false” for the unsafe_wrap is essential: it tells Julia that the wrapper does not own the memory, so the GC will not try to free a buffer that the original array already owns (you don’t want a double free).
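To illustrate what the ownership flag means (a minimal sketch using the Julia 1.x keyword form `own`): `own = false` is for memory some other Julia object already owns, while `own = true` is only for buffers Julia should adopt and free itself, e.g. from Libc.malloc.

```julia
# own = true: Julia's GC adopts this malloc'd buffer and will call free on it,
# so we must NOT free it ourselves (that would be the double free).
p = convert(Ptr{Float64}, Libc.malloc(4 * sizeof(Float64)))
owned = unsafe_wrap(Array, p, 4; own = true)
owned .= 1.0
sum(owned)  # 4.0
```

Had we wrapped `pointer(a)` of an existing Array with `own = true`, both the GC (for `a`) and the wrapper would eventually free the same buffer.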


Still seeing this problem, although not quite as badly as in @pablosanjose’s example.

I’m getting a median time of 2.6 ns for accessing an array created with unsafe_wrap, and a median time of 10.0 ns for a reinterpreted array.

I was wondering if there has been any further progress or investigation on this?

I’m getting some segfaults I can’t figure out so more than ever I’m itching to be able to use reinterpret. I’ve had a very hard time figuring out what about reinterpret is even slow. Sometimes accessing individual elements seems perfectly fine, but then I’ll run it through a function or something and it’s inexplicably slow.

I’m also interested to know. Just two days ago I checked out master to try whether there was any progress using the original example I posted, and I didn’t see any improvement yet.

+1

Still no improvements on master:

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.7.0-DEV.5152 (2018-05-21 21:19 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit dc30e38 (0 days old master)
|__/                   |  x86_64-linux-gnu

julia> using StaticArrays

julia> m = rand(3, 1_000_000);

julia> v = reinterpret(SVector{3,Float64}, m, (1_000_000,));
┌ Warning: `reinterpret(::Type{T}, a::Array{S}, dims::NTuple{N, Int}) where {T, S, N}` is deprecated, use `reshape(reinterpret(T, vec(a)), dims)` instead.
│   caller = top-level scope
└ @ Core :0

julia> sum(v);

julia> @time sum(v);
  0.050470 seconds (9 allocations: 320 bytes)

julia> typeof(v)
Base.ReinterpretArray{SArray{Tuple{3},Float64,1,3},1,Float64,Array{Float64,1}}

versus v0.6.2

  _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  
|__/                   |  x86_64-linux-gnu

julia> using StaticArrays

julia> m = rand(3, 1_000_000);

julia> v = reinterpret(SVector{3,Float64}, m, (1_000_000,));

julia> sum(v);

julia> @time sum(v);
  0.004232 seconds (84 allocations: 6.123 KiB)

julia> typeof(v)
Array{StaticArrays.SArray{Tuple{3},Float64,1,3},1}

One interesting thing: the allocations have gone way down, 84 → 9 and 6.123 KiB → 320 bytes, even though it’s still about 12x slower.
Have you been profiling this?

The allocations are spurious (due to partial compilation stuff?)

julia> @btime sum($v);  # on 0.6.2
  2.106 ms (0 allocations: 0 bytes)

I think the proper solution is a Base.unsafe_reinterpret that implements the old reinterpret (i.e. the return type does not know that the array is reinterpreted).

I think the reason for the new reinterpret is that some aliasing assumptions changed, which now makes it unsafe to use both A and unsafe_reinterpret(T, A, ...) = unsafe_wrap(Array, convert(Ptr{T}, pointer(A)), ...) in the same loop. In 99% of the cases you don’t do that, and you can impose a @noinline function boundary between the loops (so the fact that we lie to LLVM about aliasing does not matter).
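A sketch of that @noinline boundary idea (the helper name is made up, and `NTuple{3,Float64}` stands in for `SVector{3,Float64}` to avoid the StaticArrays dependency): the loop over the aliasing wrapper lives behind a non-inlined function, so LLVM never sees loads through both `a` and the wrapper in one loop body.

```julia
# Hypothetical sketch: confine the unsafe_wrap trick behind a @noinline
# boundary and keep `a` rooted with GC.@preserve for the duration.
@noinline function sum_as_triples(a::Matrix{Float64})
    s = GC.@preserve a begin
        p = convert(Ptr{NTuple{3,Float64}}, pointer(a))
        b = unsafe_wrap(Array, p, size(a, 2); own = false)
        # Sum all components of each 3-element "column" tuple.
        sum(t -> t[1] + t[2] + t[3], b)
    end
    return s
end

sum_as_triples(ones(3, 1000))  # 3000.0
```

Callers only ever see the scalar result, so the aliasing wrapper never escapes the function boundary.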

At some point I looked carefully at the ReinterpretArray code, there is a lot going on during getindex calls. I suspect some of this will need to be rethought. getindex really needs to be a no-op.

I suspect unless someone wants to take this on in earnest now we will be stuck with unsafe_wrap for 1.0.

No, I think array.c should do all the thinking for reinterpreting arrays in some Base.unsafe_reinterpret. That is, this kind of reinterpret is the job for someone who knows all the array flags, all the multi-threaded and multiprocessor special cases, who knows which array to lock when, etc. Think of the following:

julia> A = [1, 2, 3];

julia> B = reinterpret(UInt64, A);

julia> push!(A, 1);
ERROR: cannot resize array with shared data

julia> push!(B, 1);

I am sceptical that allowing push! to B is a good idea. But some author of array.c thought about this, and every end-user resorting to pointer games is definitely worse than core people giving us the best and safest thing that can be done with zero-overhead, and all the segfaults we asked for when zero-overhead is not possible.

The amount of code is not always a good sign of how much will happen in the runtime. It has been confirmed that this all used to optimize away but now fails to do so.


Fair enough, but I remember coming to the conclusion that in many cases this would not be a no-op (it’s been quite a while since I’ve looked at it carefully, so maybe I don’t know what I’m talking about).

I don’t ever remember a time when there weren’t problems with this, though. At what point did it get broken?

Even if true, this does not make a Base.unsafe_reinterpret obsolete. The mere fact that
y ? reinterpret(some_T, x) : something::Array{some_T} is now type-unstable makes it sometimes necessary to really reinterpret memory regions, not Julia objects. And the old reinterpret code is perfectly fine for this job, and so much better than unsafe_wrap hacks that will break on first contact with arrays that have a nonzero offset.
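That type instability is easy to see directly on 0.7+ (a minimal sketch; the function name is made up): the two branches now return different concrete types, so the conditional cannot infer a single return type.

```julia
# On 0.7+, reinterpret returns a lazy Base.ReinterpretArray, so a branch
# mixing it with a plain Array is no longer type-stable:
x = rand(UInt64, 4)
f(y) = y ? reinterpret(Int64, x) : zeros(Int64, 4)

typeof(f(true))   # Base.ReinterpretArray{Int64,1,UInt64,...}
typeof(f(false))  # Array{Int64,1} — a different concrete type
```

On 0.6 both branches produced an `Array{Int64,1}`, so the same code used to be type-stable.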

Having an unsafe_reinterpret that is basically just a safe version of an unsafe_wrapped array does seem like it would be a useful thing. Basically my current workaround for the reinterpret slowness is to do this by default, but also to allow views of unsafe_wrapped arrays when that is guaranteed safe.

After you unsafe_wrap an array, it is indistinguishable from a “proper” array; that’s what the old reinterpret gave you. But a one-liner that grabs the pointer and unsafe-wraps it is not optimal: in principle you should handle offsets, lock arrays against resizing, and possibly handle other array flags. Consider:

julia> zz = [];
       for i = 1:50_000
           n = rand(1:30)
           A = zeros(UInt32, n*n + 1);
           deleteat!(A, 1);
           #B = unsafe_wrap(Array, pointer(A), (n, n))
           B = reinterpret(UInt32, A, (n, n))
           push!(zz, B);
           (i % 1000 == 0) && gc()
       end;

julia> all(iszero.(zz))
true

julia> zz = [];
       for i = 1:50_000
           n = rand(1:30)
           A = zeros(UInt32, n*n + 1);
           deleteat!(A, 1);
           B = unsafe_wrap(Array, pointer(A), (n, n))
           #B = reinterpret(UInt32, A, (n, n))
           push!(zz, B);
           (i % 1000 == 0) && gc()
       end;

julia> all(iszero.(zz))
false

edit: So we see that we corrupted memory in the second variant (all on 0.6.2): the deleteat! gave A a nonzero offset, which the pointer-grabbing one-liner silently ignores. Expecting users to handle these subtleties is slightly harsh, imo.

https://github.com/JuliaLang/julia/pull/27213


You are the man! Thank you so much Keno!