Help understanding failed optimization when defining Base.similar for AbstractArray interface


#1

Hello,

I encountered this behavior that I don’t understand. I am defining a wrapper around an array and using the AbstractArray interface:

abstract type aw{T,N} <: AbstractArray{T,N} end
struct wrap2{T,N} <: aw{T,N}
    v::Array{T,N}
end
Base.size(w::wrap2) = size(w.v)
@inline Base.getindex(w::wrap2, i) = (@boundscheck checkbounds(w, i); @inbounds w.v[i])
@inline Base.setindex!(w::wrap2, v, i) = (@boundscheck checkbounds(w, i); @inbounds w.v[i] = v)
#@inline Base.similar(w::wrap2, ::Type{T}, dims::Dims{N}) where {T,N} = wrap2{T,N}(similar(w.v, T, dims))

Now, I define a function that can be optimized to a no-op:

bb(x)=(x.=x.*1.0)

rrr=rand(10^7);wwr=wrap2(copy(rrr));
@time bb(rrr);
@time bb(wwr);
  0.000002 seconds (4 allocations: 160 bytes)
  0.000001 seconds (3 allocations: 144 bytes)

The compiler is smart enough to optimize this call away (the timing is independent of the array size).

However, if I uncomment the definition of similar above, the optimization no longer happens for wrap2. I cannot understand why, since similar is never even called!

I looked at the output of code_typed (don’t trust me, I am no expert on this) and the only place where I saw a significant difference is:

With similar defined:

│││││││││             18 ─ %49  = invoke Base.unaliascopy(_2::wrap2{Float64,1})::wrap2{Float64,1}
│││││││││             └───        goto #20

Without similar:

│││││││││             18 ─        invoke Base.unaliascopy(_2::wrap2{Float64,1})::Union{}
│││││││││             └───        $(Expr(:unreachable))::Union{}
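For anyone who wants to reproduce the comparison, the typed IR above comes from @code_typed. A minimal sketch of the setup (the intermediate abstract type and bounds-check macros from my definitions are omitted for brevity, and a small array is enough since we only inspect the IR):

```julia
# Simplified wrapper, repeated so this snippet runs on its own.
struct wrap2{T,N} <: AbstractArray{T,N}
    v::Array{T,N}
end
Base.size(w::wrap2) = size(w.v)
Base.getindex(w::wrap2, i::Int) = w.v[i]
Base.setindex!(w::wrap2, x, i::Int) = (w.v[i] = x)

bb(x) = (x .= x .* 1.0)

wwr = wrap2(rand(10))
@code_typed bb(wwr)   # look for the call to Base.unaliascopy in the output
```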

Can someone please help me understand what is going on, and how to fix it?

Thanks!


#2

Interesting. I don’t have the full story here yet, but the difference in IR does make sense. In an expression like:

x .= @view x[end:-1:1]

Julia’s broadcasting infrastructure will detect that the right- and left-hand sides alias in a manner that would cause incorrect results as it copies the elements iteratively. To prevent such incorrect results, it makes a copy of the array on the RHS before modifying the LHS. This is done by unaliascopy.

Now there’s a problem here: we make that copy of the RHS with a plain old copy(rhs) call, and copy uses similar to allocate the new container. If you don’t specialize similar yourself, you get the generic fallback implementation that just returns an Array. If that were the end of it, every single broadcast operation you perform without a similar definition would be type-unstable! That would be quite a performance speed-bump to pay everywhere, just to support the fairly rare case where a broadcasting expression has an out-of-order aliasing RHS and LHS. So the unaliasing machinery watches out for this: instead of returning a type-unstable value in such a case, it errors if it detects that copy(A) does not return a typeof(A).
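You can see that fallback chain directly. A minimal sketch (Wrap here is just a renamed, simplified copy of the wrap2 type from the question, without the abstract supertype or bounds-check macros):

```julia
# A wrapper that implements the AbstractArray interface but does NOT
# specialize Base.similar.
struct Wrap{T,N} <: AbstractArray{T,N}
    v::Array{T,N}
end
Base.size(w::Wrap) = size(w.v)
Base.getindex(w::Wrap, i::Int) = w.v[i]

w = Wrap([1, 2, 3])
c = copy(w)   # generic copy allocates via similar, whose fallback returns an Array
typeof(c)     # a plain Vector{Int}, not a Wrap, so typeof(copy(A)) != typeof(A)
```

That type mismatch between A and copy(A) is exactly what the unaliasing machinery checks for before it trusts the preventative copy.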

julia> abstract type aw{T,N} <: AbstractArray{T,N} end
       struct wrap2{T,N} <:  aw{T,N}
           v::Array{T,N}
       end
       Base.size(w::wrap2)=size(w.v)
       @inline Base.getindex(w::wrap2,i)=(@boundscheck checkbounds(w,i);@inbounds w.v[i])
       @inline Base.setindex!(w::wrap2,v,i)=(@boundscheck checkbounds(w,i);@inbounds w.v[i]=v)

julia> A = wrap2([1,2,3])
3-element wrap2{Int64,1}:
 1
 2
 3

julia> A .= @view A[end:-1:1]
ERROR: ArgumentError: an array of type `wrap2` shares memory with another argument and must
make a preventative copy of itself in order to maintain consistent semantics,
but `copy(A)` returns a new array of type `Array{Int64,1}`. To fix, implement:
    `Base.unaliascopy(A::wrap2)::typeof(A)`

So that’s why, without a similar definition, unaliascopy throws an error and Julia infers that everything that follows is unreachable. The expression you’re benchmarking, x .= x .* 1.0, does not have this sort of out-of-order aliasing, but the check happens at run time, so there is still a branch for that situation.
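As the error message suggests, one way to fix the failing broadcast is to implement Base.unaliascopy so the preventative copy preserves the wrapper type. A sketch (the abstract supertype and bounds-check macros from the question are omitted for brevity, and the wrapper definitions are repeated so the snippet runs on its own):

```julia
struct wrap2{T,N} <: AbstractArray{T,N}
    v::Array{T,N}
end
Base.size(w::wrap2) = size(w.v)
Base.getindex(w::wrap2, i::Int) = w.v[i]
Base.setindex!(w::wrap2, x, i::Int) = (w.v[i] = x)

# The fix suggested by the error message: the unaliasing copy must
# return the same type as its argument.
Base.unaliascopy(w::wrap2) = wrap2(copy(w.v))

A = wrap2([1, 2, 3])
A .= @view A[end:-1:1]   # now succeeds instead of throwing
```

Defining the similar method from the question would also make copy(A) type-stable and so satisfy the check, but as the original post shows, that route can perturb the optimization being benchmarked.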

Now as for why this leads to better optimization: I imagine there’s some subsequent dead-code elimination going on that allows Julia/LLVM to see the operation a bit more clearly, but I’ve not dug in further. Often these sorts of complicated O(1) transformations lie on the knife’s edge of feasibility, and small perturbations to the code can throw a wrench into things.