Performance issue with broadcasting of custom array wrapper wrapping a CuArray

Sleort · December 31, 2019, 7:43am

Background

I’m trying to make my own array (wrapper) type that can be used both on CPUs and GPUs. Here’s a minimal example:

struct MyWrapper{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end

# Basic AbstractArray methods:
for f ∈ (:size, :getindex, :setindex!)
    @eval Base.$f(x::MyWrapper, args...) = $f(x.data, args...)
end

# Broadcasting: 
# (based on the ArrayAndChar example at https://docs.julialang.org/en/latest/manual/interfaces/#man-interfaces-broadcasting-1 )
import Base.Broadcast: ArrayStyle, Broadcasted
Base.BroadcastStyle(::Type{<:MyWrapper}) = ArrayStyle{MyWrapper}()
function Base.similar(bc::Broadcasted{ArrayStyle{MyWrapper}}, ::Type{T}) where {T}
    x = find_mw(bc)
    MyWrapper(similar(x.data, T, axes(bc)))
end

find_mw(bc::Broadcasted) = find_mw(bc.args)
find_mw(args::Tuple) = find_mw(find_mw(args[1]), Base.tail(args))
find_mw(x) = x
find_mw(a::MyWrapper, rest) = a
find_mw(::Any, rest) = find_mw(rest)

The issue

Now broadcasting of MyWrapper seems to work on my CPU (although with an additional allocation and somewhat slower than the “unwrapped” array…)

using BenchmarkTools
x = rand(1000000)
y = MyWrapper(x)
@btime 1 .* $x # 712.590 μs (2 allocations: 7.63 MiB)
@btime 1 .* $y # 780.263 μs (3 allocations: 7.63 MiB)

However, if I attempt to wrap a CuArray and do the same on my GPU, I get

using CuArrays
x = cu(rand(1000000))
y = MyWrapper(x)
@btime 1 .* $x # 5.639 μs (65 allocations: 2.27 KiB)
@btime 1 .* $y # 7.387 s (2000009 allocations: 183.11 MiB)

which doesn’t look good for MyWrapper!

I suspect I’m missing something “trivial” (like overloading copyto! for MyWrapper?), but haven’t really been able to figure out exactly what. Suggestions?

Additional information (although I doubt that matters):

julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

(v1.3) pkg> st CuArrays
    Status `~/.julia/environments/v1.3/Project.toml`
  [79e6a3ab] Adapt v1.0.0
  [fa961155] CEnum v0.2.0
  [3895d2a7] CUDAapi v2.1.0
  [c5f51814] CUDAdrv v5.0.1
  [be33ccc6] CUDAnative v2.7.0
  [3a865a2d] CuArrays v1.6.0
  [864edb3b] DataStructures v0.17.6
  [872c559c] NNlib v0.6.2

maleadt · December 31, 2019, 10:08am

Did you see any warning about scalar iteration? You should always disable that first using CuArrays.allowscalar(false). Know that array wrappers are tricky and will make it much harder to dispatch to GPU-optimized implementations. With Broadcast it’s possible to fix this by setting-up the proper array style, but other methods (think fill, reshape, view) will now dispatch to the slow AbstractArray fallbacks and not the fast GPU implementations.

Sleort · December 31, 2019, 10:30am

Hmm… I do indeed get a “scalar getindex is disallowed” error when setting CuArrays.allowscalar(false), even when not displaying the output. (I got a warning about scalar iteration, but assumed, incorrectly, that it was only due to showing in Juno…)

Is the takeaway message then that one should in general, if possible, avoid making ones own array types if the program is to run efficiently (also) on GPUs?

maleadt · December 31, 2019, 12:43pm

That would be the easiest solution, yes. We have to jump through quite some hoops to make common wrappers (Transpose, Adjoint, Reshape, SubArray, etc) work somewhat decently, so you’d need to do the same if you want to work with your own wrappers.

datnamer · December 31, 2019, 6:23pm

Can this be improved with cassette? Or is there a longer term plan for a general solution to make wrapped arrays work on GPU?

maleadt · December 31, 2019, 6:41pm

There’s been some developments, Code using Flux slow on GPU - #10 by maleadt, but I’m not sure if anybody’s actively working on it. For now, Adapt.jl plus some quirks in CuArrays.jl/GPUArrays.jl covers most uses (except for custom wrappers, and using multiple wrappers).

Topic		Replies	Views
Best way to deal with broadcasting of intrinsics on CuArrays? General Usage	4	985	February 7, 2019
cuArrays vs CUDANative GPU	3	1385	November 14, 2018
CuArrays not working as expected when broadcasting a function Machine Learning gpu	7	1465	December 1, 2019
CUDA CPU allocations with range General Usage cuda	5	844	January 13, 2022
Passing a wrapped array to a kernel GPU	2	557	May 27, 2020

Performance issue with broadcasting of custom array wrapper wrapping a CuArray

Background

The issue

Related topics