Hello all.
In the following code, `sum` returns `ERROR: Scalar indexing is disallowed`. How should `sum` be dispatched, or what else should I do?
A = OffsetArray(CUDA.rand(10), -4:5);
a = @views A[-1:2];
sum(a)
A simple implementation is shown below, but I think there must be a better one.
Base.sum(A::OffsetArray{T,N,CuArray{T,N,M}}) where {T,N,M} = sum(parent(A))
function Base.sum(A::SubArray{T,N,OffsetArray{T,N,CuArray{T,N,M}}}) where {T,N,M}
    indices = A.indices          # the view's indices, expressed in the offset axes
    offsets = parent(A).offsets  # per-dimension offsets of the OffsetArray
    # shift the view's indices back into the parent CuArray's 1-based axes
    sum(@view parent(parent(A))[CartesianIndices(ntuple(n -> indices[n] .- offsets[n], N))])
end
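With these definitions in place, the original example should dispatch to the GPU-friendly path. A sketch (note that `.offsets` is an internal field of OffsetArrays, so this may break across package versions):

```julia
using CUDA, OffsetArrays

A = OffsetArray(CUDA.rand(10), -4:5)  # CuArray with offset axes -4:5
a = @views A[-1:2]                    # SubArray of the OffsetArray
sum(a)                                # reduces on the GPU, no scalar indexing
```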
I think OffsetArray does not compose well with CUDA out of the box; even broadcasting fails:
julia> A = OffsetArray(CUDA.rand(10), -4:5);
julia> sum(A)
ERROR: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.
If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] errorscalar(op::String)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:155
[3] _assertscalar(op::String, behavior::GPUArraysCore.ScalarIndexing)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:128
[4] assertscalar(op::String)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:116
[5] getindex
@ ~/.julia/packages/GPUArrays/bbZD0/src/host/indexing.jl:50 [inlined]
[6] getindex
@ ~/.julia/packages/OffsetArrays/hwmnB/src/OffsetArrays.jl:438 [inlined]
[7] _mapreduce(f::typeof(identity), op::typeof(Base.add_sum), ::IndexLinear, A::OffsetVector{Float32, CuArray{…}})
@ Base ./reduce.jl:438
[8] _mapreduce_dim
@ ./reducedim.jl:365 [inlined]
[9] mapreduce
@ ./reducedim.jl:357 [inlined]
[10] _sum
@ ./reducedim.jl:1015 [inlined]
[11] _sum
@ ./reducedim.jl:1014 [inlined]
[12] sum(a::OffsetVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}})
@ Base ./reducedim.jl:1010
[13] top-level scope
@ REPL[15]:1
[14] top-level scope
@ ~/.julia/packages/CUDA/htRwP/src/initialization.jl:206
Some type information was truncated. Use `show(err)` to see complete types.
julia> A .+ A
ERROR: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.
If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] errorscalar(op::String)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:155
[3] _assertscalar(op::String, behavior::GPUArraysCore.ScalarIndexing)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:128
[4] assertscalar(op::String)
@ GPUArraysCore ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:116
[5] getindex
@ ~/.julia/packages/GPUArrays/bbZD0/src/host/indexing.jl:50 [inlined]
[6] getindex
@ ~/.julia/packages/OffsetArrays/hwmnB/src/OffsetArrays.jl:438 [inlined]
[7] _broadcast_getindex
@ ./broadcast.jl:675 [inlined]
[8] _getindex
@ ./broadcast.jl:705 [inlined]
[9] _broadcast_getindex
@ ./broadcast.jl:681 [inlined]
[10] getindex
@ ./broadcast.jl:636 [inlined]
[11] macro expansion
@ ./broadcast.jl:1004 [inlined]
[12] macro expansion
@ ./simdloop.jl:77 [inlined]
[13] copyto!
@ ./broadcast.jl:1003 [inlined]
[14] copyto!
@ ./broadcast.jl:956 [inlined]
[15] copy
@ ./broadcast.jl:928 [inlined]
[16] materialize(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{…}, Nothing, typeof(+), Tuple{…}})
@ Base.Broadcast ./broadcast.jl:903
[17] top-level scope
@ REPL[16]:1
[18] top-level scope
@ ~/.julia/packages/CUDA/htRwP/src/initialization.jl:206
Some type information was truncated. Use `show(err)` to see complete types.
julia> A
10-element OffsetArray(::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, -4:5) with eltype Float32 with indices -4:5:
0.19389287
0.71536046
0.8175033
0.57097757
0.058391646
0.72871023
0.6042697
0.88648033
0.76349247
0.9775343
Here are some hints, if they might help:
Yeah, Julia isn’t currently great with respect to wrapped arrays and preserving functionality from the contained array type where needed. I typically link to Use with multiple wrappers · Issue #21 · JuliaGPU/Adapt.jl · GitHub for this, and this would need some work in Base to resolve (e.g., AbstractWrappedArray, or another approach for wrapped array identification · Issue #51910 · JuliaLang/julia · GitHub). We try to support Base’s array wrappers as much as possible, and for other types like OffsetArray, a package extension that fixes or overrides dispatch where needed could be added.
If you simply want compatibility (i.e., without triggering scalar indexing errors, but also without executing on the GPU) you can use unified memory, see CUDA.jl 5.4: Memory management mayhem ⋅ JuliaGPU
Thank you both. So, as it stands, we need to define these roundabout wrappers ourselves.
That is not the case. Can you share what you are running into?
I tried
A = CUDA.zeros(10)
A = cu(A, unified=true)
Maybe `CuArray{Float64,1,CUDA.UnifiedMemory}` works, but I cannot try it yet.
Additionally, it seems that unified memory is allocated in CPU memory. Does this cause any performance issues?
The `cu` function is meant to be used with CPU inputs; it’s a user-friendly constructor.
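So a unified-memory array would be constructed from a CPU array instead. A sketch, assuming a CUDA.jl version (5.4+) where `cu` accepts a `unified` keyword:

```julia
using CUDA

A_cpu = zeros(Float32, 10)
A = cu(A_cpu; unified=true)  # CuArray backed by unified memory
```

With unified memory, falling back to CPU iteration (e.g. for an unsupported wrapper) avoids the scalar-indexing error, at the cost of not running the reduction on the GPU.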
Thank you. I understand.