Error when mapping a CuVector{UnitRange} to index another gpu Array

I’m running into an error while trying to use a CuVector{UnitRange} as an index to map another CuArray. I’m not sure if it’s a bug or not.

I think I can explain it better with this example:

using CUDA
CUDA.allowscalar(false)

# example: sum only a part of an array
rangesum(x, r::UnitRange) = sum(x[r])

# broadcast version
rangesum(x, rr::AbstractVector) = map(r -> rangesum(x, r), rr)

x = collect(1:100) |> cu
r = [1:10, 33:37, 50:80]

# this works fine, but does a normal cpu map
rangesum(x, r)

# but fails if we convert r into a gpu vector
rangesum(x, cu(r))

The last line throws:

ERROR: InvalidIRError: compiling kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to print_to_string(xs...) in Base at strings/io.jl:124)
Stacktrace
Stacktrace:
  [1] string
    @ ./strings/io.jl:174
  [2] throw_checksize_error
    @ ./multidimensional.jl:881
  [3] _unsafe_getindex
    @ ./multidimensional.jl:845
  [4] _getindex
    @ ./multidimensional.jl:832
  [5] getindex
    @ ./abstractarray.jl:1170
  [6] rangesum
    @ ./REPL[3]:2
  [7] #1
    @ ./REPL[4]:2
  [8] _broadcast_getindex_evalf
    @ ./broadcast.jl:648
  [9] _broadcast_getindex
    @ ./broadcast.jl:621
 [10] getindex
    @ ./broadcast.jl:575
 [11] broadcast_kernel
    @ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:59
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
  [1] Array
    @ ./boot.jl:448
  [2] Array
    @ ./boot.jl:457
  [3] similar
    @ ./abstractarray.jl:750
  [4] similar
    @ ./abstractarray.jl:740
  [5] _unsafe_getindex
    @ ./multidimensional.jl:844
  [6] _getindex
    @ ./multidimensional.jl:832
  [7] getindex
    @ ./abstractarray.jl:1170
  [8] rangesum
    @ ./REPL[3]:2
  [9] #1
    @ ./REPL[4]:2
 [10] _broadcast_getindex_evalf
    @ ./broadcast.jl:648
 [11] _broadcast_getindex
    @ ./broadcast.jl:621
 [12] getindex
    @ ./broadcast.jl:575
 [13] broadcast_kernel
    @ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:59
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/9rK1I/src/validation.jl:111
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/9rK1I/src/driver.jl:333 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/9rK1I/src/driver.jl:331 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/9rK1I/src/utils.jl:62
  [6] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:326
  [7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/9rK1I/src/cache.jl:89
  [8] cufunction(f::GPUArrays.var"#broadcast_kernel#17", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:297
  [9] cufunction(f::GPUArrays.var"#broadcast_kernel#17", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}})
    @ CUDA ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:291
 [10] macro expansion
    @ ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:102 [inlined]
 [11] #launch_heuristic#234
    @ ~/.julia/packages/CUDA/Xt3hr/src/gpuarrays.jl:17 [inlined]
 [12] copyto!
    @ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:65 [inlined]
 [13] copyto!
    @ ./broadcast.jl:936 [inlined]
 [14] copy
    @ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:47 [inlined]
 [15] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, var"#1#2"{CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}}, Tuple{CuArray{UnitRange{Int64}, 1, CUDA.Mem.DeviceBuffer}}})
    @ Base.Broadcast ./broadcast.jl:883
 [16] map(::Function, ::CuArray{UnitRange{Int64}, 1, CUDA.Mem.DeviceBuffer})
    @ GPUArrays ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:90
 [17] rangesum(x::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, rr::CuArray{UnitRange{Int64}, 1, CUDA.Mem.DeviceBuffer})
    @ Main ./REPL[4]:2
 [18] top-level scope
    @ REPL[8]:2
 [19] top-level scope
    @ ~/.julia/packages/CUDA/Xt3hr/src/initialization.jl:52

My intention with this is to do the entire map on the GPU, in order to use the result in further GPU computations. Shouldn’t this be possible?

Why do you convert the range into array? Instead of r = [1:10] you can simply write r = 1:10 and then use it in both cases without any conversion to CuArray. If you need more complicated indexing pattern, you can use CartesianIndices.

Because I’ll receive a list of ranges in order to do computations in segments of the source array. Something like r = [1:10, 12:13, 17:21] etc. I’ll update my example and take a look at CartesianIndices.

Actually, I just noticed it works if we do the operation on a view instead of a slice:

julia> using CUDA; CUDA.allowscalar(false)

julia> test(x, r::UnitRange) = sum(view(x, r))
test (generic function with 1 method)

julia> test(x, rr::AbstractVector) = map(r->test(x,r), rr)
test (generic function with 2 methods)

julia> x = collect(1:10000) |> cu;

julia> r = [1:10, 20:100, 130:150]
3-element Vector{UnitRange{Int64}}:
 1:10
 20:100
 130:150

julia> test(x, r)
3-element Vector{Int64}:
   55
 4860
 2940

julia> test(x, cu(r))
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
   55
 4860
 2940

I suppose this is a bug?

With CartesianIndices you can do the following

r1 = CartesianIndices((1:10,))
r2 = CartesianIndices((12:13,))
r3 = CartesianIndices((17:21,))
r = vcat(r1, r2, r3)

Then you can use r as a simple range.

This seems to flatten the indices into a single range. My intention is to apply the operation once per range. I don’t think CartesianIndices is necessary here.

Just some missing functionality: Indexing a CuDeviceArray with a UnitRange isn’t implemented, and triggers GPU-incompatible functionality. AFAIK view with UnitRange isn’t specialized either, but the generic functionality from Base apparently isn’t doing anything GPU incompatible.

How hard would it be to add support for this?

Unfortunately, it seems even the view method does not work with more complex operations involving array allocations/copying.

Thanks a lot.

I’ve opened a feature request issue in Github for this: