Error when mapping a CuVector{UnitRange} to index another gpu Array

ogoid · October 27, 2021, 2:52am

I’m running into an error while trying to use a CuVector{UnitRange} as an index to map another CuArray. I’m not sure if it’s a bug or not.

I think I can explain it better with this example:

using CUDA
CUDA.allowscalar(false)

# example: sum only a part of an array
rangesum(x, r::UnitRange) = sum(x[r])

# broadcast version
rangesum(x, rr::AbstractVector) = map(r -> rangesum(x, r), rr)

x = collect(1:100) |> cu
r = [1:10, 33:37, 50:80]

# this works fine, but does a normal cpu map
rangesum(x, r)

# but fails if we convert r into a gpu vector
rangesum(x, cu(r))

The last line throws:

ERROR: InvalidIRError: compiling kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to print_to_string(xs...) in Base at strings/io.jl:124)

Stacktrace

Stacktrace:
  [1] string
    @ ./strings/io.jl:174
  [2] throw_checksize_error
    @ ./multidimensional.jl:881
  [3] _unsafe_getindex
    @ ./multidimensional.jl:845
  [4] _getindex
    @ ./multidimensional.jl:832
  [5] getindex
    @ ./abstractarray.jl:1170
  [6] rangesum
    @ ./REPL[3]:2
  [7] #1
    @ ./REPL[4]:2
  [8] _broadcast_getindex_evalf
    @ ./broadcast.jl:648
  [9] _broadcast_getindex
    @ ./broadcast.jl:621
 [10] getindex
    @ ./broadcast.jl:575
 [11] broadcast_kernel
    @ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:59
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
  [1] Array
    @ ./boot.jl:448
  [2] Array
    @ ./boot.jl:457
  [3] similar
    @ ./abstractarray.jl:750
  [4] similar
    @ ./abstractarray.jl:740
  [5] _unsafe_getindex
    @ ./multidimensional.jl:844
  [6] _getindex
    @ ./multidimensional.jl:832
  [7] getindex
    @ ./abstractarray.jl:1170
  [8] rangesum
    @ ./REPL[3]:2
  [9] #1
    @ ./REPL[4]:2
 [10] _broadcast_getindex_evalf
    @ ./broadcast.jl:648
 [11] _broadcast_getindex
    @ ./broadcast.jl:621
 [12] getindex
    @ ./broadcast.jl:575
 [13] broadcast_kernel
    @ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:59
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/9rK1I/src/validation.jl:111
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/9rK1I/src/driver.jl:333 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/9rK1I/src/driver.jl:331 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/9rK1I/src/utils.jl:62
  [6] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:326
  [7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/9rK1I/src/cache.jl:89
  [8] cufunction(f::GPUArrays.var"#broadcast_kernel#17", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:297
  [9] cufunction(f::GPUArrays.var"#broadcast_kernel#17", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}})
    @ CUDA ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:291
 [10] macro expansion
    @ ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:102 [inlined]
 [11] #launch_heuristic#234
    @ ~/.julia/packages/CUDA/Xt3hr/src/gpuarrays.jl:17 [inlined]
 [12] copyto!
    @ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:65 [inlined]
 [13] copyto!
    @ ./broadcast.jl:936 [inlined]
 [14] copy
    @ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:47 [inlined]
 [15] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, var"#1#2"{CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}}, Tuple{CuArray{UnitRange{Int64}, 1, CUDA.Mem.DeviceBuffer}}})
    @ Base.Broadcast ./broadcast.jl:883
 [16] map(::Function, ::CuArray{UnitRange{Int64}, 1, CUDA.Mem.DeviceBuffer})
    @ GPUArrays ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:90
 [17] rangesum(x::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, rr::CuArray{UnitRange{Int64}, 1, CUDA.Mem.DeviceBuffer})
    @ Main ./REPL[4]:2
 [18] top-level scope
    @ REPL[8]:2
 [19] top-level scope
    @ ~/.julia/packages/CUDA/Xt3hr/src/initialization.jl:52

My intention with this is to do the entire map on the GPU, in order to use the result in further GPU computations. Shouldn’t this be possible?

fedoroff · October 27, 2021, 6:56am

Why do you convert the range into array? Instead of r = [1:10] you can simply write r = 1:10 and then use it in both cases without any conversion to CuArray. If you need more complicated indexing pattern, you can use CartesianIndices.

ogoid · October 27, 2021, 1:30pm

Because I’ll receive a list of ranges in order to do computations in segments of the source array. Something like r = [1:10, 12:13, 17:21] etc. I’ll update my example and take a look at CartesianIndices.

ogoid · October 27, 2021, 1:56pm

Actually, I just noticed it works if we do the operation on a view instead of a slice:

julia> using CUDA; CUDA.allowscalar(false)

julia> test(x, r::UnitRange) = sum(view(x, r))
test (generic function with 1 method)

julia> test(x, rr::AbstractVector) = map(r->test(x,r), rr)
test (generic function with 2 methods)

julia> x = collect(1:10000) |> cu;

julia> r = [1:10, 20:100, 130:150]
3-element Vector{UnitRange{Int64}}:
 1:10
 20:100
 130:150

julia> test(x, r)
3-element Vector{Int64}:
   55
 4860
 2940

julia> test(x, cu(r))
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
   55
 4860
 2940

I suppose this is a bug?

fedoroff · October 27, 2021, 2:09pm

With CartesianIndices you can do the following

r1 = CartesianIndices((1:10,))
r2 = CartesianIndices((12:13,))
r3 = CartesianIndices((17:21,))
r = vcat(r1, r2, r3)

Then you can use r as a simple range.

ogoid · October 27, 2021, 2:14pm

This seems to flatten the indices into a single range. My intention is to apply the operation once per range. I don’t think CartesianIndices is necessary here.

maleadt · October 27, 2021, 4:56pm

Just some missing functionality: Indexing a CuDeviceArray with a UnitRange isn’t implemented, and triggers GPU-incompatible functionality. AFAIK view with UnitRange isn’t specialized either, but the generic functionality from Base apparently isn’t doing anything GPU incompatible.

ogoid · October 27, 2021, 9:57pm

How hard would it be to add support for this?

Unfortunately, it seems even the view method does not work with more complex operations involving array allocations/copying.

Thanks a lot.

ogoid · October 28, 2021, 9:27pm

I’ve opened a feature request issue in Github for this:

https://github.com/JuliaGPU/CUDA.jl/issues/1222

Topic		Replies	Views
CuArrays.jl errors with indexing / CartesianIndices General Usage gpu , indexing , cuda	1	603	July 24, 2019
Range indexing with multidimensional arrays? New to Julia	2	403	August 15, 2018
Error of view on CuArrays with discrete indices GPU	0	783	January 17, 2019
CUDA CPU allocations with range General Usage cuda	5	842	January 13, 2022
Performing vector operations on a specific range of a CuArray GPU question , gpu , cuda	2	668	July 7, 2020

Error when mapping a CuVector{UnitRange} to index another gpu Array

Related topics