I’m running into an error while trying to use a CuVector{UnitRange} to index another CuArray inside a map, and I’m not sure whether it’s a bug. I think this example explains it best:
```julia
using CUDA
CUDA.allowscalar(false)

# example: sum only a part of an array
rangesum(x, r::UnitRange) = sum(x[r])

# broadcast version: apply rangesum over a vector of ranges
rangesum(x, rr::AbstractVector) = map(r -> rangesum(x, r), rr)

x = collect(1:100) |> cu
r = [1:10, 33:37, 50:80]

# this works fine, but does a normal CPU map
rangesum(x, r)

# but it fails if we convert r into a GPU vector
rangesum(x, cu(r))
```
The last line throws:
```
ERROR: InvalidIRError: compiling kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to print_to_string(xs...) in Base at strings/io.jl:124)
Stacktrace:
[1] string
@ ./strings/io.jl:174
[2] throw_checksize_error
@ ./multidimensional.jl:881
[3] _unsafe_getindex
@ ./multidimensional.jl:845
[4] _getindex
@ ./multidimensional.jl:832
[5] getindex
@ ./abstractarray.jl:1170
[6] rangesum
@ ./REPL[3]:2
[7] #1
@ ./REPL[4]:2
[8] _broadcast_getindex_evalf
@ ./broadcast.jl:648
[9] _broadcast_getindex
@ ./broadcast.jl:621
[10] getindex
@ ./broadcast.jl:575
[11] broadcast_kernel
@ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:59
Reason: unsupported call through a literal pointer (call to )
Stacktrace:
[1] Array
@ ./boot.jl:448
[2] Array
@ ./boot.jl:457
[3] similar
@ ./abstractarray.jl:750
[4] similar
@ ./abstractarray.jl:740
[5] _unsafe_getindex
@ ./multidimensional.jl:844
[6] _getindex
@ ./multidimensional.jl:832
[7] getindex
@ ./abstractarray.jl:1170
[8] rangesum
@ ./REPL[3]:2
[9] #1
@ ./REPL[4]:2
[10] _broadcast_getindex_evalf
@ ./broadcast.jl:648
[11] _broadcast_getindex
@ ./broadcast.jl:621
[12] getindex
@ ./broadcast.jl:575
[13] broadcast_kernel
@ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:59
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, args::LLVM.Module)
@ GPUCompiler ~/.julia/packages/GPUCompiler/9rK1I/src/validation.jl:111
[2] macro expansion
@ ~/.julia/packages/GPUCompiler/9rK1I/src/driver.jl:333 [inlined]
[3] macro expansion
@ ~/.julia/packages/TimerOutputs/SSeq1/src/TimerOutput.jl:252 [inlined]
[4] macro expansion
@ ~/.julia/packages/GPUCompiler/9rK1I/src/driver.jl:331 [inlined]
[5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler ~/.julia/packages/GPUCompiler/9rK1I/src/utils.jl:62
[6] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:326
[7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/9rK1I/src/cache.jl:89
[8] cufunction(f::GPUArrays.var"#broadcast_kernel#17", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:297
[9] cufunction(f::GPUArrays.var"#broadcast_kernel#17", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Int64, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, var"#1#2"{CuDeviceVector{Int64, 1}}, Tuple{Base.Broadcast.Extruded{CuDeviceVector{UnitRange{Int64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}})
@ CUDA ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:291
[10] macro expansion
@ ~/.julia/packages/CUDA/Xt3hr/src/compiler/execution.jl:102 [inlined]
[11] #launch_heuristic#234
@ ~/.julia/packages/CUDA/Xt3hr/src/gpuarrays.jl:17 [inlined]
[12] copyto!
@ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:65 [inlined]
[13] copyto!
@ ./broadcast.jl:936 [inlined]
[14] copy
@ ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:47 [inlined]
[15] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, var"#1#2"{CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}}, Tuple{CuArray{UnitRange{Int64}, 1, CUDA.Mem.DeviceBuffer}}})
@ Base.Broadcast ./broadcast.jl:883
[16] map(::Function, ::CuArray{UnitRange{Int64}, 1, CUDA.Mem.DeviceBuffer})
@ GPUArrays ~/.julia/packages/GPUArrays/3sW6s/src/host/broadcast.jl:90
[17] rangesum(x::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, rr::CuArray{UnitRange{Int64}, 1, CUDA.Mem.DeviceBuffer})
@ Main ./REPL[4]:2
[18] top-level scope
@ REPL[8]:2
[19] top-level scope
@ ~/.julia/packages/CUDA/Xt3hr/src/initialization.jl:52
```
My intention is to run the entire map on the GPU, so the result can be used in further GPU computations. Shouldn’t this be possible?
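For reference, the two `Reason:` lines seem to point at `x[r]` itself: the `similar`/`Array` frames suggest it tries to materialize the slice as a CPU `Array`, which device code cannot allocate. Below is a sketch of the workaround I’ve been considering (my own untested guess, and `rangesum_noalloc` is just a name I made up for illustration): accumulate over the range in a plain loop, so nothing needs to be allocated inside the kernel.

```julia
# sketch: avoid materializing x[r] inside the kernel by summing in a loop,
# so no device-side Array allocation should be needed (untested assumption)
function rangesum_noalloc(x, r::UnitRange)
    acc = zero(eltype(x))
    for i in r
        acc += x[i]  # plain scalar indexing inside device code
    end
    return acc
end

# same broadcast pattern as above, mapping over the vector of ranges
rangesum_noalloc(x, rr::AbstractVector) = map(r -> rangesum_noalloc(x, r), rr)

rangesum_noalloc(x, cu(r))  # hopefully compiles where sum(x[r]) does not
```

Even if that sidesteps the allocation, I’d still like to know whether the original `sum(x[r])` formulation is supposed to work on the GPU.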