Using mapreduce on GPU with CUDA.jl

Hello,

I’m trying to get acquainted with GPU programming using CUDA.jl, and to start off I’m sticking to array operations to keep things simple.

In this MWE, the GPU version will not compile because of “dynamic function invocation”. I’m stumped as to what I need to change to make this work. Do I need to write a kernel function? Or declare types in a specific way? Any help appreciated.

Code:

using CUDA
CUDA.versioninfo()

m = [1 2 3; 4 5 6]
d_m = CuArray(m)

pow2(x) = x .^ 2

mapreduce(pow2, +, m; dims=1) # CPU version works

mapreduce(pow2, +, d_m; dims=1) # This won't compile

Output:

CUDA runtime 12.3, artifact installation
CUDA driver 12.3
NVIDIA driver 546.33.0

CUDA libraries: 
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+545.36

Julia packages: 
- CUDA: 5.1.1
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.10.0-rc3
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 3070 (sm_86, 6.909 GiB / 8.000 GiB available)

2×3 Matrix{Int64}:
 1  2  3
 4  5  6

2×3 CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}:
 1  2  3
 4  5  6

pow2 (generic function with 1 method)

1×3 Matrix{Int64}:
 17  29  45

ERROR: InvalidIRError: compiling MethodInstance for CUDA.partial_mapreduce_grid(::typeof(identity), ::typeof(+), ::Int64, ::CartesianIndices{…}, ::CartesianIndices{…}, ::Val{…}, ::CuDeviceArray{…}, ::Base.Broadcast.Broadcasted{…}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to julia.new_gc_frame)
Reason: unsupported call to an unknown function (call to julia.push_gc_frame)
Reason: unsupported call to an unknown function (call to julia.get_gc_frame_slot)
Reason: unsupported dynamic function invocation (call to _broadcast_getindex_evalf(f::Tf, args::Vararg{Any, N}) where {Tf, N} @ Base.Broadcast broadcast.jl:709)
Stacktrace:
 [1] _broadcast_getindex
   @ ./broadcast.jl:682
 [2] getindex
   @ ./broadcast.jl:636
 [3] _map_getindex
   @ ~/.julia/packages/CUDA/YIj5X/src/mapreduce.jl:85
 [4] partial_mapreduce_grid
   @ ~/.julia/packages/CUDA/YIj5X/src/mapreduce.jl:122
Reason: unsupported dynamic function invocation (call to +)
Stacktrace:
 [1] partial_mapreduce_grid
   @ ~/.julia/packages/CUDA/YIj5X/src/mapreduce.jl:122
Reason: unsupported dynamic function invocation (call to reduce_block)
Stacktrace:
 [1] partial_mapreduce_grid
   @ ~/.julia/packages/CUDA/YIj5X/src/mapreduce.jl:126
Reason: unsupported dynamic function invocation (call to convert)
Stacktrace:
 [1] setindex!
   @ ~/.julia/packages/CUDA/YIj5X/src/device/array.jl:166
 [2] setindex!
   @ ~/.julia/packages/CUDA/YIj5X/src/device/array.jl:178
 [3] partial_mapreduce_grid
   @ ~/.julia/packages/CUDA/YIj5X/src/mapreduce.jl:130
Reason: unsupported call to an unknown function (call to julia.pop_gc_frame)
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/validation.jl:147
  [2] macro expansion
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:440 [inlined]
  [3] macro expansion
    @ GPUCompiler ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:439 [inlined]
  [5] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/utils.jl:92
  [6] emit_llvm
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/utils.jl:86 [inlined]
  [7] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:129
  [8] codegen
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:110 [inlined]
  [9] 
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:106
 [10] compile
    @ ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:98 [inlined]
 [11] #1075
    @ ~/.julia/packages/CUDA/YIj5X/src/compiler/compilation.jl:247 [inlined]
 [12] JuliaContext(f::CUDA.var"#1075#1077"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
 [13] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/YIj5X/src/compiler/compilation.jl:246
 [14] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:125
 [15] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:103
 [16] macro expansion
    @ ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:382 [inlined]
 [17] macro expansion
    @ ./lock.jl:267 [inlined]
 [18] cufunction(f::typeof(CUDA.partial_mapreduce_grid), tt::Type{Tuple{…}}; kwargs::@Kwargs{})
    @ CUDA ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:377
 [19] cufunction
    @ ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:374 [inlined]
 [20] macro expansion
    @ ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:104 [inlined]
 [21] mapreducedim!(f::typeof(identity), op::typeof(+), R::CuArray{…}, A::Base.Broadcast.Broadcasted{…}; init::Int64)
    @ CUDA ~/.julia/packages/CUDA/YIj5X/src/mapreduce.jl:234
 [22] mapreducedim!
    @ ~/.julia/packages/CUDA/YIj5X/src/mapreduce.jl:169 [inlined]
 [23] _mapreduce(f::typeof(pow2), op::typeof(+), As::CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}; dims::Int64, init::Nothing)
    @ GPUArrays ~/.julia/packages/GPUArrays/dAUOE/src/host/mapreduce.jl:67
 [24] mapreduce(::Function, ::Function, ::CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}; dims::Int64, init::Nothing)
    @ GPUArrays ~/.julia/packages/GPUArrays/dAUOE/src/host/mapreduce.jl:28
 [25] top-level scope
    @ Untitled-1:13
Some type information was truncated. Use `show(err)` to see complete types.

Your map function needs to operate on scalars, i.e. x ^ 2 rather than x .^ 2. Nested vectorized operations aren’t allowed inside GPU array abstractions such as mapreduce (as a consequence, functions like eachslice aren’t supported either).
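
For reference, here’s a minimal sketch of the fix, reusing the d_m from your MWE (the expected result assumes the same 2×3 matrix):

using CUDA

d_m = CuArray([1 2 3; 4 5 6])

# Map function defined on a single element; each thread applies it
# elementwise, so no broadcast machinery ends up inside the kernel.
pow2(x) = x ^ 2

mapreduce(pow2, +, d_m; dims=1)  # 1×3 CuArray: 17  29  45

# Equivalently, sum with a scalar function:
sum(abs2, d_m; dims=1)

No hand-written kernel or type annotations are needed; the stock mapreduce works once the mapped function is scalar.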
