InvalidIRError when running AcceleratedKernels.sum on a GPU SubArray (CuArray view)

AcceleratedKernels.sum fails for me on a GPU SubArray (a view into a CuArray).
Base sum works on both the full array and the view, and AK.sum works on the full array, but AK.sum on the view throws an InvalidIRError.

Questions

  • Is this a known issue with AcceleratedKernels when using SubArray views of CuArray?
  • Is there a workaround?

If anyone knows whether this is expected behavior or a known limitation, or can recommend a workaround, I would appreciate it.

Reproducible Example

julia> import AcceleratedKernels as AK
julia> using CUDA

julia> A = CUDA.zeros(100,100,100);
julia> A_sub = view(A, 2:99, 2:99, 2:99);

julia> sum(A)
0.0f0

julia> sum(A_sub)
0.0f0

julia> AK.sum(A)
0.0f0

julia> AK.sum(A_sub)
warning: linking module flags 'Dwarf Version': IDs have conflicting values ('i32 4' from globals with 'i32 2' from start)
ERROR: InvalidIRError: compiling MethodInstance for AcceleratedKernels.gpu__mapreduce_block!(::KernelAbstractions.CompilerMetadata{…}, ::SubArray{…}, ::CuDeviceVector{…}, ::typeof(identity), ::typeof(+), ::Float32) resulted in invalid LLVM IR
Reason: unsupported call to a lazy-initialized function (call to ijl_rethrow)
Stacktrace:
 [1] rethrow
   @ ./error.jl:71
 [2] multiple call sites
   @ unknown:0
Reason: unsupported call to an external C function
Reason: unsupported call to a lazy-initialized function (call to jl_genericmemory_to_string)
Stacktrace:
 [1] unsafe_takestring
   @ ./strings/string.jl:84
 [2] bin
   @ ./intfuncs.jl:837
 [3] #string#403
   @ ./intfuncs.jl:994
 [4] multiple call sites
   @ unknown:0
Reason: unsupported call to an external C function
Stacktrace:
 [1] _string_n
   @ ./strings/string.jl:109
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] bin
   @ ./intfuncs.jl:816
 [4] #string#403
   @ ./intfuncs.jl:994
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to a lazy-initialized function (call to jl_string_to_genericmemory)
Stacktrace:
 [1] unsafe_wrap
   @ ./strings/string.jl:120
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] bin
   @ ./intfuncs.jl:816
 [4] #string#403
   @ ./intfuncs.jl:994
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to a lazy-initialized function (call to jl_genericmemory_to_string)
Stacktrace:
 [1] unsafe_takestring
   @ ./strings/string.jl:84
 [2] hex
   @ ./intfuncs.jl:942
 [3] #string#403
   @ ./intfuncs.jl:1003
 [4] multiple call sites
   @ unknown:0
Reason: unsupported call to an external C function
Stacktrace:
 [1] _string_n
   @ ./strings/string.jl:109
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] hex
   @ ./intfuncs.jl:927
 [4] #string#403
   @ ./intfuncs.jl:1003
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to a lazy-initialized function (call to jl_string_to_genericmemory)
Stacktrace:
 [1] unsafe_wrap
   @ ./strings/string.jl:120
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] hex
   @ ./intfuncs.jl:927
 [4] #string#403
   @ ./intfuncs.jl:1003
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to an unknown function (call to julia.get_pgcstack)
Stacktrace:
 [1] print_to_string
   @ ./strings/io.jl:151
 [2] string
   @ ./strings/io.jl:193
 [3] _throw_dmrs
   @ ./reshapedarray.jl:225
 [4] throw_boundserror
   @ ~/.julia/packages/CUDA/g94EB/src/device/quirks.jl:53
 [5] checkbounds
   @ ./abstractarray.jl:699
 [6] getindex
   @ ./subarray.jl:315
 [7] macro expansion
   @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:23
 [8] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to an unknown function (call to ijl_excstack_state)
Stacktrace:
 [1] print
   @ ./strings/io.jl:34
 [2] print_to_string
   @ ./strings/io.jl:151
 [3] string
   @ ./strings/io.jl:193
 [4] _throw_dmrs
   @ ./reshapedarray.jl:225
 [5] throw_boundserror
   @ ~/.julia/packages/CUDA/g94EB/src/device/quirks.jl:53
 [6] checkbounds
   @ ./abstractarray.jl:699
 [7] getindex
   @ ./subarray.jl:315
 [8] macro expansion
   @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:23
 [9] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to an unknown function (call to julia.except_enter)
Stacktrace:
 [1] print
   @ ./strings/io.jl:34
 [2] print_to_string
   @ ./strings/io.jl:151
 [3] string
   @ ./strings/io.jl:193
 [4] _throw_dmrs
   @ ./reshapedarray.jl:225
 [5] throw_boundserror
   @ ~/.julia/packages/CUDA/g94EB/src/device/quirks.jl:53
 [6] checkbounds
   @ ./abstractarray.jl:699
 [7] getindex
   @ ./subarray.jl:315
 [8] macro expansion
   @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:23
 [9] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to an unknown function (call to ijl_pop_handler_noexcept)
Stacktrace:
 [1] print
   @ ./strings/io.jl:35
 [2] print_to_string
   @ ./strings/io.jl:151
 [3] string
   @ ./strings/io.jl:193
 [4] _throw_dmrs
   @ ./reshapedarray.jl:225
 [5] throw_boundserror
   @ ~/.julia/packages/CUDA/g94EB/src/device/quirks.jl:53
 [6] checkbounds
   @ ./abstractarray.jl:699
 [7] getindex
   @ ./subarray.jl:315
 [8] macro expansion
   @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:23
 [9] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to an unknown function (call to ijl_pop_handler)
Stacktrace:
 [1] print
   @ ./strings/io.jl:34
 [2] print_to_string
   @ ./strings/io.jl:151
 [3] string
   @ ./strings/io.jl:193
 [4] _throw_dmrs
   @ ./reshapedarray.jl:225
 [5] throw_boundserror
   @ ~/.julia/packages/CUDA/g94EB/src/device/quirks.jl:53
 [6] checkbounds
   @ ./abstractarray.jl:699
 [7] getindex
   @ ./subarray.jl:315
 [8] macro expansion
   @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:23
 [9] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to a lazy-initialized function (call to jl_genericmemory_to_string)
Stacktrace:
 [1] unsafe_takestring
   @ ./strings/string.jl:84
 [2] oct
   @ ./intfuncs.jl:851
 [3] #string#403
   @ ./intfuncs.jl:997
 [4] multiple call sites
   @ unknown:0
Reason: unsupported call to an external C function
Stacktrace:
 [1] _string_n
   @ ./strings/string.jl:109
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] oct
   @ ./intfuncs.jl:843
 [4] #string#403
   @ ./intfuncs.jl:997
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to a lazy-initialized function (call to jl_string_to_genericmemory)
Stacktrace:
 [1] unsafe_wrap
   @ ./strings/string.jl:120
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] oct
   @ ./intfuncs.jl:843
 [4] #string#403
   @ ./intfuncs.jl:997
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to a lazy-initialized function (call to jl_genericmemory_to_string)
Stacktrace:
 [1] unsafe_takestring
   @ ./strings/string.jl:84
 [2] dec
   @ ./intfuncs.jl:921
 [3] #string#403
   @ ./intfuncs.jl:1000
 [4] multiple call sites
   @ unknown:0
Reason: unsupported call to an external C function
Stacktrace:
 [1] _string_n
   @ ./strings/string.jl:109
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] dec
   @ ./intfuncs.jl:918
 [4] #string#403
   @ ./intfuncs.jl:1000
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to a lazy-initialized function (call to jl_string_to_genericmemory)
Stacktrace:
 [1] unsafe_wrap
   @ ./strings/string.jl:120
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] dec
   @ ./intfuncs.jl:918
 [4] #string#403
   @ ./intfuncs.jl:1000
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to an external C function
Stacktrace:
 [1] _string_n
   @ ./strings/string.jl:109
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] _similar_data
   @ ./iobuffer.jl:296
 [4] _resize!
   @ ./iobuffer.jl:544
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to a lazy-initialized function (call to jl_string_to_genericmemory)
Stacktrace:
 [1] unsafe_wrap
   @ ./strings/string.jl:120
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] _similar_data
   @ ./iobuffer.jl:296
 [4] _resize!
   @ ./iobuffer.jl:544
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to a lazy-initialized function (call to jl_genericmemory_to_string)
Stacktrace:
 [1] unsafe_takestring
   @ ./strings/string.jl:84
 [2] _base
   @ ./intfuncs.jl:967
 [3] #string#403
   @ ./intfuncs.jl:1005
 [4] multiple call sites
   @ unknown:0
Reason: unsupported call to an external C function
Stacktrace:
 [1] _string_n
   @ ./strings/string.jl:109
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] _base
   @ ./intfuncs.jl:954
 [4] #string#403
   @ ./intfuncs.jl:1005
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to a lazy-initialized function (call to jl_string_to_genericmemory)
Stacktrace:
 [1] unsafe_wrap
   @ ./strings/string.jl:120
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] _base
   @ ./intfuncs.jl:954
 [4] #string#403
   @ ./intfuncs.jl:1005
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to an unknown function (call to julia.new_gc_frame)
Stacktrace:
 [1] multiple call sites
   @ unknown:0
Reason: unsupported call to an unknown function (call to julia.get_pgcstack)
Stacktrace:
 [1] multiple call sites
   @ unknown:0
Reason: unsupported call to an unknown function (call to julia.push_gc_frame)
Stacktrace:
 [1] multiple call sites
   @ unknown:0
Reason: unsupported call to an unknown function (call to ijl_excstack_state)
Stacktrace:
 [1] print
   @ ./strings/io.jl:34
 [2] multiple call sites
   @ unknown:0
Reason: unsupported call to an unknown function (call to julia.except_enter)
Stacktrace:
 [1] print
   @ ./strings/io.jl:34
 [2] multiple call sites
   @ unknown:0
Reason: unsupported call to an unknown function (call to julia.get_gc_frame_slot)
Stacktrace:
 [1] multiple call sites
   @ unknown:0
Reason: unsupported call to an unknown function (call to ijl_pop_handler_noexcept)
Stacktrace:
 [1] print
   @ ./strings/io.jl:35
 [2] multiple call sites
   @ unknown:0
Reason: unsupported call to an unknown function (call to julia.pop_gc_frame)
Stacktrace:
 [1] multiple call sites
   @ unknown:0
Reason: unsupported call to an unknown function (call to ijl_pop_handler)
Stacktrace:
 [1] print
   @ ./strings/io.jl:34
 [2] multiple call sites
   @ unknown:0
Reason: unsupported call to an external C function
Stacktrace:
 [1] _string_n
   @ ./strings/string.jl:109
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] _similar_data
   @ ./iobuffer.jl:296
 [4] ensureroom_reallocate
   @ ./iobuffer.jl:619
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to a lazy-initialized function (call to jl_string_to_genericmemory)
Stacktrace:
 [1] unsafe_wrap
   @ ./strings/string.jl:120
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] _similar_data
   @ ./iobuffer.jl:296
 [4] ensureroom_reallocate
   @ ./iobuffer.jl:619
 [5] multiple call sites
   @ unknown:0
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
Stacktrace:
  [1] getindex
    @ ./tuple.jl:33
  [2] iterate
    @ ./tuple.jl:74
  [3] print_to_string
    @ ./strings/io.jl:147
  [4] string
    @ ./strings/io.jl:193
  [5] _throw_dmrs
    @ ./reshapedarray.jl:225
  [6] throw_boundserror
    @ ~/.julia/packages/CUDA/g94EB/src/device/quirks.jl:53
  [7] checkbounds
    @ ./abstractarray.jl:699
  [8] getindex
    @ ./subarray.jl:315
  [9] macro expansion
    @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:23
 [10] gpu__mapreduce_block!
    @ ./none:0
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
Stacktrace:
  [1] getindex
    @ ./tuple.jl:33
  [2] iterate
    @ ./tuple.jl:74
  [3] print_to_string
    @ ./strings/io.jl:152
  [4] string
    @ ./strings/io.jl:193
  [5] _throw_dmrs
    @ ./reshapedarray.jl:225
  [6] throw_boundserror
    @ ~/.julia/packages/CUDA/g94EB/src/device/quirks.jl:53
  [7] checkbounds
    @ ./abstractarray.jl:699
  [8] getindex
    @ ./subarray.jl:315
  [9] macro expansion
    @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:23
 [10] gpu__mapreduce_block!
    @ ./none:0
Reason: unsupported call to a lazy-initialized function (call to jl_genericmemory_to_string)
Stacktrace:
 [1] String
   @ ./strings/string.jl:71
 [2] print_to_string
   @ ./strings/io.jl:153
 [3] string
   @ ./strings/io.jl:193
 [4] _throw_dmrs
   @ ./reshapedarray.jl:225
 [5] throw_boundserror
   @ ~/.julia/packages/CUDA/g94EB/src/device/quirks.jl:53
 [6] checkbounds
   @ ./abstractarray.jl:699
 [7] getindex
   @ ./subarray.jl:315
 [8] macro expansion
   @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:23
 [9] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to a lazy-initialized function (call to ijl_pchar_to_string)
Stacktrace:
 [1] String
   @ ./strings/string.jl:73
 [2] print_to_string
   @ ./strings/io.jl:153
 [3] string
   @ ./strings/io.jl:193
 [4] _throw_dmrs
   @ ./reshapedarray.jl:225
 [5] throw_boundserror
   @ ~/.julia/packages/CUDA/g94EB/src/device/quirks.jl:53
 [6] checkbounds
   @ ./abstractarray.jl:699
 [7] getindex
   @ ./subarray.jl:315
 [8] macro expansion
   @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:23
 [9] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to an external C function
Stacktrace:
  [1] _string_n
    @ ./strings/string.jl:109
  [2] StringMemory
    @ ./iobuffer.jl:167
  [3] #IOBuffer#390
    @ ./iobuffer.jl:266
  [4] GenericIOBuffer
    @ ./iobuffer.jl:245
  [5] print_to_string
    @ ./strings/io.jl:149
  [6] string
    @ ./strings/io.jl:193
  [7] _throw_dmrs
    @ ./reshapedarray.jl:225
  [8] throw_boundserror
    @ ~/.julia/packages/CUDA/g94EB/src/device/quirks.jl:53
  [9] checkbounds
    @ ./abstractarray.jl:699
 [10] getindex
    @ ./subarray.jl:315
 [11] macro expansion
    @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:23
 [12] gpu__mapreduce_block!
    @ ./none:0
Reason: unsupported call to a lazy-initialized function (call to jl_string_to_genericmemory)
Stacktrace:
  [1] unsafe_wrap
    @ ./strings/string.jl:120
  [2] StringMemory
    @ ./iobuffer.jl:167
  [3] #IOBuffer#390
    @ ./iobuffer.jl:266
  [4] GenericIOBuffer
    @ ./iobuffer.jl:245
  [5] print_to_string
    @ ./strings/io.jl:149
  [6] string
    @ ./strings/io.jl:193
  [7] _throw_dmrs
    @ ./reshapedarray.jl:225
  [8] throw_boundserror
    @ ~/.julia/packages/CUDA/g94EB/src/device/quirks.jl:53
  [9] checkbounds
    @ ./abstractarray.jl:699
 [10] getindex
    @ ./subarray.jl:315
 [11] macro expansion
    @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:23
 [12] gpu__mapreduce_block!
    @ ./none:0
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
Stacktrace:
 [1] getindex
   @ ./tuple.jl:33
 [2] iterate
   @ ./tuple.jl:74
 [3] print_to_string
   @ ./strings/io.jl:147
 [4] string
   @ ./strings/io.jl:193
 [5] SignedMultiplicativeInverse
   @ ./multinverses.jl:54
 [6] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to an unknown function (call to ijl_get_nth_field_checked)
Stacktrace:
 [1] getindex
   @ ./tuple.jl:33
 [2] iterate
   @ ./tuple.jl:74
 [3] print_to_string
   @ ./strings/io.jl:152
 [4] string
   @ ./strings/io.jl:193
 [5] SignedMultiplicativeInverse
   @ ./multinverses.jl:54
 [6] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to a lazy-initialized function (call to jl_genericmemory_to_string)
Stacktrace:
 [1] String
   @ ./strings/string.jl:71
 [2] print_to_string
   @ ./strings/io.jl:153
 [3] string
   @ ./strings/io.jl:193
 [4] SignedMultiplicativeInverse
   @ ./multinverses.jl:54
 [5] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to a lazy-initialized function (call to ijl_pchar_to_string)
Stacktrace:
 [1] String
   @ ./strings/string.jl:73
 [2] print_to_string
   @ ./strings/io.jl:153
 [3] string
   @ ./strings/io.jl:193
 [4] SignedMultiplicativeInverse
   @ ./multinverses.jl:54
 [5] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to an external C function
Stacktrace:
 [1] _string_n
   @ ./strings/string.jl:109
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] #IOBuffer#390
   @ ./iobuffer.jl:266
 [4] GenericIOBuffer
   @ ./iobuffer.jl:245
 [5] print_to_string
   @ ./strings/io.jl:149
 [6] string
   @ ./strings/io.jl:193
 [7] SignedMultiplicativeInverse
   @ ./multinverses.jl:54
 [8] gpu__mapreduce_block!
   @ ./none:0
Reason: unsupported call to a lazy-initialized function (call to jl_string_to_genericmemory)
Stacktrace:
 [1] unsafe_wrap
   @ ./strings/string.jl:120
 [2] StringMemory
   @ ./iobuffer.jl:167
 [3] #IOBuffer#390
   @ ./iobuffer.jl:266
 [4] GenericIOBuffer
   @ ./iobuffer.jl:245
 [5] print_to_string
   @ ./strings/io.jl:149
 [6] string
   @ ./strings/io.jl:193
 [7] SignedMultiplicativeInverse
   @ ./multinverses.jl:54
 [8] gpu__mapreduce_block!
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erroneous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Gp8bZ/src/validation.jl:167
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/Gp8bZ/src/driver.jl:417 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/Tracy/tYwAE/src/tracepoint.jl:163 [inlined]
  [4] emit_llvm(job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Gp8bZ/src/driver.jl:416
  [5] emit_llvm
    @ ~/.julia/packages/GPUCompiler/Gp8bZ/src/driver.jl:182 [inlined]
  [6] compile_unhooked(output::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Gp8bZ/src/driver.jl:95
  [7] compile_unhooked
    @ ~/.julia/packages/GPUCompiler/Gp8bZ/src/driver.jl:80 [inlined]
  [8] compile(target::Symbol, job::GPUCompiler.CompilerJob; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Gp8bZ/src/driver.jl:67
  [9] compile
    @ ~/.julia/packages/GPUCompiler/Gp8bZ/src/driver.jl:55 [inlined]
 [10] #compile##0
    @ ~/.julia/packages/CUDA/g94EB/src/compiler/compilation.jl:250 [inlined]
 [11] JuliaContext(f::CUDA.var"#compile##0#compile##1"{GPUCompiler.CompilerJob{…}}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Gp8bZ/src/driver.jl:34
 [12] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Gp8bZ/src/driver.jl:25
 [13] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/g94EB/src/compiler/compilation.jl:249
 [14] actual_compilation(cache::Dict{…}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{…}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Gp8bZ/src/execution.jl:245
 [15] cached_compilation(cache::Dict{…}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{…}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/Gp8bZ/src/execution.jl:159
 [16] macro expansion
    @ ~/.julia/packages/CUDA/g94EB/src/compiler/execution.jl:373 [inlined]
 [17] macro expansion
    @ ./lock.jl:376 [inlined]
 [18] cufunction(f::typeof(AcceleratedKernels.gpu__mapreduce_block!), tt::Type{…}; kwargs::@Kwargs{…})
    @ CUDA ~/.julia/packages/CUDA/g94EB/src/compiler/execution.jl:368
 [19] macro expansion
    @ ~/.julia/packages/CUDA/g94EB/src/compiler/execution.jl:112 [inlined]
 [20] (::KernelAbstractions.Kernel{…})(::SubArray{…}, ::Vararg{…}; ndrange::Tuple{…}, workgroupsize::Nothing)
    @ CUDA.CUDAKernels ~/.julia/packages/CUDA/g94EB/src/CUDAKernels.jl:127
 [21] mapreduce_1d_gpu(f::typeof(identity), op::typeof(+), src::SubArray{…}, backend::CUDABackend; init::Float32, neutral::Float32, max_tasks::Int64, min_elems::Int64, block_size::Int64, temp::Nothing, switch_below::Int64)
    @ AcceleratedKernels ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:95
 [22] mapreduce_1d_gpu
    @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/mapreduce_1d_gpu.jl:49 [inlined]
 [23] #_mapreduce_impl#83
    @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/reduce.jl:187 [inlined]
 [24] _mapreduce_impl
    @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/reduce.jl:169 [inlined]
 [25] #reduce#81
    @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/reduce.jl:81 [inlined]
 [26] reduce
    @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/reduce/reduce.jl:76 [inlined]
 [27] #sum#137
    @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/arithmetics.jl:49 [inlined]
 [28] sum
    @ ~/.julia/packages/AcceleratedKernels/AdYRJ/src/arithmetics.jl:44 [inlined]
 [29] sum(src::SubArray{Float32, 3, CuArray{…}, Tuple{…}, false})
    @ AcceleratedKernels ~/.julia/packages/AcceleratedKernels/AdYRJ/src/arithmetics.jl:44
 [30] top-level scope
    @ REPL[16]:1
Some type information was truncated. Use `show(err)` to see complete types.
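
One observation from the trace: many of the failing frames pass through getindex at subarray.jl:315, called from mapreduce_1d_gpu.jl:23, and then into the bounds-check error path. The view's type in frame [29] also ends in `false`, which is the SubArray flag for fast linear indexing. My guess is that indexing this kind of view inside the kernel is what pulls in the unsupported code, but that is an assumption and I have not confirmed it in the AcceleratedKernels source. The CPU-only sketch below uses plain `Array`s (the names `B`, `B_sub` and `B_contig` are only for illustration) to show the distinction I mean:

```julia
# CPU-only illustration with Base arrays; no CUDA or AcceleratedKernels needed.
B = zeros(Float32, 100, 100, 100)

# Same index pattern as A_sub above: every dimension is trimmed, so the
# selected elements are not contiguous in memory.
B_sub = view(B, 2:99, 2:99, 2:99)

# Only the last dimension is trimmed, so the selected elements stay contiguous.
B_contig = view(B, :, :, 2:99)

# IndexStyle reports how each view prefers to be indexed.
style_sub = IndexStyle(B_sub)        # IndexCartesian()
style_contig = IndexStyle(B_contig)  # IndexLinear()

# Linear indexing works on the CPU for both views; for B_sub it goes
# through a linear-to-Cartesian conversion internally.
first_sub = B_sub[1]
first_contig = B_contig[1]

println(style_sub, " ", style_contig)
```

This only shows that the two kinds of view differ in indexing style on the CPU. It does not prove that this difference is the cause of the GPU compilation failure.
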

Version Info

julia> versioninfo()
Julia Version 1.12.1
Commit ba1e628ee49 (2025-10-17 13:02 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × 13th Gen Intel(R) Core(TM) i7-13700K
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, alderlake)
  GC: Built with stock GC
Threads: 24 default, 1 interactive, 24 GC (on 24 virtual cores)
Environment:
  JULIA_PKG_USE_CLI_GIT = true
  JULIA_CONDAPKG_BACKEND = Null
  JULIA_PYTHONCALL_EXE = pvpython
  JULIA_EDITOR = code
  JULIA_VSCODE_REPL = 1

julia> CUDA.versioninfo()
CUDA toolchain: 
- runtime 13.0, local installation
- driver 581.42.0 for 13.0
- compiler 13.0

CUDA libraries: 
- CUBLAS: 13.1.0
- CURAND: 10.4.0
- CUFFT: 12.0.0
- CUSOLVER: 12.0.4
- CUSPARSE: 12.6.3
- CUPTI: 2025.3.1 (API 13.0.1)
- NVML: 13.0.0+580.95.2

Julia packages: 
- CUDA: 5.9.3
- CUDA_Driver_jll: 13.0.2+0
- CUDA_Compiler_jll: 0.3.0+0
- CUDA_Runtime_jll: 0.19.2+0
- CUDA_Runtime_Discovery: 1.0.0

Toolchain:
- Julia: 1.12.1
- LLVM: 18.1.7

Preferences:
- CUDA_Runtime_jll.version: 13.0
- CUDA_Runtime_jll.local: true

1 device:
  0: NVIDIA GeForce RTX 4070 Ti (sm_89, 8.172 GiB / 11.994 GiB available)

pkg> st AcceleratedKernels
Status `Project.toml.arch`
  [6a4ca0a5] AcceleratedKernels v0.4.3
  [052768ef] CUDA v5.9.3