Raising a value to a power inside a CUDA.jl kernel

Hi!

I’m using Julia v1.4.2 with CUDA.jl and I have an issue with a CUDA kernel. I’m trying to raise the elements of an array to a power, as in the example below:


using CUDA

T = [0.0 0.666667 0.0; -0.333333 -0.25 0.333333; -0.5 -0.25 0.3333]

d_T = CuArray{Float32}(T)

function kernel(T)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    @cushow T[i,i]^4
    return
end

@cuda threads = 3 kernel(d_T)

but it fails with the following error:

LLVM error: Cannot select: 0x6933878: f32 = fpow 0x6933c20, ConstantFP:f32<4.000000e+00>, math.jl:884 @[ intfuncs.jl:265 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/intrinsics/output.jl:230 @[ In[16]:4 ] ] ]
  0x6933c20: f32,ch = load<(load 4 from %ir.31, !tbaa !128, addrspace 1)> 0x52cc2a8, 0x69330c0, undef:i64, /home/oscar/.julia/packages/LLVM/T8ZBA/src/interop/base.jl:53 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:115 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:115 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:245 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/array.jl:80 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/array.jl:99 @[ abstractarray.jl:1003 @[ abstractarray.jl:980 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/intrinsics/output.jl:230 @[ In[16]:4 ] ] ] ] ] ] ] ] ]
    0x69330c0: i64,ch = CopyFromReg 0x52cc2a8, Register:i64 %0, /home/oscar/.julia/packages/LLVM/T8ZBA/src/interop/base.jl:53 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:115 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:115 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:245 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/array.jl:80 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/array.jl:99 @[ abstractarray.jl:1003 @[ abstractarray.jl:980 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/intrinsics/output.jl:230 @[ In[16]:4 ] ] ] ] ] ] ] ] ]
      0x6932d80: i64 = Register %0
    0x6932ff0: i64 = undef
  0x6933ae8: f32 = ConstantFP<4.000000e+00>
In function: _Z18julia_kernel_2006913CuDeviceArrayI7Float32Li1E6GlobalES_IS0_Li1ES1_ES_IS0_Li1ES1_ES_IS0_Li2ES1_ES_IS0_Li2ES1_ES_I5Int32Li1ES1_ES_I5Int64Li1ES1_ES_IS0_Li1ES1_E

Stacktrace:
 [1] handle_error(::Cstring) at /home/oscar/.julia/packages/LLVM/T8ZBA/src/core/context.jl:105
 [2] macro expansion at /home/oscar/.julia/packages/LLVM/T8ZBA/src/util.jl:109 [inlined]
 [3] LLVMTargetMachineEmitToMemoryBuffer(::LLVM.TargetMachine, ::LLVM.Module, ::LLVM.API.LLVMCodeGenFileType, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/oscar/.julia/packages/LLVM/T8ZBA/lib/libLLVM_h.jl:3512
 [4] emit(::LLVM.TargetMachine, ::LLVM.Module, ::LLVM.API.LLVMCodeGenFileType) at /home/oscar/.julia/packages/LLVM/T8ZBA/src/targetmachine.jl:43
 [5] mcgen at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/mcgen.jl:73 [inlined]
 [6] macro expansion at /home/oscar/.julia/packages/TimerOutputs/dVnaw/src/TimerOutput.jl:206 [inlined]
 [7] macro expansion at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/driver.jl:254 [inlined]
 [8] macro expansion at /home/oscar/.julia/packages/TimerOutputs/dVnaw/src/TimerOutput.jl:206 [inlined]
 [9] codegen(::Symbol, ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/driver.jl:250
 [10] compile(::Symbol, ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/driver.jl:39
 [11] compile at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/driver.jl:35 [inlined]
 [12] _cufunction(::GPUCompiler.FunctionSpec{typeof(kernel),Tuple{CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Int32,1,CUDA.AS.Global},CuDeviceArray{Int64,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global}}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/oscar/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:310
 [13] _cufunction at /home/oscar/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:304 [inlined]
 [14] check_cache(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(kernel),Tuple{CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Int32,1,CUDA.AS.Global},CuDeviceArray{Int64,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:24
 [15] kernel at ./In[16]:2 [inlined]
 [16] cached_compilation(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(kernel),Tuple{CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Int32,1,CUDA.AS.Global},CuDeviceArray{Int64,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:0
 [17] cached_compilation at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:44 [inlined]
 [18] cufunction(::typeof(kernel), ::Type{Tuple{CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Int32,1,CUDA.AS.Global},CuDeviceArray{Int64,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global}}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/oscar/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:298
 [19] cufunction(::typeof(kernel), ::Type{Tuple{CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Int32,1,CUDA.AS.Global},CuDeviceArray{Int64,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global}}}) at /home/oscar/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:293
 [20] top-level scope at /home/oscar/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:109
 [21] top-level scope at In[17]:1

I thought the problem was the power operation itself, but a kernel like the following works:


using CUDA

function kernel()
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    @cushow Float32(-0.25)^4
    return
end

@cuda threads = 3 kernel()

So, do you know what I’m doing wrong or what could be the problem?

Thanks a lot!

Does it work with CUDA.pow(T[i,i], 4) instead of T[i,i]^4?
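
For reference, here is a minimal sketch of that suggestion applied to the kernel from the question. It assumes a CUDA.jl release that still provides the `CUDA.pow` device intrinsic, as the version in this thread does; newer releases may handle `^` in kernels directly, in which case the workaround is unnecessary. The exponent is written as `4f0` so that both arguments are `Float32`. That is a choice made for this sketch, not something the thread specifies. The `CUDA.functional()` guard is also an addition, so the script does nothing on a machine without a usable GPU.

```julia
using CUDA

# Same 3x3 matrix as in the question, copied to the GPU as Float32.
T = [0.0 0.666667 0.0; -0.333333 -0.25 0.333333; -0.5 -0.25 0.3333]

function kernel(T)
    # Global 1-based thread index; with threads = 3 this is 1, 2, 3.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Call the device intrinsic instead of Base's ^, which is what
    # triggered the "Cannot select: fpow" error in the question.
    @cushow CUDA.pow(T[i, i], 4f0)
    return
end

# Only launch when a working GPU is available (guard added for this sketch).
if CUDA.functional()
    d_T = CuArray{Float32}(T)
    @cuda threads = 3 kernel(d_T)
    synchronize()  # wait for the kernel so the @cushow output is flushed
end
```

A plausible reason the second kernel in the question compiles is that `Float32(-0.25)^4` involves only constants and may be folded away at compile time, so no `fpow` instruction reaches the PTX back end. This is an educated guess rather than something confirmed in the thread.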


Great, thanks a lot!
