Hi!
I’m using Julia v1.4.2 with CUDA.jl and I have an issue with a CUDA kernel. I’m trying to raise the elements of an array to a power, as in the example below:
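using CUDA

T = [0.0 0.666667 0.0; -0.333333 -0.25 0.333333; -0.5 -0.25 0.3333]
d_T = CuArray{Float32}(T)  # copy the matrix to the GPU as Float32

function kernel(T)
    # global 1-based thread index
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # print the i-th diagonal element raised to the 4th power
    @cushow T[i, i]^4
    return
end

@cuda threads = 3 kernel(d_T)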
but it fails with the following error:
LLVM error: Cannot select: 0x6933878: f32 = fpow 0x6933c20, ConstantFP:f32<4.000000e+00>, math.jl:884 @[ intfuncs.jl:265 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/intrinsics/output.jl:230 @[ In[16]:4 ] ] ]
0x6933c20: f32,ch = load<(load 4 from %ir.31, !tbaa !128, addrspace 1)> 0x52cc2a8, 0x69330c0, undef:i64, /home/oscar/.julia/packages/LLVM/T8ZBA/src/interop/base.jl:53 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:115 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:115 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:245 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/array.jl:80 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/array.jl:99 @[ abstractarray.jl:1003 @[ abstractarray.jl:980 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/intrinsics/output.jl:230 @[ In[16]:4 ] ] ] ] ] ] ] ] ]
0x69330c0: i64,ch = CopyFromReg 0x52cc2a8, Register:i64 %0, /home/oscar/.julia/packages/LLVM/T8ZBA/src/interop/base.jl:53 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:115 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:115 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/pointer.jl:245 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/array.jl:80 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/array.jl:99 @[ abstractarray.jl:1003 @[ abstractarray.jl:980 @[ /home/oscar/.julia/packages/CUDA/7vLVC/src/device/intrinsics/output.jl:230 @[ In[16]:4 ] ] ] ] ] ] ] ] ]
0x6932d80: i64 = Register %0
0x6932ff0: i64 = undef
0x6933ae8: f32 = ConstantFP<4.000000e+00>
In function: _Z18julia_kernel_2006913CuDeviceArrayI7Float32Li1E6GlobalES_IS0_Li1ES1_ES_IS0_Li1ES1_ES_IS0_Li2ES1_ES_IS0_Li2ES1_ES_I5Int32Li1ES1_ES_I5Int64Li1ES1_ES_IS0_Li1ES1_E
Stacktrace:
[1] handle_error(::Cstring) at /home/oscar/.julia/packages/LLVM/T8ZBA/src/core/context.jl:105
[2] macro expansion at /home/oscar/.julia/packages/LLVM/T8ZBA/src/util.jl:109 [inlined]
[3] LLVMTargetMachineEmitToMemoryBuffer(::LLVM.TargetMachine, ::LLVM.Module, ::LLVM.API.LLVMCodeGenFileType, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/oscar/.julia/packages/LLVM/T8ZBA/lib/libLLVM_h.jl:3512
[4] emit(::LLVM.TargetMachine, ::LLVM.Module, ::LLVM.API.LLVMCodeGenFileType) at /home/oscar/.julia/packages/LLVM/T8ZBA/src/targetmachine.jl:43
[5] mcgen at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/mcgen.jl:73 [inlined]
[6] macro expansion at /home/oscar/.julia/packages/TimerOutputs/dVnaw/src/TimerOutput.jl:206 [inlined]
[7] macro expansion at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/driver.jl:254 [inlined]
[8] macro expansion at /home/oscar/.julia/packages/TimerOutputs/dVnaw/src/TimerOutput.jl:206 [inlined]
[9] codegen(::Symbol, ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/driver.jl:250
[10] compile(::Symbol, ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/driver.jl:39
[11] compile at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/driver.jl:35 [inlined]
[12] _cufunction(::GPUCompiler.FunctionSpec{typeof(kernel),Tuple{CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Int32,1,CUDA.AS.Global},CuDeviceArray{Int64,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global}}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/oscar/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:310
[13] _cufunction at /home/oscar/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:304 [inlined]
[14] check_cache(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(kernel),Tuple{CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Int32,1,CUDA.AS.Global},CuDeviceArray{Int64,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:24
[15] kernel at ./In[16]:2 [inlined]
[16] cached_compilation(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(kernel),Tuple{CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Int32,1,CUDA.AS.Global},CuDeviceArray{Int64,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:0
[17] cached_compilation at /home/oscar/.julia/packages/GPUCompiler/pCBTA/src/cache.jl:44 [inlined]
[18] cufunction(::typeof(kernel), ::Type{Tuple{CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Int32,1,CUDA.AS.Global},CuDeviceArray{Int64,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global}}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/oscar/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:298
[19] cufunction(::typeof(kernel), ::Type{Tuple{CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Int32,1,CUDA.AS.Global},CuDeviceArray{Int64,1,CUDA.AS.Global},CuDeviceArray{Float32,1,CUDA.AS.Global}}}) at /home/oscar/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:293
[20] top-level scope at /home/oscar/.julia/packages/CUDA/7vLVC/src/compiler/execution.jl:109
[21] top-level scope at In[17]:1
I thought the problem was the power operation itself, but a kernel like the following, which raises a constant to the same power, works:
using CUDA

function kernel()
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @cushow Float32(-0.25)^4
    return
end

@cuda threads = 3 kernel()
So, do you know what I’m doing wrong, or what the problem could be?
Thanks a lot!