Base function in Cuda kernels

cudanative
cuda

#1

Hi there,

I hope this is not a duplicate. If so, please fill free to shit this post.
I’m relatively new to Julia. I currently working on a numerical project which I want to implement on the GPU. Therefore I’m using the Julia CUDA libraries.

I want to create some initial date using my Nvidia-GPU , so I execute the following lines of code:

using CuArrays
using CUDAnative
using CUDAdrv
using AbstractFFTs, FFTW
using LinearAlgebra

#custom kernels for GPU---------------------------------------------------------
function kernel_initial_psi(a,N,k)
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    k = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    l = (blockIdx().z - 1) * blockDim().z + threadIdx().z
    if j <= size(a,1)
        f(x) = CUDAnative.exp(x)
        a[j , k, l]= f(-im*((j-N/2)*k[1]+ (k-N/2)*k[2]+(l-N/2)*k[3]))

    end
    return nothing
end
#Paramters----------------------------------------------------------------------
    N=4 # size of spatial grid
    k=[1.0,1.0,1.0] #inital wave direction
    psi_int=CuArrays.cuzeros(ComplexF64, (N,N,N))
#Threads and Blocks-------------------------------------------------------------
dev = CuDevice(0)
max_threads = attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)
max = min(N, max_threads)

threads_x = min(max_threads, size(psi_int,3))
blocks = ceil.(Int, (size(psi_int,3), size(psi_int,4)) ./ threads)

#initial data-------------------------------------------------------------------

    @cuda blocks=blocks threads=threads_x  kernel_initial_psi(psi_int, N, k)

and obtain the following error message:


ERROR: LoadError: InvalidIRError: compiling kernel_initial_psi(CuDeviceArray{Complex{Float64},3,CUDAnative.AS.Global}, Int64, Array{Float64,1}) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_apply_generic)
Stacktrace:
 [1] f at C:\Users\Noobie\Documents\TESTJULIAGPU\inital_data.jl:14
 [2] kernel_initial_psi at C:\Users\Noobie\Documents\TESTJULIAGPU\inital_data.jl:15
Stacktrace:
 [1] check_ir(::CUDAnative.CompilerContext, ::LLVM.Module) at C:\Users\Noobie\.julia\packages\CUDAnative\Mdd3w\src\compiler\validation.jl:77
 [2] compile(::CUDAnative.CompilerContext) at C:\Users\Noobie\.julia\packages\CUDAnative\Mdd3w\src\compiler\driver.jl:90
 [3] #cufunction#110(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(cufunction), ::typeof(kernel_initial_psi), ::Type{Tuple{CuDeviceArray{Complex{Float64},3,CUDAnative.AS.Global},Int64,Array{Float64,1}}}) at C:\Users\Noobie\.julia\packages\CUDAnative\Mdd3w\src\compiler\driver.jl:38
 [4] cufunction(::Function, ::Type) at C:\Users\Noobie\.julia\packages\CUDAnative\Mdd3w\src\execution.jl:237
 [5] top-level scope at C:\Users\Noobie\.julia\packages\CUDAnative\Mdd3w\src\execution.jl:205
 [6] top-level scope at gcutils.jl:87
 [7] top-level scope at C:\Users\Noobie\.julia\packages\CUDAnative\Mdd3w\src\execution.jl:202
in expression starting at C:\Users\Noobie\Documents\TESTJULIAGPU\inital_data.jl:41

As I search through the community posts I found that the was a problem with calling Base. functions in CUDA kernels in the past. So I tried out to use the following workaround by defining “f(x) = CUDAnative.exp(x)” but it didn’t work either. My question is if I missed something during my coding or there is some other way to get this working.

Thanks


#2

Please quote your code, this will help us help you!


#3

Thanks, this is what I was looking for. But the main problem still exists :frowning:


#4

hiya,
CUDAnative.exp just wraps the nvidia libdevice functions, and there is no complex version of this function. But you can define it!

define

@inline function CUDAnative.exp(x::ComplexF64)
    scale = CUDAnative.exp( x.re )
    return ComplexF64( scale * CUDAnative.cos(x.im), scale * CUDAnative.sin(x.im) )
end

before defining your kernel function.

The ability to have kernels dispatch on functions for whatever types we define makes CUDAnative freaking awesome. We just have to define the functions! Unfortunately, the error messages are not always so helpful…


#5

Hi,
Your answer makes some things clear now, first thanks for that. But when I define the new function CUDAnative.exp for the ComplexF64 data type I get a new error message that’s driving me a little bit nuts. I mean that is of course caused by my lack of experience with Julia, which is actually “a very funny” to code language, but I was not able to find something relevant for my problem by searching for this error message.

ERROR: LoadError: CUDA error: an illegal instruction was encountered (code #715, ERROR_ILLEGAL_INSTRUCTION)
Stacktrace:
 [1] #alloc#3(::CUDAdrv.Mem.CUmem_attach, ::Function, ::Int64, ::Bool) at C:\Users\Noobie\.julia\packages\CUDAdrv\JWljj\src\base.jl:147
 [2] alloc at C:\Users\Noobie\.julia\packages\CUDAdrv\JWljj\src\memory.jl:157 [inlined] (repeats 2 times)
 [3] macro expansion at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\memory.jl:248 [inlined]
 [4] macro expansion at .\util.jl:213 [inlined]
 [5] (::getfield(CuArrays, Symbol("##17#18")))() at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\memory.jl:247
 [6] lock(::getfield(CuArrays, Symbol("##17#18")), ::ReentrantLock) at .\lock.jl:101
 [7] alloc(::Int64) at .\util.jl:213
 [8] CuArray{Complex{Float64},3}(::UndefInitializer, ::Tuple{Int64,Int64,Int64}) at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:33
 [9] CuArray{Complex{Float64},N} where N(::UndefInitializer, ::Tuple{Int64,Int64,Int64}) at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:40
 [10] cuzeros(::Type, ::Tuple{Int64,Int64,Int64}) at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:186
 [11] top-level scope at none:0
in expression starting at C:\Users\Noobie\Documents\TESTJULIAGPU\inital_data.jl:31

#6

ah, ok you have more problems with your code

  1. you use k for both initial wave direction, and the index (blockIdx().y - 1) * blockDim().y + threadIdx().y
    I think this is the main problem

  2. you have a line
    blocks = ceil.(Int, (size(psi_int,3), size(psi_int,4)) ./ threads)
    but threads is not defined in your code block

also, you need blocks to be a 3-tuple. it looks like it will only be a 2-tuple in your code.

One fact of life with CUDAnative right now is cryptic errors from the compiler. Or perhaps, no error when you first run the code, which puts the GPU in a bad state, and then further calls will cause crazy behavior.

If you can’t find the bugs from staring at your code, one options is to write a snippet which encompasses the failing code in CUDA C++ and run it through nvcc or llvm to get a good compiler message. trying to take an array index from an int will definitely error!


#7

for completeness, I’ll include a snippet which doesn’t crash :slight_smile:

@inline function CUDAnative.exp(x::ComplexF64)
    scale = CUDAnative.exp( x.re )
    return ComplexF64( scale * CUDAnative.cos(x.im), scale * CUDAnative.sin(x.im) )
end

function kernel_initial_psi(a,N,k_in)
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    k = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    l = (blockIdx().z - 1) * blockDim().z + threadIdx().z
    if j <= size(a,1)
        a[j , k, l] = CUDAnative.exp(-im*((j-N/2)*k_in[1]+ (k-N/2)*k_in[2]+(l-N/2)*k_in[3]))
    end
    return nothing
end

N=4 # size of spatial grid
k=CuArray([1.0,1.0,1.0]) #inital wave direction
psi_int=CuArrays.cuzeros(ComplexF64, (N,N,N))
blocks = (1,1,1)
threads = (4,4,4)

@cuda blocks=blocks threads=threads  kernel_initial_psi(psi_int, N, k)

# print psi_int to stdout
println(Array(psi_int))

#8

Maybe one last question concerning this problem, I tried your code and for N>10 (of course I edited threads) it works anymore and gives me the error message

ERROR: LoadError: CUDA error: invalid argument (code #1, ERROR_INVALID_VALUE)
Stacktrace:
 [1] #_cudacall#24(::Tuple{Int64,Int64,Int64}, ::Tuple{Int64,Int64,Int64}, ::Int64, ::CuStream, ::typeof(CUDAdrv._cudacall), ::CuFunction, ::Type{Tuple{CuDeviceArray{Complex{Float64},3,CUDAnative.AS.Global},Int64,CuDeviceArray{Float64,1,CUDAnative.AS.Global},Float64}}, ::Tuple{CuDeviceArray{Complex{Float64},3,CUDAnative.AS.Global},Int64,CuDeviceArray{Float64,1,CUDAnative.AS.Global},Float64}) at /home/pallmer/.julia/packages/CUDAdrv/JWljj/src/base.jl:147
 [2] (::getfield(CUDAdrv, Symbol("#kw##_cudacall")))(::NamedTuple{(:blocks, :threads),Tuple{Tuple{Int64,Int64,Int64},Tuple{Int64,Int64,Int64}}}, ::typeof(CUDAdrv._cudacall), ::CuFunction, ::Type, ::Tuple{CuDeviceArray{Complex{Float64},3,CUDAnative.AS.Global},Int64,CuDeviceArray{Float64,1,CUDAnative.AS.Global},Float64}) at ./none:0
 [3] macro expansion at /home/pallmer/.julia/packages/CUDAdrv/JWljj/src/execution.jl:146 [inlined]
 [4] #call#112(::Base.Iterators.Pairs{Symbol,Tuple{Int64,Int64,Int64},Tuple{Symbol,Symbol},NamedTuple{(:blocks, :threads),Tuple{Tuple{Int64,Int64,Int64},Tuple{Int64,Int64,Int64}}}}, ::CUDAnative.Kernel{kernel_initial_psi,Tuple{CuDeviceArray{Complex{Float64},3,CUDAnative.AS.Global},Int64,CuDeviceArray{Float64,1,CUDAnative.AS.Global},Float64}}, ::CuDeviceArray{Complex{Float64},3,CUDAnative.AS.Global}, ::Int64, ::CuDeviceArray{Float64,1,CUDAnative.AS.Global}, ::Float64) at /home/pallmer/.julia/packages/CUDAnative/Mdd3w/src/execution.jl:283
 [5] (::getfield(CUDAnative, Symbol("#kw#Kernel")))(::NamedTuple{(:blocks, :threads),Tuple{Tuple{Int64,Int64,Int64},Tuple{Int64,Int64,Int64}}}, ::CUDAnative.Kernel{kernel_initial_psi,Tuple{CuDeviceArray{Complex{Float64},3,CUDAnative.AS.Global},Int64,CuDeviceArray{Float64,1,CUDAnative.AS.Global},Float64}}, ::CuDeviceArray{Complex{Float64},3,CUDAnative.AS.Global}, ::Vararg{Any,N} where N) at ./none:0
 [6] top-level scope at gcutils.jl:87
 [7] top-level scope at /home/pallmer/.julia/packages/CUDAnative/Mdd3w/src/execution.jl:202
in expression starting at /home/pallmer/Workspaces/Julia/Test_01/inital_data.jl:38

I looked up my deviceQuery

Device 0: "GeForce GTX TITAN"
  CUDA Driver Version / Runtime Version          5.5 / 4.2
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 6144 MBytes (6442254336 bytes)
MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)!
MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)!
  (14) Multiprocessors x ( -1) CUDA Cores/MP:    -14 CUDA Cores
  GPU Clock rate:                                876 MHz (0.88 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

and it seems to be far away from any critical bound (I also tested it on a GV100, and it didn’t work too.). What I found by looking at theNvidia-CUDA-website was that there is maybe a problem caused by my kernel function originating from my “grid definition” when I want to calculate the entries a[j, k, l] in the manner above. But I didn’t see the reason why this doesn’t work.

Unfortunately, at the moment I’m not able to test this on my workstation running a similar code in CUDA C++.


#9

this issue isn’t specific to Julia: you’re perhaps hitting a limit on threads per block, which is 1024. 11^3 > 1024, so that is not allowed if you only have a single block.

Personally, I like to just use one-dimensional block and thread groups, and have a little logic to partition what each block works on, just like programming a CPU. One gets better performance this way as well: see https://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf for instance

If you want to use 3-d blocks, then I suggest running a not too large number of blocks, like 8x8x8 or so. And having not too many threads, maybe also 8x8x8 . and having each thread fill multiple entries.

You can of course also do what you were trying to do with CUDAdrv.MAX_THREADS_PER_BLOCK , but you need to fix your bug there and make sure to really fill a 3-tuple on both blocks and threads