Organizing Threads and Blocks with @cuda, 3D Arrays


I have a question concerning the organization of threads and blocks when calling a custom CUDA kernel. First of all, my workstation runs a GTX Titan; here are the corresponding specs:

Device 0: "GeForce GTX TITAN"
  CUDA Driver Version / Runtime Version          5.5 / 4.2
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 6144 MBytes (6442254336 bytes)
MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)!
MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)!
  (14) Multiprocessors x ( -1) CUDA Cores/MP:    -14 CUDA Cores
  GPU Clock rate:                                876 MHz (0.88 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

The underlying task is to generate a complex 3D (64x64x64) array; as an example, let's say a simple plane wave. For that I'm using the following code:

using CuArrays
using CUDAnative
using CUDAdrv

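# Complex exponential on the device via Euler's formula:
# exp(a + im*b) = exp(a) * (cos(b) + im*sin(b))
@inline function CUDAnative.exp(x::ComplexF64)
    scale = CUDAnative.exp(x.re)
    return ComplexF64(scale * CUDAnative.cos(x.im), scale * CUDAnative.sin(x.im))
end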

function kernel_initial_psi(a, N, k_in, dx, i)
    # global (1-based) indices of this thread in the x and y directions
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    k = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if j <= size(a, 1)
        a[i, j, k] = CUDAnative.exp(-im*dx*((i-N/2)*k_in[1] + (j-N/2)*k_in[2] + (k-N/2)*k_in[3]))
    end
    return nothing
end

function kernel_twotothree(b, i, a)
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    k = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if j <= length(a)
        # copy the 2D slice a into slice i of the 3D array b
        b[i, j, k] = a[j, k]
    end
    return nothing
end


N=64 # size of spatial grid
k=CuArray([1.0,1.0,1.0]) # initial wave direction


psi_3D=CuArrays.cuzeros(ComplexF64, (N,N,N))
psi_2D=CuArrays.cuzeros(ComplexF64, (N,N))

#Threads and Blocks-------------------------------------------------------------

blocks = (2,2)
threads = (32,32)

#initial data-------------------------------------------------------------------

for i=1:N
    @cuda blocks=blocks threads=threads kernel_initial_psi(psi_2D, N, k, dx, i)
    @cuda blocks=blocks threads=threads kernel_twotothree(psi_3D, i, psi_2D)
end

In theory, the strategy is the following. To produce the 3D array I divide it into 2D slices, such that each slice has 64 x 64 entries. In my opinion, it is reasonable to subdivide each slice into four 32 x 32 tiles (i.e. use 4 blocks of 1024 threads each). To produce the initial data of one 32 x 32 tile I need exactly 1024 threads (which is, by the way, the maximum number of threads per block for this GPU). The maximum number of threads per multiprocessor is 2048, so I will need two MPs to generate the full 64 x 64 slice. The deviceQuery output above says that this GPU has 14 MPs in total, but I still get error messages like this:

ERROR: LoadError: CUDA error: an illegal instruction was encountered (code #715, ERROR_ILLEGAL_INSTRUCTION)
 [1] #upload!#10(::Bool, ::Function, ::CUDAdrv.Mem.Buffer, ::Ptr{Float64}, ::Int64, ::CuStream) at C:\Users\Noobie\.julia\packages\CUDAdrv\JWljj\src\base.jl:147
 [2] upload! at C:\Users\Noobie\.julia\packages\CUDAdrv\JWljj\src\memory.jl:229 [inlined] (repeats 2 times)
 [3] unsafe_copyto! at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:161 [inlined]
 [4] copyto!(::CuArray{Float64,1}, ::Array{Float64,1}) at C:\Users\Noobie\.julia\packages\GPUArrays\t8tJB\src\abstractarray.jl:110
 [5] CuArray{Float64,1}(::Array{Float64,1}) at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:85
 [6] CuArray(::Array{Float64,1}) at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:95
 [7] top-level scope at none:0
in expression starting at C:\Users\Noobie\Documents\TESTJULIAGPU\inital_data.jl:31

Is it possible that I missed something? Or did I mix up my threads/blocks organization? I'm relatively new to GPU programming, but I thought I had understood the underlying basics… which is obviously not the case. It would be great if someone could tell me what my mistake is here.
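
As a sanity check of the block/thread layout described above, independent of the GPU, the index arithmetic can be emulated on the CPU. This is only a sketch: `covered_indices` is a hypothetical helper name I made up for illustration and is not part of CUDAnative. It loops over the same block and thread indices that the launch with blocks = (2,2) and threads = (32,32) would produce and counts how often each (j, k) pair is hit.

```julia
# CPU-only emulation of the index formula used in the kernels.
# `covered_indices` is a hypothetical helper, not a CUDAnative function.
function covered_indices(blocks::NTuple{2,Int}, threads::NTuple{2,Int})
    hits = zeros(Int, blocks[1] * threads[1], blocks[2] * threads[2])
    for by in 1:blocks[2], bx in 1:blocks[1]          # plays the role of blockIdx().y / .x
        for ty in 1:threads[2], tx in 1:threads[1]    # plays the role of threadIdx().y / .x
            j = (bx - 1) * threads[1] + tx            # same formula as in the kernels
            k = (by - 1) * threads[2] + ty
            hits[j, k] += 1
        end
    end
    return hits
end

hits = covered_indices((2, 2), (32, 32))
println(size(hits))        # (64, 64)
println(all(hits .== 1))   # true: every (j, k) is visited exactly once
```

If this check passes, the launch configuration itself covers each 64 x 64 slice exactly once, which would suggest looking for the problem in how the arrays are indexed inside the kernels rather than in the block/thread counts.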

I know this is probably not the most efficient method to generate this data (if you have tips for improving performance, I'm open to and thankful for your comments), but I think it should at least work.

Further, I want to use the full potential of my GPU, so I want to run the kernel on all MPs. That should make it possible to calculate 7 slices of my 3D array at once. Is this possible at all?
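
To make the arithmetic behind the "7 slices" estimate explicit, here is a rough sketch using the numbers from the deviceQuery output above. It assumes one thread per array element and treats the per-MP thread limit as the only constraint; in practice register and shared memory usage may lower the number of threads that are actually resident at the same time, so this is an upper bound rather than a guarantee.

```julia
# Figures copied from the deviceQuery output above.
num_mps               = 14     # multiprocessors
max_threads_per_mp    = 2048   # maximum number of threads per multiprocessor
max_threads_per_block = 1024   # maximum number of threads per block

N = 64
threads_per_slice = N * N                                   # one thread per element of a slice

# Upper bound on the number of threads resident on the whole device at once.
resident_threads = num_mps * max_threads_per_mp

# How many full 64 x 64 slices fit into that bound.
slices_at_once = div(resident_threads, threads_per_slice)

# Number of full-size blocks needed to cover the whole 64^3 volume in one launch.
blocks_for_volume = cld(N^3, max_threads_per_block)

println(slices_at_once)      # 7
println(blocks_for_volume)   # 256
```

So the estimate of 7 slices is consistent with the device limits, and covering the whole volume in a single launch would take 256 blocks of 1024 threads, which the hardware scheduler would then distribute over the MPs on its own.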

Have you looked into this warning? It should not say -1 CUDA Cores/MP. I imagine the problem may be with your CUDA environment rather than your code.

This should be fine; we did a complete system update after that, which also included the latest version of the CUDA Toolkit.