Organizing Threads and Block @cuda, 3D Arrays

Noobie76 · March 17, 2019, 9:17pm

Hi

I have a question about organizing threads and blocks when calling a custom CUDA kernel. My workstation runs a GTX Titan; here are the corresponding specs:

Device 0: "GeForce GTX TITAN"
  CUDA Driver Version / Runtime Version          5.5 / 4.2
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 6144 MBytes (6442254336 bytes)
MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)!
MapSMtoCores SM 3.5 is undefined (please update to the latest SDK)!
  (14) Multiprocessors x ( -1) CUDA Cores/MP:    -14 CUDA Cores
  GPU Clock rate:                                876 MHz (0.88 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

The underlying task is to generate a complex 3D (64x64x64) array; as an example, let's say a simple plane wave. I'm using the following code:

using CuArrays
using CUDAnative
using CUDAdrv

@inline function CUDAnative.exp(x::ComplexF64)
    scale = CUDAnative.exp( x.re )
    return ComplexF64( scale * CUDAnative.cos(x.im), scale * CUDAnative.sin(x.im) )
end

function kernel_initial_psi(a,N,k_in,dx,i)
    # global 1-based indices into the 2D slice
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    k = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    # guard both dimensions; a is the 2D slice, so it takes two indices
    if j <= size(a,1) && k <= size(a,2)
        a[j, k] = CUDAnative.exp(-im*dx*((i-N/2)*k_in[1]+ (j-N/2)*k_in[2]+(k-N/2)*k_in[3]))
    end
    return nothing
end

function kernel_twotothree(b, i, a)
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    k = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    # guard both dimensions of the 2D slice a before copying into slice i of b
    if j <= size(a,1) && k <= size(a,2)
        b[i, j, k] = a[j, k]
    end
    return nothing
end

#Parameters---------------------------------------------------------------------

N=64 # size of spatial grid
k=CuArray([1.0,1.0,1.0]) #initial wave direction
dx=0.1

#Initializations----------------------------------------------------------------

psi_3D=CuArrays.cuzeros(ComplexF64, (N,N,N))
psi_2D=CuArrays.cuzeros(ComplexF64, (N,N))

#Threads and Blocks-------------------------------------------------------------


blocks = (2,2)
threads = (32,32)

#initial data-------------------------------------------------------------------

for i=1:N
    @cuda blocks=blocks threads=threads kernel_initial_psi(psi_2D, N, k, dx, i)
    @cuda blocks=blocks threads=threads kernel_twotothree(psi_3D, i, psi_2D)
end

In theory, the strategy is as follows. To produce the 3D array, I divide it into 2D arrays of 64 x 64 entries each. It seems reasonable to subdivide each 2D array into four 32 x 32 tiles (that is, use 4 blocks of 1024 threads each). To produce the initial data for a 32 x 32 tile I need exactly 1024 threads (which is, by the way, the maximum number of threads per block for this GPU). The maximum number of threads per multiprocessor is 2048, so I will need two MPs to generate the full 64 x 64 array. The deviceQuery output above says that this GPU has 14 MPs in total, but I still get error messages like this:

ERROR: LoadError: CUDA error: an illegal instruction was encountered (code #715, ERROR_ILLEGAL_INSTRUCTION)
Stacktrace:
 [1] #upload!#10(::Bool, ::Function, ::CUDAdrv.Mem.Buffer, ::Ptr{Float64}, ::Int64, ::CuStream) at C:\Users\Noobie\.julia\packages\CUDAdrv\JWljj\src\base.jl:147
 [2] upload! at C:\Users\Noobie\.julia\packages\CUDAdrv\JWljj\src\memory.jl:229 [inlined] (repeats 2 times)
 [3] unsafe_copyto! at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:161 [inlined]
 [4] copyto!(::CuArray{Float64,1}, ::Array{Float64,1}) at C:\Users\Noobie\.julia\packages\GPUArrays\t8tJB\src\abstractarray.jl:110
 [5] CuArray{Float64,1}(::Array{Float64,1}) at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:85
 [6] CuArray(::Array{Float64,1}) at C:\Users\Noobie\.julia\packages\CuArrays\PD3UJ\src\array.jl:95
 [7] top-level scope at none:0
in expression starting at C:\Users\Noobie\Documents\TESTJULIAGPU\inital_data.jl:31
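
For reference, the thread and block arithmetic described above can be checked on the CPU without touching the GPU. The following is only a minimal sketch: `launch_coverage` is a hypothetical helper (not part of CUDAnative) that emulates `blockIdx()` and `threadIdx()` with plain loops and counts how often each entry of a 64 x 64 slice is visited, assuming a bounds guard on both indices.

```julia
# CPU-only sketch of the index arithmetic used by the kernels above.
# `launch_coverage` is a hypothetical helper, not a CUDAnative function.
function launch_coverage(n, threads, blocks)
    hits = zeros(Int, n, n)
    for by in 1:blocks[2], bx in 1:blocks[1]          # emulates blockIdx().y / .x
        for ty in 1:threads[2], tx in 1:threads[1]    # emulates threadIdx().y / .x
            j = (bx - 1) * threads[1] + tx
            k = (by - 1) * threads[2] + ty
            if j <= n && k <= n                       # guard on both indices
                hits[j, k] += 1
            end
        end
    end
    return hits
end

N = 64
threads = (32, 32)                                    # 32 * 32 = 1024 threads per block
blocks = (cld(N, threads[1]), cld(N, threads[2]))     # ceiling division per dimension
hits = launch_coverage(N, threads, blocks)
println(blocks, " ", all(hits .== 1))
```

If every entry is visited exactly once, the launch configuration itself covers the slice, and the cause of the error would have to be looked for elsewhere.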

Is it possible that I missed something? Or did I get my thread/block organization wrong? I'm relatively new to GPU programming, but I thought I had the underlying basics, which is obviously not the case. It would be great if someone could tell me what my mistake is.

I know this is maybe not the most efficient method to generate this data (if you have performance tips, I'm open to and thankful for your comments), but I think it should at least work.

Further, I want to use the full potential of my GPU, so I want to run the kernel on all MPs. With two MPs per slice and 14 MPs, it should be possible to calculate 7 slices of my 3D array at once. Is this possible at all?
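
To make the question concrete, one conceivable layout is a single launch over a 3D grid in which the z block index selects the slice, leaving the distribution of blocks over the MPs to the hardware scheduler. The sketch below only emulates that index mapping on the CPU. `coverage_3d` is a hypothetical helper, and the configuration (32, 32, 1) threads with (2, 2, 64) blocks is an assumption chosen to stay within the deviceQuery limits above (1024 threads per block, grid z dimension up to 65535); it has not been run on the GPU.

```julia
# CPU-only emulation of a 3D launch: z selects the slice i, x and y select (j, k).
# `coverage_3d` is a hypothetical helper, not a CUDAnative function.
function coverage_3d(n, threads, blocks)
    hits = zeros(Int, n, n, n)
    for bz in 1:blocks[3], by in 1:blocks[2], bx in 1:blocks[1]
        for tz in 1:threads[3], ty in 1:threads[2], tx in 1:threads[1]
            i = (bz - 1) * threads[3] + tz            # slice index
            j = (bx - 1) * threads[1] + tx
            k = (by - 1) * threads[2] + ty
            if i <= n && j <= n && k <= n             # guard on all three indices
                hits[i, j, k] += 1
            end
        end
    end
    return hits
end

N = 64
threads3 = (32, 32, 1)                                # still 1024 threads per block
blocks3 = (cld(N, threads3[1]), cld(N, threads3[2]), cld(N, threads3[3]))
hits3 = coverage_3d(N, threads3, blocks3)
println(blocks3, " ", all(hits3 .== 1))
```

With such a mapping the host-side loop over `i` and the intermediate 2D array would no longer be needed, since each emulated thread writes one entry of the 3D array directly.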

erikedin · March 18, 2019, 8:24am

Have you looked into the MapSMtoCores warning in your deviceQuery output? It should not say -1 CUDA Cores/MP. I imagine the problem may be with your CUDA environment rather than your code.

Noobie76 · March 18, 2019, 9:00am

This should be fine. We did a complete system update after that, which also included the latest version of the CUDA Toolkit.
