I got CUDAnative to work on a MacBook Pro 2016 on High Sierra with GeForce 1080 Ti running as an eGPU connected via USB-C.
Everything works great so far, but I need help understanding how blocks and threads work in general and in CUDAnative.
In the following statement,
@cuda (A, B) kernel_vadd(d_a, d_b, d_c)
I assume that A is the total number of blocks and B is the number of threads per block used to run the kernel.
It appears that the max number of threads I can specify is 1024, but I can’t figure out the max number of blocks. It is definitely higher than 65535, but lower than 65535 * 65535.
What would be the actual max number of blocks?
My other question: if num_of_blocks * num_of_threads is higher than the number of cores on the GPU, will the GPU queue the excess work? Am I correct in that assumption?
How can I find out the max number of threads that I can run concurrently on the GPU? Would that be the same as the number of cores?
Yes, the syntax is identical to CUDA’s <<<...>>> syntax (including optional arguments for shared memory and stream). Where CUDA uses dim3, you can just use a single value or a tuple of up to three values for each dimension.
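For example, a launch along the lines of your snippet could look like this. This is only a sketch: the kernel body and the array setup are assumptions (your thread only names kernel_vadd, d_a, d_b, d_c), and it follows the old-style @cuda (blocks, threads) form from your post:

```julia
using CUDAnative, CUDAdrv

# Hypothetical element-wise add kernel (body assumed; only the name
# kernel_vadd appears in the original post).
function kernel_vadd(a, b, c)
    # Global thread index: blocks and threads are both 1-based in Julia.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return nothing
end

n       = 100_000
threads = 1024             # max threads per block on most recent devices
blocks  = cld(n, threads)  # enough blocks to cover all n elements

# d_a, d_b, d_c are assumed to be device arrays (CuArray) of length n.
@cuda (blocks, threads) kernel_vadd(d_a, d_b, d_c)
```

Since n is generally not a multiple of the block size, the kernel guards against out-of-bounds indices and the launch rounds the block count up with cld.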
Use the deviceQuery sample, or just query the API using CUDAdrv:
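A sketch of the CUDAdrv route (the attribute constant names here are my assumption from CUDAdrv's CUdevice_attribute enum; check the spellings against the CUDAdrv version you have installed):

```julia
using CUDAdrv

dev = CuDevice(0)

# The same limits deviceQuery reports, queried programmatically:
@show attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)          # e.g. 1024
@show attribute(dev, CUDAdrv.MAX_GRID_DIM_X)                 # e.g. 2147483647
@show attribute(dev, CUDAdrv.MULTIPROCESSOR_COUNT)           # e.g. 20
@show attribute(dev, CUDAdrv.MAX_THREADS_PER_MULTIPROCESSOR) # e.g. 2048
```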
$ /opt/cuda-8.0/samples/1_Utilities/deviceQuery/deviceQuery
/opt/cuda-8.0/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8114 MBytes (8508145664 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1734 MHz (1.73 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 131 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Result = PASS
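Reading the relevant numbers off this output answers your questions directly: the grid x-dimension maximum is 2147483647 (the y and z dimensions are capped at 65535, which is why the limit looked like "more than 65535 but less than 65535 * 65535"), and the number of threads that can be resident at once is multiprocessors times threads-per-multiprocessor, not the CUDA core count. A quick sanity check for this particular device:

```julia
# Numbers taken from the deviceQuery output above (GTX 1080, CC 6.1):
sms            = 20           # multiprocessors
threads_per_sm = 2048         # max resident threads per multiprocessor
max_grid_x     = 2147483647   # max grid x-dimension

# Threads that can be resident on the GPU at once:
concurrent_threads = sms * threads_per_sm
println(concurrent_threads)   # 40960, i.e. more than the 2560 CUDA cores
```

Anything you launch beyond that is queued and scheduled by the hardware as blocks finish, so yes, your assumption about queueing is essentially right.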