I got CUDAnative to work on a MacBook Pro 2016 on High Sierra with GeForce 1080 Ti running as an eGPU connected via USB-C.
Everything works great so far, but I need help understanding how blocks and threads work in general and in CUDAnative.
In the following statement,
@cuda (A, B) kernel_vadd(d_a, d_b, d_c)
I assume that A is the total number of blocks and B is the number of threads per block used to run the kernel.
It appears that the max number of threads I can specify is 1024, but I can’t figure out the max number of blocks. It is definitely higher than 65535, but lower than 65535 * 65535.
What would be the actual max number of blocks?
My other question: if num_of_blocks * num_of_threads is higher than the number of cores on the GPU, will the GPU queue the excess work? Am I correct in that assumption?
How can I find out the max number of threads that I can run concurrently on the GPU? Would that be the same as the number of cores?
Yes, the syntax is identical to CUDA’s <<<...>>> syntax (including optional arguments for shared memory and stream). Where CUDA uses dim3, you can just use a single value or a tuple of up to three values for each dimension.
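For example, a launch along the lines of your snippet could look like this. This is only a sketch: the kernel body and the array setup are assumptions (your thread only names kernel_vadd, d_a, d_b, d_c), and it follows the old-style @cuda (blocks, threads) form from your post:

```julia
using CUDAnative, CUDAdrv

# Hypothetical element-wise add kernel (body assumed; only the name
# kernel_vadd appears in the original post).
function kernel_vadd(a, b, c)
    # Global thread index: blocks and threads are both 1-based in Julia.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return nothing
end

n       = 100_000
threads = 1024             # max threads per block on most recent devices
blocks  = cld(n, threads)  # enough blocks to cover all n elements

# d_a, d_b, d_c are assumed to be device arrays (CuArray) of length n.
@cuda (blocks, threads) kernel_vadd(d_a, d_b, d_c)
```

Since n is generally not a multiple of the block size, the kernel guards against out-of-bounds indices and the launch rounds the block count up with cld.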
Use the deviceQuery sample, or just query the API using CUDAdrv:
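A sketch of the CUDAdrv route (the attribute constant names here are my assumption from CUDAdrv's CUdevice_attribute enum; check the spellings against the CUDAdrv version you have installed):

```julia
using CUDAdrv

dev = CuDevice(0)

# The same limits deviceQuery reports, queried programmatically:
@show attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)          # e.g. 1024
@show attribute(dev, CUDAdrv.MAX_GRID_DIM_X)                 # e.g. 2147483647
@show attribute(dev, CUDAdrv.MULTIPROCESSOR_COUNT)           # e.g. 20
@show attribute(dev, CUDAdrv.MAX_THREADS_PER_MULTIPROCESSOR) # e.g. 2048
```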
$ /opt/cuda-8.0/samples/1_Utilities/deviceQuery/deviceQuery
/opt/cuda-8.0/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8114 MBytes (8508145664 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1734 MHz (1.73 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 131 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Result = PASS
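Reading the relevant numbers off this output answers your questions directly: the grid x-dimension maximum is 2147483647 (the y and z dimensions are capped at 65535, which is why the limit looked like "more than 65535 but less than 65535 * 65535"), and the number of threads that can be resident at once is multiprocessors times threads-per-multiprocessor, not the CUDA core count. A quick sanity check for this particular device:

```julia
# Numbers taken from the deviceQuery output above (GTX 1080, CC 6.1):
sms            = 20           # multiprocessors
threads_per_sm = 2048         # max resident threads per multiprocessor
max_grid_x     = 2147483647   # max grid x-dimension

# Threads that can be resident on the GPU at once:
concurrent_threads = sms * threads_per_sm
println(concurrent_threads)   # 40960, i.e. more than the 2560 CUDA cores
```

Anything you launch beyond that is queued and scheduled by the hardware as blocks finish, so yes, your assumption about queueing is essentially right.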