Hi,
The CUDA.jl tutorial starts with broadcasted addition and then walks through a series of steps to write a kernel that’s supposed to be about as fast. On my machine, however, the kernel version is about 10x slower, and I’d like to understand why.
Here’s the broadcasted version:
using CUDA
using BenchmarkTools
using Test

N = 2^20
x_d = CUDA.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CUDA.fill(2.0f0, N)  # a vector stored on the GPU filled with 2.0

function add_broadcast!(y, x)
    CUDA.@sync y .+= x
    return nothing
end

@btime add_broadcast!($y_d, $x_d)
> 15.858 ms (24 allocations: 576 bytes)
Here’s the kernel version:
threads_per_block = 128

function gpu_add3!(y, x)
    # grid-stride loop: each thread starts at its global index and
    # advances by the total number of threads in the grid
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return
end

numblocks = ceil(Int, N / threads_per_block)  # 2^20 / 128 = 8192 blocks, matching the grid size in the trace
fill!(y_d, 2.0f0)  # reset y_d, since the broadcast benchmark mutated it
@cuda threads=threads_per_block blocks=numblocks gpu_add3!(y_d, x_d)
@test all(Array(y_d) .== 3.0f0)
function bench_gpu3!(y, x)
    numblocks = ceil(Int, length(y) / threads_per_block)
    CUDA.@sync begin
        @cuda threads=threads_per_block blocks=numblocks gpu_add3!(y, x)
    end
end
@btime bench_gpu3!($y_d, $x_d)
> 112.526 ms (41 allocations: 1.34 KiB)
I’ve tried several variations, such as increasing/decreasing the block size and increasing/decreasing the input size (see the sketch below), but the gpu_add3! version is consistently about 10x slower than the broadcast version.
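For concreteness, the block-size sweep I mean looks roughly like this (a sketch; bench_gpu3_threads! is a hypothetical wrapper around the same kernel, not code from the tutorial):

# hypothetical helper: same launch as bench_gpu3!, but with the block size as a parameter
function bench_gpu3_threads!(y, x, threads)
    numblocks = cld(length(y), threads)
    CUDA.@sync begin
        @cuda threads=threads blocks=numblocks gpu_add3!(y, x)
    end
end

for t in (64, 128, 256, 512, 1024)
    print(t, " threads/block: ")
    @btime bench_gpu3_threads!($y_d, $x_d, $t)
end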
Here’s the nvprof summary for the kernel version:
==2092== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 85.71% 2.0069ms 1 2.0069ms 2.0069ms 2.0069ms [CUDA memcpy DtoH]
14.01% 328.13us 3 109.38us 107.17us 113.60us julia_gpu_add3__1827(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=1, int=1>)
0.20% 4.5760us 3 1.5250us 1.4080us 1.7600us [CUDA memset]
0.08% 1.8880us 1 1.8880us 1.8880us 1.8880us [CUDA memcpy HtoD]
API calls: 93.58% 359.42ms 1 359.42ms 359.42ms 359.42ms cuDevicePrimaryCtxRetain
2.64% 10.146ms 1 10.146ms 10.146ms 10.146ms cuModuleLoadDataEx
0.99% 3.8013ms 1 3.8013ms 3.8013ms 3.8013ms cuMemcpyDtoH
0.95% 3.6356ms 3 1.2119ms 71.279us 3.4904ms cuLaunchKernel
0.76% 2.8998ms 1 2.8998ms 2.8998ms 2.8998ms cuLinkAddFile
0.36% 1.3694ms 1 1.3694ms 1.3694ms 1.3694ms cuMemHostAlloc
0.26% 980.27us 1 980.27us 980.27us 980.27us cuLinkComplete
0.14% 521.55us 2 260.77us 186.59us 334.96us cuMemAlloc
0.13% 513.75us 2 256.87us 251.49us 262.26us cuEventSynchronize
0.06% 222.38us 1 222.38us 222.38us 222.38us cuLinkAddData
0.05% 204.32us 3 68.105us 26.826us 91.459us cuMemsetD32Async
0.02% 95.414us 1 95.414us 95.414us 95.414us cuMemGetInfo
0.02% 73.536us 1 73.536us 73.536us 73.536us cuLinkCreate
0.01% 48.610us 1 48.610us 48.610us 48.610us cuMemcpyHtoD
0.01% 20.558us 11 1.8680us 753ns 4.0620us cuDeviceGet
0.00% 18.130us 2 9.0650us 8.6750us 9.4550us cuEventRecord
0.00% 15.022us 2 7.5110us 5.4400us 9.5820us cuEventCreate
0.00% 12.310us 7 1.7580us 844ns 4.0420us cuDeviceGetAttribute
0.00% 9.2760us 6 1.5460us 1.0390us 3.0210us cuCtxGetCurrent
0.00% 8.0800us 1 8.0800us 8.0800us 8.0800us cuLinkDestroy
0.00% 7.9590us 1 7.9590us 7.9590us 7.9590us cuEventDestroy
0.00% 6.3730us 1 6.3730us 6.3730us 6.3730us cuMemHostGetDevicePointer
0.00% 6.3480us 3 2.1160us 775ns 3.2900us cuDeviceGetCount
0.00% 6.0190us 1 6.0190us 6.0190us 6.0190us cuProfilerStart
0.00% 4.9870us 1 4.9870us 4.9870us 4.9870us cuDeviceGetPCIBusId
0.00% 4.6490us 1 4.6490us 4.6490us 4.6490us cuCtxSetCurrent
0.00% 2.8090us 1 2.8090us 2.8090us 2.8090us cuModuleGetGlobal
0.00% 2.2860us 1 2.2860us 2.2860us 2.2860us cuModuleGetFunction
0.00% 1.2350us 1 1.2350us 1.2350us 1.2350us cuDriverGetVersion
0.00% 1.0120us 1 1.0120us 1.0120us 1.0120us cuCtxGetDevice
With --print-gpu-trace:
==2186== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
2.10452s 1.7600us - - - - - 4.0000MB 2219.5GB/s Device - Tesla V100-SXM2 1 14 [CUDA memset]
2.10512s 1.4080us - - - - - 4.0000MB 2774.3GB/s Device - Tesla V100-SXM2 1 14 [CUDA memset]
4.02924s 1.3760us - - - - - 4.0000MB 2838.8GB/s Device - Tesla V100-SXM2 1 14 [CUDA memset]
23.3413s 1.8880us - - - - - 8B 4.0410MB/s Pageable Device Tesla V100-SXM2 1 14 [CUDA memcpy HtoD]
24.1451s 113.57us (8192 1 1) (128 1 1) 32 0B 0B - - - - Tesla V100-SXM2 1 14 julia_gpu_add3__1827(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=1, int=1>) [48]
24.4577s 2.0052ms - - - - - 4.0000MB 1.9481GB/s Device Pageable Tesla V100-SXM2 1 14 [CUDA memcpy DtoH]
24.8362s 107.87us (8192 1 1) (128 1 1) 32 0B 0B - - - - Tesla V100-SXM2 1 14 julia_gpu_add3__1827(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=1, int=1>) [52]
25.2776s 107.36us (8192 1 1) (128 1 1) 32 0B 0B - - - - Tesla V100-SXM2 1 14 julia_gpu_add3__1827(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=1, int=1>) [59]
And here’s the nvprof summary for the broadcast version:
==4072== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 1.05739s 62006 17.053us 16.640us 21.472us julia_broadcast_kernel_1889(CuKernelContext, CuDeviceArray<Float32, int=1, int=1>, Broadcasted<void, Tuple<OneTo<Int64>>, __, Broadcasted<Extruded<CuDeviceArray<Float32, int=1, int=1>, Broadcasted<Bool>, Broadcasted<OneTo>>, OneTo<Int64, CuDeviceArray<Float32, int=1, int=1>, Broadcasted<Tuple<OneTo<Int64>>>, Broadcasted<OneTo>>>>, OneTo)
0.00% 3.1360us 2 1.5680us 1.3760us 1.7600us [CUDA memset]
0.00% 1.8880us 1 1.8880us 1.8880us 1.8880us [CUDA memcpy HtoD]
API calls: 64.65% 3.58801s 62006 57.865us 5.6180us 912.53us cuEventSynchronize
16.57% 919.28ms 62006 14.825us 9.9590us 3.6430ms cuLaunchKernel
7.27% 403.37ms 1 403.37ms 403.37ms 403.37ms cuDevicePrimaryCtxRetain
4.54% 251.88ms 62006 4.0620us 3.1020us 19.332us cuEventRecord
2.47% 137.29ms 62006 2.2140us 1.0210us 1.6304ms cuEventCreate
1.86% 103.22ms 62006 1.6640us 977ns 14.659us cuOccupancyMaxPotentialBlockSize
1.23% 68.482ms 62005 1.1040us 672ns 769.06us cuEventDestroy
1.07% 59.656ms 62010 962ns 430ns 13.232us cuCtxGetCurrent
0.20% 10.981ms 1 10.981ms 10.981ms 10.981ms cuModuleLoadDataEx
0.06% 3.3327ms 1 3.3327ms 3.3327ms 3.3327ms cuLinkAddFile
0.03% 1.5934ms 1 1.5934ms 1.5934ms 1.5934ms cuMemHostAlloc
0.02% 1.1301ms 1 1.1301ms 1.1301ms 1.1301ms cuLinkComplete
0.01% 554.46us 2 277.23us 187.28us 367.18us cuMemAlloc
0.01% 304.25us 1 304.25us 304.25us 304.25us cuLinkAddData
0.00% 137.05us 2 68.523us 25.263us 111.78us cuMemsetD32Async
0.00% 84.701us 1 84.701us 84.701us 84.701us cuLinkCreate
0.00% 70.114us 1 70.114us 70.114us 70.114us cuMemcpyHtoD
0.00% 27.932us 11 2.5390us 835ns 5.6420us cuDeviceGet
0.00% 13.771us 7 1.9670us 858ns 5.7930us cuDeviceGetAttribute
0.00% 11.611us 1 11.611us 11.611us 11.611us cuMemHostGetDevicePointer
0.00% 11.318us 1 11.318us 11.318us 11.318us cuLinkDestroy
0.00% 9.5930us 3 3.1970us 847ns 4.4120us cuDeviceGetCount
0.00% 8.4260us 1 8.4260us 8.4260us 8.4260us cuCtxSetCurrent
0.00% 7.1590us 1 7.1590us 7.1590us 7.1590us cuProfilerStart
0.00% 7.0390us 1 7.0390us 7.0390us 7.0390us cuDeviceGetPCIBusId
0.00% 3.6610us 1 3.6610us 3.6610us 3.6610us cuModuleGetGlobal
0.00% 3.2400us 1 3.2400us 3.2400us 3.2400us cuModuleGetFunction
0.00% 1.5470us 1 1.5470us 1.5470us 1.5470us cuDriverGetVersion
0.00% 1.4130us 1 1.4130us 1.4130us 1.4130us cuCtxGetDevice
The trace from the broadcast version is extremely long, but has a lot of lines that look like this:
37.8500s 17.311us (4096 1 1) (256 1 1) 34 0B 0B - - - - Tesla V100-SXM2 1 14 julia_broadcast_kernel_1889(CuKernelContext, CuDeviceArray<Float32, int=1, int=1>, Broadcasted<void, Tuple<OneTo<Int64>>, __, Broadcasted<Extruded<CuDeviceArray<Float32, int=1, int=1>, Broadcasted<Bool>, Broadcasted<OneTo>>, OneTo<Int64, CuDeviceArray<Float32, int=1, int=1>, Broadcasted<Tuple<OneTo<Int64>>>, Broadcasted<OneTo>>>>, OneTo) [445460]
The machine is an AWS p3.2xlarge with an NVIDIA Tesla V100 GPU, in case that’s relevant.