Hi,
The CUDA.jl tutorial starts with broadcasted addition and then walks through a series of steps to write a kernel that’s supposed to be about as fast. On my machine, however, the kernel version is about 10x slower, and I’d like to understand why.
Here’s the broadcasted version:
using CUDA
using BenchmarkTools
using Test

N = 2^20
x_d = CUDA.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CUDA.fill(2.0f0, N)  # a vector stored on the GPU filled with 2.0

function add_broadcast!(y, x)
    CUDA.@sync y .+= x
    return nothing
end

@btime add_broadcast!($y_d, $x_d)
> 15.858 ms (24 allocations: 576 bytes)
Here’s the kernel version:
threads_per_block = 128

function gpu_add3!(y, x)
    # grid-stride loop: each thread starts at its global index and
    # advances by the total number of threads in the grid
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return
end

numblocks = ceil(Int, N / threads_per_block)  # 2^20 / 128 = 8192 blocks, matching the grid size in the trace
fill!(y_d, 2.0f0)  # reset y_d, since the broadcast benchmark mutated it
@cuda threads=threads_per_block blocks=numblocks gpu_add3!(y_d, x_d)
@test all(Array(y_d) .== 3.0f0)
function bench_gpu3!(y, x)
    numblocks = ceil(Int, length(y) / threads_per_block)
    CUDA.@sync begin
        @cuda threads=threads_per_block blocks=numblocks gpu_add3!(y, x)
    end
end
@btime bench_gpu3!($y_d, $x_d)
> 112.526 ms (41 allocations: 1.34 KiB)
I’ve tried several variations, such as increasing/decreasing the block size and increasing/decreasing the input size (see the sketch below), but the gpu_add3! version is consistently about 10x slower than the broadcast version.
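For concreteness, the block-size sweep I mean looks roughly like this (a sketch; bench_gpu3_threads! is a hypothetical wrapper around the same kernel, not code from the tutorial):

# hypothetical helper: same launch as bench_gpu3!, but with the block size as a parameter
function bench_gpu3_threads!(y, x, threads)
    numblocks = cld(length(y), threads)
    CUDA.@sync begin
        @cuda threads=threads blocks=numblocks gpu_add3!(y, x)
    end
end

for t in (64, 128, 256, 512, 1024)
    print(t, " threads/block: ")
    @btime bench_gpu3_threads!($y_d, $x_d, $t)
end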
Here’s the nvprof summary for the kernel version:
==2092== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 85.71% 2.0069ms 1 2.0069ms 2.0069ms 2.0069ms [CUDA memcpy DtoH]
14.01% 328.13us 3 109.38us 107.17us 113.60us julia_gpu_add3__1827(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=1, int=1>)
0.20% 4.5760us 3 1.5250us 1.4080us 1.7600us [CUDA memset]
0.08% 1.8880us 1 1.8880us 1.8880us 1.8880us [CUDA memcpy HtoD]
API calls: 93.58% 359.42ms 1 359.42ms 359.42ms 359.42ms cuDevicePrimaryCtxRetain
2.64% 10.146ms 1 10.146ms 10.146ms 10.146ms cuModuleLoadDataEx
0.99% 3.8013ms 1 3.8013ms 3.8013ms 3.8013ms cuMemcpyDtoH
0.95% 3.6356ms 3 1.2119ms 71.279us 3.4904ms cuLaunchKernel
0.76% 2.8998ms 1 2.8998ms 2.8998ms 2.8998ms cuLinkAddFile
0.36% 1.3694ms 1 1.3694ms 1.3694ms 1.3694ms cuMemHostAlloc
0.26% 980.27us 1 980.27us 980.27us 980.27us cuLinkComplete
0.14% 521.55us 2 260.77us 186.59us 334.96us cuMemAlloc
0.13% 513.75us 2 256.87us 251.49us 262.26us cuEventSynchronize
0.06% 222.38us 1 222.38us 222.38us 222.38us cuLinkAddData
0.05% 204.32us 3 68.105us 26.826us 91.459us cuMemsetD32Async
0.02% 95.414us 1 95.414us 95.414us 95.414us cuMemGetInfo
0.02% 73.536us 1 73.536us 73.536us 73.536us cuLinkCreate
0.01% 48.610us 1 48.610us 48.610us 48.610us cuMemcpyHtoD
0.01% 20.558us 11 1.8680us 753ns 4.0620us cuDeviceGet
0.00% 18.130us 2 9.0650us 8.6750us 9.4550us cuEventRecord
0.00% 15.022us 2 7.5110us 5.4400us 9.5820us cuEventCreate
0.00% 12.310us 7 1.7580us 844ns 4.0420us cuDeviceGetAttribute
0.00% 9.2760us 6 1.5460us 1.0390us 3.0210us cuCtxGetCurrent
0.00% 8.0800us 1 8.0800us 8.0800us 8.0800us cuLinkDestroy
0.00% 7.9590us 1 7.9590us 7.9590us 7.9590us cuEventDestroy
0.00% 6.3730us 1 6.3730us 6.3730us 6.3730us cuMemHostGetDevicePointer
0.00% 6.3480us 3 2.1160us 775ns 3.2900us cuDeviceGetCount
0.00% 6.0190us 1 6.0190us 6.0190us 6.0190us cuProfilerStart
0.00% 4.9870us 1 4.9870us 4.9870us 4.9870us cuDeviceGetPCIBusId
0.00% 4.6490us 1 4.6490us 4.6490us 4.6490us cuCtxSetCurrent
0.00% 2.8090us 1 2.8090us 2.8090us 2.8090us cuModuleGetGlobal
0.00% 2.2860us 1 2.2860us 2.2860us 2.2860us cuModuleGetFunction
0.00% 1.2350us 1 1.2350us 1.2350us 1.2350us cuDriverGetVersion
0.00% 1.0120us 1 1.0120us 1.0120us 1.0120us cuCtxGetDevice
With --print-gpu-trace:
==2186== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput SrcMemType DstMemType Device Context Stream Name
2.10452s 1.7600us - - - - - 4.0000MB 2219.5GB/s Device - Tesla V100-SXM2 1 14 [CUDA memset]
2.10512s 1.4080us - - - - - 4.0000MB 2774.3GB/s Device - Tesla V100-SXM2 1 14 [CUDA memset]
4.02924s 1.3760us - - - - - 4.0000MB 2838.8GB/s Device - Tesla V100-SXM2 1 14 [CUDA memset]
23.3413s 1.8880us - - - - - 8B 4.0410MB/s Pageable Device Tesla V100-SXM2 1 14 [CUDA memcpy HtoD]
24.1451s 113.57us (8192 1 1) (128 1 1) 32 0B 0B - - - - Tesla V100-SXM2 1 14 julia_gpu_add3__1827(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=1, int=1>) [48]
24.4577s 2.0052ms - - - - - 4.0000MB 1.9481GB/s Device Pageable Tesla V100-SXM2 1 14 [CUDA memcpy DtoH]
24.8362s 107.87us (8192 1 1) (128 1 1) 32 0B 0B - - - - Tesla V100-SXM2 1 14 julia_gpu_add3__1827(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=1, int=1>) [52]
25.2776s 107.36us (8192 1 1) (128 1 1) 32 0B 0B - - - - Tesla V100-SXM2 1 14 julia_gpu_add3__1827(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=1, int=1>) [59]
And here’s the nvprof summary for the broadcast version:
==4072== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 1.05739s 62006 17.053us 16.640us 21.472us julia_broadcast_kernel_1889(CuKernelContext, CuDeviceArray<Float32, int=1, int=1>, Broadcasted<void, Tuple<OneTo<Int64>>, __, Broadcasted<Extruded<CuDeviceArray<Float32, int=1, int=1>, Broadcasted<Bool>, Broadcasted<OneTo>>, OneTo<Int64, CuDeviceArray<Float32, int=1, int=1>, Broadcasted<Tuple<OneTo<Int64>>>, Broadcasted<OneTo>>>>, OneTo)
0.00% 3.1360us 2 1.5680us 1.3760us 1.7600us [CUDA memset]
0.00% 1.8880us 1 1.8880us 1.8880us 1.8880us [CUDA memcpy HtoD]
API calls: 64.65% 3.58801s 62006 57.865us 5.6180us 912.53us cuEventSynchronize
16.57% 919.28ms 62006 14.825us 9.9590us 3.6430ms cuLaunchKernel
7.27% 403.37ms 1 403.37ms 403.37ms 403.37ms cuDevicePrimaryCtxRetain
4.54% 251.88ms 62006 4.0620us 3.1020us 19.332us cuEventRecord
2.47% 137.29ms 62006 2.2140us 1.0210us 1.6304ms cuEventCreate
1.86% 103.22ms 62006 1.6640us 977ns 14.659us cuOccupancyMaxPotentialBlockSize
1.23% 68.482ms 62005 1.1040us 672ns 769.06us cuEventDestroy
1.07% 59.656ms 62010 962ns 430ns 13.232us cuCtxGetCurrent
0.20% 10.981ms 1 10.981ms 10.981ms 10.981ms cuModuleLoadDataEx
0.06% 3.3327ms 1 3.3327ms 3.3327ms 3.3327ms cuLinkAddFile
0.03% 1.5934ms 1 1.5934ms 1.5934ms 1.5934ms cuMemHostAlloc
0.02% 1.1301ms 1 1.1301ms 1.1301ms 1.1301ms cuLinkComplete
0.01% 554.46us 2 277.23us 187.28us 367.18us cuMemAlloc
0.01% 304.25us 1 304.25us 304.25us 304.25us cuLinkAddData
0.00% 137.05us 2 68.523us 25.263us 111.78us cuMemsetD32Async
0.00% 84.701us 1 84.701us 84.701us 84.701us cuLinkCreate
0.00% 70.114us 1 70.114us 70.114us 70.114us cuMemcpyHtoD
0.00% 27.932us 11 2.5390us 835ns 5.6420us cuDeviceGet
0.00% 13.771us 7 1.9670us 858ns 5.7930us cuDeviceGetAttribute
0.00% 11.611us 1 11.611us 11.611us 11.611us cuMemHostGetDevicePointer
0.00% 11.318us 1 11.318us 11.318us 11.318us cuLinkDestroy
0.00% 9.5930us 3 3.1970us 847ns 4.4120us cuDeviceGetCount
0.00% 8.4260us 1 8.4260us 8.4260us 8.4260us cuCtxSetCurrent
0.00% 7.1590us 1 7.1590us 7.1590us 7.1590us cuProfilerStart
0.00% 7.0390us 1 7.0390us 7.0390us 7.0390us cuDeviceGetPCIBusId
0.00% 3.6610us 1 3.6610us 3.6610us 3.6610us cuModuleGetGlobal
0.00% 3.2400us 1 3.2400us 3.2400us 3.2400us cuModuleGetFunction
0.00% 1.5470us 1 1.5470us 1.5470us 1.5470us cuDriverGetVersion
0.00% 1.4130us 1 1.4130us 1.4130us 1.4130us cuCtxGetDevice
The trace from the broadcast version is extremely long, but has a lot of lines that look like this:
37.8500s 17.311us (4096 1 1) (256 1 1) 34 0B 0B - - - - Tesla V100-SXM2 1 14 julia_broadcast_kernel_1889(CuKernelContext, CuDeviceArray<Float32, int=1, int=1>, Broadcasted<void, Tuple<OneTo<Int64>>, __, Broadcasted<Extruded<CuDeviceArray<Float32, int=1, int=1>, Broadcasted<Bool>, Broadcasted<OneTo>>, OneTo<Int64, CuDeviceArray<Float32, int=1, int=1>, Broadcasted<Tuple<OneTo<Int64>>>, Broadcasted<OneTo>>>>, OneTo) [445460]
The machine is an AWS p3.2xlarge with an NVIDIA Tesla V100 GPU, in case that’s relevant.