[Also copied from Slack:]
Me:
Quick follow-up question: We have implemented the same kernel_matmul_fast function in Python using Numba+CUDA. When profiling with nvprof, we measured that the Julia+CUDA implementation is about 30 percent slower than the Numba+CUDA version on a Tesla V100 GPU (13 ms for Julia vs. 10 ms for Python, averaged over 100 kernel calls with m=p=100 and blocks=112). Are there any “easy” performance optimizations we can use to speed up the Julia implementation to reach the performance of the Numba+CUDA version? Again, here is our current Julia+CUDA implementation of the kernel function:
""" Compute C = A * B """
function kernel_matmul_fast(C, A, B, m, p)
tx = threadIdx().x
# Important: The second @cuDynamicSharedMem allocation needs an offset of sizeof(sA), as both use a single kernel-level buffer
sA = @cuDynamicSharedMem(Float32, (m,p))
sB = @cuDynamicSharedMem(Float32, p, sizeof(sA))
# Initialize shared memory for A
for j in 1:p
sA[tx, j] = A[tx, j]
end
# Initialize shared memory for B
if tx == 1
for j in 1:p
sB[j] = B[j]
end
end
# Wait until all threads finish preloading
sync_threads()
for j in 1:2000
Cvalue = 0.0f0
if tx <= m
for i = 1:p
Cvalue += sA[tx, i] * sB[i]
end
C[tx] = Cvalue
end
end
return nothing
end
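[The launch code was not part of the thread; the following is a minimal sketch of how a kernel using @cuDynamicSharedMem is launched, assuming the m=p=100, blocks=112 configuration mentioned above and sizing the dynamic shared memory to cover both sA and sB:]
using CUDA  # assumption: the thread used the older CUDAnative-era API, but the launch syntax is the same

m, p = 100, 100
A = CUDA.rand(Float32, m, p)
B = CUDA.rand(Float32, p)
C = CUDA.zeros(Float32, m)

# Dynamic shared memory is sized at launch: m*p Float32s for sA plus p for sB,
# matching the sizeof(sA) offset used inside the kernel.
shmem_bytes = sizeof(Float32) * (m * p + p)

# blocks=112 mirrors the benchmark configuration; since the kernel never uses
# blockIdx, every block computes and writes the same C.
@cuda threads=m blocks=112 shmem=shmem_bytes kernel_matmul_fast(C, A, B, m, p)
synchronize()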
This is our equivalent Python Numba+CUDA code for the performance comparison (being about 30 percent faster than the above Julia+CUDA implementation):
@cuda.jit
def matmul_fast(A, B, C):
    """Perform square matrix multiplication of C = A * B"""
    tx = cuda.threadIdx.x
    sA = cuda.shared.array(shape=(m,p), dtype=float32)
    sB = cuda.shared.array(shape=p, dtype=float32)
    for j in range(p):
        sA[tx,j] = A[tx,j]
    if tx == 0:
        for j in range(p):
            sB[j] = B[j]
    # Wait until all threads finish preloading
    cuda.syncthreads()
    for j in range(2*1000):
        Cvalue = 0.0
        if tx < m:
            for i in range(p):
                Cvalue += sA[tx, i] * sB[i]
            C[tx] = Cvalue
Tim Besard:
@inbounds
FWIW, you can use a type param for the shared memory type derived from the C, A and B args
other than that, if you’re compute bound you might lose some performance over Julia’s integers being 64-bit wide, whereas Numba may be using 32-bit integers (which can execute concurrently with floating-point operations)
but you’d need to profile for that
also make sure it’s actually the kernel that’s taking 30% longer, and not just the host code
generally, low-level kernels are as fast as CUDA C, so I’d be surprised if this difference can’t be worked around
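[For reference, a sketch of what those two suggestions could look like applied to the kernel; this is not from the thread and untested, and it assumes C, A and B share a single element type:]
""" Compute C = A * B (variant with a type parameter and 32-bit loop counters) """
function kernel_matmul_fast(C::AbstractArray{T}, A, B, m, p) where {T}
    tx = threadIdx().x
    # Shared-memory element type derived from the arguments instead of hard-coded Float32
    sA = @cuDynamicSharedMem(T, (m,p))
    sB = @cuDynamicSharedMem(T, p, sizeof(sA))
    for j in 1:p
        @inbounds sA[tx, j] = A[tx, j]
    end
    if tx == 1
        for j in 1:p
            @inbounds sB[j] = B[j]
        end
    end
    sync_threads()
    for j in 1:2000
        Cvalue = zero(T)
        if tx <= m
            # 32-bit loop counter, mirroring what Numba likely generates
            for i in Int32(1):Int32(p)
                @inbounds Cvalue += sA[tx, i] * sB[i]
            end
            @inbounds C[tx] = Cvalue
        end
    end
    return nothing
end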
Me:
Wow, 3.98 ms using @inbounds
==4011247== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 398.30ms 100 3.9830ms 2.7499ms 5.2974ms julia_kernel_matmul_fast_2131(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=2, int=1>, CuDeviceArray<Float32, int=1, int=1>, Int64, CuDeviceArray<Float32, int=1, int=1>)
0.00% 13.472us 4 3.3680us 1.7280us 6.8480us [CUDA memcpy HtoD]
0.00% 3.1360us 1 3.1360us 3.1360us 3.1360us [CUDA memcpy DtoH]
API calls: 62.94% 400.00ms 1 400.00ms 400.00ms 400.00ms cuDevicePrimaryCtxRetain
25.21% 160.24ms 479530 334ns 267ns 430.17us cuStreamQuery
9.11% 57.922ms 479662 120ns 88ns 430.55us cuCtxGetCurrent
0.96% 6.1138ms 3 2.0379ms 1.5370us 6.1054ms
0.76% 4.8186ms 1 4.8186ms 4.8186ms 4.8186ms cuModuleLoadDataEx
0.41% 2.6278ms 100 26.278us 4.7700us 1.9696ms cuLaunchKernel
0.30% 1.8898ms 1 1.8898ms 1.8898ms 1.8898ms cuMemHostAlloc
0.17% 1.0600ms 1 1.0600ms 1.0600ms 1.0600ms cuModuleUnload
0.08% 490.29us 1 490.29us 490.29us 490.29us
0.01% 74.922us 3 24.974us 8.3730us 45.609us cuMemGetInfo
0.01% 73.957us 4 18.489us 4.7250us 34.344us cuMemcpyHtoDAsync
0.01% 39.144us 1 39.144us 39.144us 39.144us cuMemcpyDtoHAsync
0.01% 32.555us 1 32.555us 32.555us 32.555us cuCtxSynchronize
0.00% 27.037us 1 27.037us 27.037us 27.037us cuStreamDestroy
0.00% 26.756us 1 26.756us 26.756us 26.756us cuStreamCreate
0.00% 26.085us 3 8.6950us 1.7840us 15.780us
0.00% 13.795us 5 2.7590us 222ns 10.191us cuCtxPushCurrent
0.00% 11.778us 7 1.6820us 302ns 5.2360us cuDeviceGetCount
0.00% 11.164us 12 930ns 148ns 5.9480us cuDeviceGetAttribute
0.00% 9.5500us 4 2.3870us 353ns 4.9700us cuPointerGetAttribute
0.00% 9.2110us 15 614ns 82ns 2.0580us cuDriverGetVersion
0.00% 6.7300us 7 961ns 159ns 1.9960us cuCtxSetCurrent
0.00% 3.3900us 3 1.1300us 807ns 1.4420us cuDeviceGet
0.00% 3.0010us 9 333ns 138ns 584ns cuCtxGetDevice
0.00% 2.4110us 5 482ns 193ns 1.0440us cuCtxPopCurrent
0.00% 1.7210us 1 1.7210us 1.7210us 1.7210us cuModuleGetFunction
0.00% 761ns 1 761ns 761ns 761ns cuModuleGetGlobal
0.00% 595ns 1 595ns 595ns 595ns cuMemHostGetDevicePointer
0.00% 468ns 1 468ns 468ns 468ns
I just added one @inbounds here:
for j in 1:2000
    Cvalue = 0.0f0
    if tx <= m
        for i = 1:p
            @inbounds Cvalue += sA[tx, i] * sB[i]
        end
        C[tx] = Cvalue
    end
end
Tim Besard:
the shared memory accesses are also bounds checked
Me:
OK, I have added some more @inbounds; the performance remains roughly the same:
""" Compute C = A * B """
function kernel_matmul_fast(C, A, B, m, p)
tx = threadIdx().x
# Important: The second @cuDynamicSharedMem allocation needs an offset of sizeof(sA), as it uses a single kernel-level buffer
sA = @cuDynamicSharedMem(Float32, (m,p))
sB = @cuDynamicSharedMem(Float32, p, sizeof(sA))
# Initialize shared memory for A
for j in 1:p
@inbounds sA[tx, j] = A[tx, j]
end
# Initialize shared memory for B
if tx == 1
for j in 1:p
@inbounds sB[j] = B[j]
end
end
# Wait until all threads finish preloading
sync_threads()
for j in 1:2000
Cvalue = 0.0f0
if tx <= m
for i = 1:p
@inbounds Cvalue += sA[tx, i] * sB[i]
#@cuprintln("tx $tx, i $i, res: $(A[tx, i] * B[i])")
end
@inbounds C[tx] = Cvalue
#@cuprintln(C[tx])
end
end
return nothing
end
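[Also not from the thread: a quick host-side sanity check that the optimized kernel still computes C = A * B. Every one of the 2000 iterations recomputes the same product, so the final C should match the CPU result:]
using CUDA, Test

m, p = 100, 100
A = CUDA.rand(Float32, m, p)
B = CUDA.rand(Float32, p)
C = CUDA.zeros(Float32, m)

shmem_bytes = sizeof(Float32) * (m * p + p)
@cuda threads=m blocks=112 shmem=shmem_bytes kernel_matmul_fast(C, A, B, m, p)
synchronize()

# Matrix-vector product on the CPU for comparison
@test Array(C) ≈ Array(A) * Array(B)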
Me:
But that’s totally fine; about 4 ms per call is way faster than I expected
Tim Besard:
great!
Me:
Thank you again