# GPU kernel function for fast matrix multiplication using shared memory

[Copied from a Slack conversation:]

Me (Daniel):
Hello, we are trying to implement a matrix multiplication in a kernel function using shared memory. However, our `kernel_matmul_fast` function computes a wrong result vector C:

``````
# Matrix multiplication on the GPU in Julia

using CUDA

"""Compute C = A * B fast using shared memory"""
function kernel_matmul_fast(C, A, B, m, p)
    tx = threadIdx().x  # thread index within the block (one thread per row of A)

    sA = @cuDynamicSharedMem(Float32, (m, p))
    sB = @cuDynamicSharedMem(Float32, p)

    # Initialize shared memory for A
    for j in 1:p
        sA[tx, j] = A[tx, j]
    end

    # Initialize shared memory for B
    if tx == 1
        for j in 1:p
            sB[j] = B[j]
        end
    end

    # Wait until all threads have filled shared memory
    sync_threads()

    Cvalue = 0.0f0
    for i in 1:p
        Cvalue += sA[tx, i] * sB[i]
    end
    C[tx] = Cvalue

    return nothing
end

""" Compute C = A * B """
function kernel_matmul(C, A, B, p)

Cvalue = 0.0f0
for i = 1:p
Cvalue += A[tx, i] * B[i]
end
C[tx] = Cvalue

return nothing
end

m, p = 5, 5 # matrix sizes: C[m] = A[m,p] * B[p]

A, B, C = rand(Float32, m, p), rand(Float32, p), rand(Float32, m)
Ad, Bd, Cd1, Cd2 = CuArray(A), CuArray(B), CuArray(C), CuArray(C)

# Kernel with shared memory: one block of m threads, requesting dynamic
# shared memory for sA (m*p floats) and sB (p floats)
@cuda threads=m shmem=sizeof(Float32)*(m*p + p) kernel_matmul_fast(Cd1, Ad, Bd, m, p)
CUDA.synchronize()

# Kernel without shared memory
@cuda threads=m kernel_matmul(Cd2, Ad, Bd, p)
CUDA.synchronize()

# CPU
C = A*B

# CuBlas
Cd3 = Ad * Bd

println("Cd1 = $Cd1 (Kernel matrix multiplication with shared memory)")
println("Cd2 = $Cd2 (Kernel matrix multiplication without shared memory)")
println("C   = $C (CPU matrix multiplication)")
println("Cd3 = $Cd3 (CuBlas matrix multiplication)")
``````

I get the following output:

``````
Cd1 = Float32[0.7856118, 2.1241043, 1.7347267, 1.476275, 1.474459] (Kernel matrix multiplication with shared memory)
Cd2 = Float32[0.63489825, 2.001689, 1.6415144, 1.2010169, 1.379968] (Kernel matrix multiplication without shared memory)
C   = Float32[0.63489825, 2.0016887, 1.6415144, 1.2010168, 1.379968] (CPU matrix multiplication)
Cd3 = Float32[0.6348982, 2.0016887, 1.6415144, 1.2010169, 1.3799682] (CuBlas matrix multiplication)
``````

Since the other kernel function without shared memory, `kernel_matmul`, produces the correct result vector, I suspect that something is wrong with the shared memory initialization. I appreciate any help, as we have little experience developing kernel functions so far.
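Each thread `tx` in these kernels computes the dot product of row `tx` of A with the vector B. A minimal CPU sketch of that per-thread loop (NumPy, with sizes chosen to match the m = p = 5 example; the random data is illustrative, not the values from the post):

```python
import numpy as np

# Illustrative sizes matching the post's m = p = 5 example
m, p = 5, 5
rng = np.random.default_rng(0)
A = rng.random((m, p)).astype(np.float32)
B = rng.random(p).astype(np.float32)

# What each GPU thread tx computes: C[tx] = sum_i A[tx, i] * B[i]
C = np.empty(m, dtype=np.float32)
for tx in range(m):
    Cvalue = np.float32(0.0)
    for i in range(p):
        Cvalue += A[tx, i] * B[i]
    C[tx] = Cvalue

assert np.allclose(C, A @ B)  # matches the library matrix-vector product
```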

Tim Besard:
you need to offset the second `@cuDynamicSharedMem` allocation, as both views use a single kernel-level buffer, i.e.

``````
sA = @cuDynamicSharedMem(Float32, (m, p))
sB = @cuDynamicSharedMem(Float32, p, sizeof(sA))
``````
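Both views are slices of one dynamic shared-memory buffer, so `sB` has to start `sizeof(sA)` bytes in, and the kernel launch has to request the combined size. The byte arithmetic, sketched in plain Python (Float32 is 4 bytes):

```python
# Byte layout of the single dynamic shared-memory buffer (Float32 = 4 bytes)
m, p = 5, 5
elem = 4  # sizeof(Float32)

size_sA = m * p * elem            # sA occupies the first m*p*4 bytes
offset_sB = size_sA               # sB starts at sizeof(sA)
total_shmem = size_sA + p * elem  # bytes to request when launching the kernel

print(offset_sB, total_shmem)  # -> 100 120
```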

Me:
Thank you very much, it works now!


[Also copied from Slack:]

Me:
Quick follow-up question: We have implemented the same `kernel_matmul_fast` function in Python using Numba+CUDA. When profiling with nvprof, we measured that the Julia+CUDA implementation is about 30 percent slower than the Numba+CUDA version on a Tesla V100 GPU (13 ms for Julia vs. 10 ms for Python, averaged over 100 kernel calls with m = p = 100 and blocks = 112). Are there any "easy" performance optimizations we can use to speed up the Julia implementation to reach the performance of the Numba+CUDA version? Again, our current Julia+CUDA implementation of the kernel function:

``````""" Compute C = A * B """
function kernel_matmul_fast(C, A, B, m, p)

# Important: The second @cuDynamicSharedMem allocation needs an offset of sizeof(sA), as both use a single kernel-level buffer
sA = @cuDynamicSharedMem(Float32, (m,p))
sB = @cuDynamicSharedMem(Float32, p, sizeof(sA))

# Initialize shared memory for A
for j in 1:p
sA[tx, j] = A[tx, j]
end

# Initialize shared memory for B
if tx == 1
for j in 1:p
sB[j] = B[j]
end
end

for j in 1:2000
Cvalue = 0.0f0

if tx <= m
for i = 1:p
Cvalue += sA[tx, i] * sB[i]
end
C[tx] = Cvalue
end
end

return nothing
end
``````
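For scale: with m = p = 100, 112 blocks, and the 2000-iteration benchmark loop, one kernel call performs the following number of multiply-adds (back-of-envelope, assuming m active threads per block as in the `tx <= m` guard):

```python
# Benchmark parameters from the post: m = p = 100, blocks = 112, 2000 repetitions
m, p, blocks, reps = 100, 100, 112, 2000

fma_per_thread = reps * p                # inner dot product, repeated 2000 times
fma_total = fma_per_thread * m * blocks  # m active threads per block

print(fma_total)  # -> 2240000000 multiply-adds per kernel call
```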

This is our equivalent Python Numba+CUDA code for the performance comparison (being about 30 percent faster than the above Julia+CUDA implementation):

``````
from numba import cuda, float32

# m and p must be compile-time constants for cuda.shared.array
m, p = 100, 100

@cuda.jit
def matmul_fast(A, B, C):
    """Perform matrix multiplication C = A * B using shared memory"""
    tx = cuda.threadIdx.x

    sA = cuda.shared.array(shape=(m, p), dtype=float32)
    sB = cuda.shared.array(shape=p, dtype=float32)

    # Initialize shared memory for A
    for j in range(p):
        sA[tx, j] = A[tx, j]

    # Initialize shared memory for B
    if tx == 0:
        for j in range(p):
            sB[j] = B[j]

    cuda.syncthreads()

    # Repeat the multiplication 2000 times for benchmarking
    for j in range(2 * 1000):
        Cvalue = 0.0

        if tx < m:
            for i in range(p):
                Cvalue += sA[tx, i] * sB[i]

            C[tx] = Cvalue
``````

Tim Besard:
`@inbounds`

FWIW, you can use a type param for the shared memory type derived from the C, A and B args

other than that, if you're compute bound you might lose some performance over Julia's integers being 64-bit wide, whereas Numba may be using 32-bit integers (which can execute concurrently with floating-point operations)

but you'd need to profile for that

also make sure it's actually the kernel that's taking 30% longer, and not just the host code

generally, low-level kernels are as fast as CUDA C, so I'd be surprised if this difference can't be worked around

Me:
Wow, 3.98 ms using `@inbounds`!

``````
==4011247== Profiling result:
Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:  100.00%  398.30ms       100  3.9830ms  2.7499ms  5.2974ms  julia_kernel_matmul_fast_2131(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=2, int=1>, CuDeviceArray<Float32, int=1, int=1>, Int64, CuDeviceArray<Float32, int=1, int=1>)
0.00%  13.472us         4  3.3680us  1.7280us  6.8480us  [CUDA memcpy HtoD]
0.00%  3.1360us         1  3.1360us  3.1360us  3.1360us  [CUDA memcpy DtoH]
API calls:   62.94%  400.00ms         1  400.00ms  400.00ms  400.00ms  cuDevicePrimaryCtxRetain
25.21%  160.24ms    479530     334ns     267ns  430.17us  cuStreamQuery
9.11%  57.922ms    479662     120ns      88ns  430.55us  cuCtxGetCurrent
0.96%  6.1138ms         3  2.0379ms  1.5370us  6.1054ms
0.76%  4.8186ms         1  4.8186ms  4.8186ms  4.8186ms  cuModuleLoadDataEx
0.41%  2.6278ms       100  26.278us  4.7700us  1.9696ms  cuLaunchKernel
0.30%  1.8898ms         1  1.8898ms  1.8898ms  1.8898ms  cuMemHostAlloc
0.17%  1.0600ms         1  1.0600ms  1.0600ms  1.0600ms  cuModuleUnload
0.08%  490.29us         1  490.29us  490.29us  490.29us
0.01%  74.922us         3  24.974us  8.3730us  45.609us  cuMemGetInfo
0.01%  73.957us         4  18.489us  4.7250us  34.344us  cuMemcpyHtoDAsync
0.01%  39.144us         1  39.144us  39.144us  39.144us  cuMemcpyDtoHAsync
0.01%  32.555us         1  32.555us  32.555us  32.555us  cuCtxSynchronize
0.00%  27.037us         1  27.037us  27.037us  27.037us  cuStreamDestroy
0.00%  26.756us         1  26.756us  26.756us  26.756us  cuStreamCreate
0.00%  26.085us         3  8.6950us  1.7840us  15.780us
0.00%  13.795us         5  2.7590us     222ns  10.191us  cuCtxPushCurrent
0.00%  11.778us         7  1.6820us     302ns  5.2360us  cuDeviceGetCount
0.00%  11.164us        12     930ns     148ns  5.9480us  cuDeviceGetAttribute
0.00%  9.5500us         4  2.3870us     353ns  4.9700us  cuPointerGetAttribute
0.00%  9.2110us        15     614ns      82ns  2.0580us  cuDriverGetVersion
0.00%  6.7300us         7     961ns     159ns  1.9960us  cuCtxSetCurrent
0.00%  3.3900us         3  1.1300us     807ns  1.4420us  cuDeviceGet
0.00%  3.0010us         9     333ns     138ns     584ns  cuCtxGetDevice
0.00%  2.4110us         5     482ns     193ns  1.0440us  cuCtxPopCurrent
0.00%  1.7210us         1  1.7210us  1.7210us  1.7210us  cuModuleGetFunction
0.00%     761ns         1     761ns     761ns     761ns  cuModuleGetGlobal
0.00%     595ns         1     595ns     595ns     595ns  cuMemHostGetDevicePointer
0.00%     468ns         1     468ns     468ns     468ns
``````
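The 3.98 ms figure is the total GPU-activity time from the nvprof summary divided by the 100 kernel calls:

```python
total_ms = 398.30  # "GPU activities" total from the nvprof output
calls = 100

avg_ms = round(total_ms / calls, 3)  # average time per kernel call
print(avg_ms)  # -> 3.983
```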

I just added one `@inbounds` here:

``````
for j in 1:2000
    Cvalue = 0.0f0

    if tx <= m
        for i in 1:p
            @inbounds Cvalue += sA[tx, i] * sB[i]
        end
        C[tx] = Cvalue
    end
end
``````

Tim Besard:
the shared memory accesses are also bounds checked

Me:
OK, I have added some more `@inbounds`; the performance remains roughly the same:

``````""" Compute C = A * B """
function kernel_matmul_fast(C, A, B, m, p)

# Important: The second @cuDynamicSharedMem allocation needs an offset of sizeof(sA), as it uses a single kernel-level buffer
sA = @cuDynamicSharedMem(Float32, (m,p))
sB = @cuDynamicSharedMem(Float32, p, sizeof(sA))

# Initialize shared memory for A
for j in 1:p
@inbounds sA[tx, j] = A[tx, j]
end

# Initialize shared memory for B
if tx == 1
for j in 1:p
@inbounds sB[j] = B[j]
end
end

for j in 1:2000
Cvalue = 0.0f0

if tx <= m
for i = 1:p
@inbounds Cvalue += sA[tx, i] * sB[i]
#@cuprintln("tx \$tx, i \$i, res: \$(A[tx, i] * B[i])")
end
@inbounds C[tx] = Cvalue
#@cuprintln(C[tx])
end
end

return nothing
end
``````

Me:
But that's totally fine; about 4 ms per call is way faster than I expected.

Tim Besard:
great!

Me:
Thank you again
