# GPU kernel function for fast matrix multiplication using shared memory

[Copied from a Slack conversation:]

Me (Daniel):
Hello, we are trying to implement a matrix multiplication in a kernel function using shared memory. However, our `kernel_matmul_fast` function computes a wrong result vector C:

``````
# Matrix multiplication on the GPU in Julia

using CUDA

"""Compute C = A * B fast using shared memory"""
function kernel_matmul_fast(C, A, B, m, p)
    tx = threadIdx().x  # thread index within the block (one thread per row of A)

    sA = @cuDynamicSharedMem(Float32, (m, p))
    sB = @cuDynamicSharedMem(Float32, p)

    # Initialize shared memory for A
    for j in 1:p
        sA[tx, j] = A[tx, j]
    end

    # Initialize shared memory for B
    if tx == 1
        for j in 1:p
            sB[j] = B[j]
        end
    end

    # Wait until all threads have filled shared memory
    sync_threads()

    Cvalue = 0.0f0
    for i in 1:p
        Cvalue += sA[tx, i] * sB[i]
    end
    C[tx] = Cvalue

    return nothing
end

""" Compute C = A * B """
function kernel_matmul(C, A, B, p)

Cvalue = 0.0f0
for i = 1:p
Cvalue += A[tx, i] * B[i]
end
C[tx] = Cvalue

return nothing
end

m, p = 5, 5 # matrix sizes: C[m] = A[m,p] * B[p]

A, B, C = rand(Float32, m, p), rand(Float32, p), rand(Float32, m)
Ad, Bd, Cd1, Cd2 = CuArray(A), CuArray(B), CuArray(C), CuArray(C)

# Kernel with shared memory: one block of m threads, requesting dynamic
# shared memory for sA (m*p floats) and sB (p floats)
@cuda threads=m shmem=sizeof(Float32)*(m*p + p) kernel_matmul_fast(Cd1, Ad, Bd, m, p)
CUDA.synchronize()

# Kernel without shared memory
@cuda threads=m kernel_matmul(Cd2, Ad, Bd, p)
CUDA.synchronize()

# CPU
C = A*B

# CuBlas
Cd3 = Ad * Bd

println("Cd1 = $Cd1 (Kernel matrix multiplication with shared memory)")
println("Cd2 = $Cd2 (Kernel matrix multiplication without shared memory)")
println("C   = $C (CPU matrix multiplication)")
println("Cd3 = $Cd3 (CuBlas matrix multiplication)")
``````

I get the following output:

``````
Cd1 = Float32[0.7856118, 2.1241043, 1.7347267, 1.476275, 1.474459] (Kernel matrix multiplication with shared memory)
Cd2 = Float32[0.63489825, 2.001689, 1.6415144, 1.2010169, 1.379968] (Kernel matrix multiplication without shared memory)
C   = Float32[0.63489825, 2.0016887, 1.6415144, 1.2010168, 1.379968] (CPU matrix multiplication)
Cd3 = Float32[0.6348982, 2.0016887, 1.6415144, 1.2010169, 1.3799682] (CuBlas matrix multiplication)
``````

Since the other kernel function without shared memory, `kernel_matmul`, produces the correct result vector, I suspect that something is wrong with the shared memory initialization. I appreciate any help, as we have little experience developing kernel functions so far.
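Each thread `tx` in these kernels computes the dot product of row `tx` of A with the vector B. A minimal CPU sketch of that per-thread loop (NumPy, with sizes chosen to match the m = p = 5 example; the random data is illustrative, not the values from the post):

```python
import numpy as np

# Illustrative sizes matching the post's m = p = 5 example
m, p = 5, 5
rng = np.random.default_rng(0)
A = rng.random((m, p)).astype(np.float32)
B = rng.random(p).astype(np.float32)

# What each GPU thread tx computes: C[tx] = sum_i A[tx, i] * B[i]
C = np.empty(m, dtype=np.float32)
for tx in range(m):
    Cvalue = np.float32(0.0)
    for i in range(p):
        Cvalue += A[tx, i] * B[i]
    C[tx] = Cvalue

assert np.allclose(C, A @ B)  # matches the library matrix-vector product
```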

Tim Besard:
you need to offset the second `@cuDynamicSharedMem` allocation, as both views use a single kernel-level buffer, i.e.

``````
sA = @cuDynamicSharedMem(Float32, (m, p))
sB = @cuDynamicSharedMem(Float32, p, sizeof(sA))
``````
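Both views are slices of one dynamic shared-memory buffer, so `sB` has to start `sizeof(sA)` bytes in, and the kernel launch has to request the combined size. The byte arithmetic, sketched in plain Python (Float32 is 4 bytes):

```python
# Byte layout of the single dynamic shared-memory buffer (Float32 = 4 bytes)
m, p = 5, 5
elem = 4  # sizeof(Float32)

size_sA = m * p * elem            # sA occupies the first m*p*4 bytes
offset_sB = size_sA               # sB starts at sizeof(sA)
total_shmem = size_sA + p * elem  # bytes to request when launching the kernel

print(offset_sB, total_shmem)  # -> 100 120
```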

Me:
Thank you very much, it works now!


[Also copied from Slack:]

Me:
Quick follow-up question: We have implemented the same `kernel_matmul_fast` function in Python using Numba+CUDA. When profiling with nvprof, we measured that the Julia+CUDA implementation is about 30 percent slower than the Numba+CUDA version on a Tesla V100 GPU (13 ms for Julia vs. 10 ms for Python, averaged over 100 kernel calls with m = p = 100 and blocks = 112). Are there any "easy" performance optimizations we can use to speed up the Julia implementation to reach the performance of the Numba+CUDA version? Again, our current Julia+CUDA implementation of the kernel function:

``````""" Compute C = A * B """
function kernel_matmul_fast(C, A, B, m, p)

# Important: The second @cuDynamicSharedMem allocation needs an offset of sizeof(sA), as both use a single kernel-level buffer
sA = @cuDynamicSharedMem(Float32, (m,p))
sB = @cuDynamicSharedMem(Float32, p, sizeof(sA))

# Initialize shared memory for A
for j in 1:p
sA[tx, j] = A[tx, j]
end

# Initialize shared memory for B
if tx == 1
for j in 1:p
sB[j] = B[j]
end
end

for j in 1:2000
Cvalue = 0.0f0

if tx <= m
for i = 1:p
Cvalue += sA[tx, i] * sB[i]
end
C[tx] = Cvalue
end
end

return nothing
end
``````
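For scale: with m = p = 100, 112 blocks, and the 2000-iteration benchmark loop, one kernel call performs the following number of multiply-adds (back-of-envelope, assuming m active threads per block as in the `tx <= m` guard):

```python
# Benchmark parameters from the post: m = p = 100, blocks = 112, 2000 repetitions
m, p, blocks, reps = 100, 100, 112, 2000

fma_per_thread = reps * p                # inner dot product, repeated 2000 times
fma_total = fma_per_thread * m * blocks  # m active threads per block

print(fma_total)  # -> 2240000000 multiply-adds per kernel call
```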

This is our equivalent Python Numba+CUDA code for the performance comparison (being about 30 percent faster than the above Julia+CUDA implementation):

``````
from numba import cuda, float32

# m and p must be compile-time constants for cuda.shared.array
m, p = 100, 100

@cuda.jit
def matmul_fast(A, B, C):
    """Perform matrix multiplication C = A * B using shared memory"""
    tx = cuda.threadIdx.x

    sA = cuda.shared.array(shape=(m, p), dtype=float32)
    sB = cuda.shared.array(shape=p, dtype=float32)

    # Initialize shared memory for A
    for j in range(p):
        sA[tx, j] = A[tx, j]

    # Initialize shared memory for B
    if tx == 0:
        for j in range(p):
            sB[j] = B[j]

    cuda.syncthreads()

    # Repeat the multiplication 2000 times for benchmarking
    for j in range(2 * 1000):
        Cvalue = 0.0

        if tx < m:
            for i in range(p):
                Cvalue += sA[tx, i] * sB[i]

            C[tx] = Cvalue
``````

Tim Besard:
`@inbounds`

FWIW, you can use a type param for the shared memory type derived from the C, A and B args

other than that, if you're compute bound you might lose some performance over Julia's integers being 64-bit wide, whereas Numba may be using 32-bit integers (which can execute concurrently with floating-point operations)

but you'd need to profile for that

also make sure it's actually the kernel that's taking 30% longer, and not just the host code

generally, low-level kernels are as fast as CUDA C, so I'd be surprised if this difference can't be worked around

Me:
Wow, 3.98 ms using `@inbounds`!

``````
==4011247== Profiling result:
Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:  100.00%  398.30ms       100  3.9830ms  2.7499ms  5.2974ms  julia_kernel_matmul_fast_2131(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=2, int=1>, CuDeviceArray<Float32, int=1, int=1>, Int64, CuDeviceArray<Float32, int=1, int=1>)
0.00%  13.472us         4  3.3680us  1.7280us  6.8480us  [CUDA memcpy HtoD]
0.00%  3.1360us         1  3.1360us  3.1360us  3.1360us  [CUDA memcpy DtoH]
API calls:   62.94%  400.00ms         1  400.00ms  400.00ms  400.00ms  cuDevicePrimaryCtxRetain
25.21%  160.24ms    479530     334ns     267ns  430.17us  cuStreamQuery
9.11%  57.922ms    479662     120ns      88ns  430.55us  cuCtxGetCurrent
0.96%  6.1138ms         3  2.0379ms  1.5370us  6.1054ms
0.76%  4.8186ms         1  4.8186ms  4.8186ms  4.8186ms  cuModuleLoadDataEx
0.41%  2.6278ms       100  26.278us  4.7700us  1.9696ms  cuLaunchKernel
0.30%  1.8898ms         1  1.8898ms  1.8898ms  1.8898ms  cuMemHostAlloc
0.17%  1.0600ms         1  1.0600ms  1.0600ms  1.0600ms  cuModuleUnload
0.08%  490.29us         1  490.29us  490.29us  490.29us
0.01%  74.922us         3  24.974us  8.3730us  45.609us  cuMemGetInfo
0.01%  73.957us         4  18.489us  4.7250us  34.344us  cuMemcpyHtoDAsync
0.01%  39.144us         1  39.144us  39.144us  39.144us  cuMemcpyDtoHAsync
0.01%  32.555us         1  32.555us  32.555us  32.555us  cuCtxSynchronize
0.00%  27.037us         1  27.037us  27.037us  27.037us  cuStreamDestroy
0.00%  26.756us         1  26.756us  26.756us  26.756us  cuStreamCreate
0.00%  26.085us         3  8.6950us  1.7840us  15.780us
0.00%  13.795us         5  2.7590us     222ns  10.191us  cuCtxPushCurrent
0.00%  11.778us         7  1.6820us     302ns  5.2360us  cuDeviceGetCount
0.00%  11.164us        12     930ns     148ns  5.9480us  cuDeviceGetAttribute
0.00%  9.5500us         4  2.3870us     353ns  4.9700us  cuPointerGetAttribute
0.00%  9.2110us        15     614ns      82ns  2.0580us  cuDriverGetVersion
0.00%  6.7300us         7     961ns     159ns  1.9960us  cuCtxSetCurrent
0.00%  3.3900us         3  1.1300us     807ns  1.4420us  cuDeviceGet
0.00%  3.0010us         9     333ns     138ns     584ns  cuCtxGetDevice
0.00%  2.4110us         5     482ns     193ns  1.0440us  cuCtxPopCurrent
0.00%  1.7210us         1  1.7210us  1.7210us  1.7210us  cuModuleGetFunction
0.00%     761ns         1     761ns     761ns     761ns  cuModuleGetGlobal
0.00%     595ns         1     595ns     595ns     595ns  cuMemHostGetDevicePointer
0.00%     468ns         1     468ns     468ns     468ns
``````
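The 3.98 ms figure is the total GPU-activity time from the nvprof summary divided by the 100 kernel calls:

```python
total_ms = 398.30  # "GPU activities" total from the nvprof output
calls = 100

avg_ms = round(total_ms / calls, 3)  # average time per kernel call
print(avg_ms)  # -> 3.983
```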

I just added one `@inbounds` here:

``````
for j in 1:2000
    Cvalue = 0.0f0

    if tx <= m
        for i in 1:p
            @inbounds Cvalue += sA[tx, i] * sB[i]
        end
        C[tx] = Cvalue
    end
end
``````

Tim Besard:
the shared memory accesses are also bounds checked

Me:
OK, I have added some more `@inbounds`; the performance remains roughly the same:

``````""" Compute C = A * B """
function kernel_matmul_fast(C, A, B, m, p)

# Important: The second @cuDynamicSharedMem allocation needs an offset of sizeof(sA), as it uses a single kernel-level buffer
sA = @cuDynamicSharedMem(Float32, (m,p))
sB = @cuDynamicSharedMem(Float32, p, sizeof(sA))

# Initialize shared memory for A
for j in 1:p
@inbounds sA[tx, j] = A[tx, j]
end

# Initialize shared memory for B
if tx == 1
for j in 1:p
@inbounds sB[j] = B[j]
end
end

for j in 1:2000
Cvalue = 0.0f0

if tx <= m
for i = 1:p
@inbounds Cvalue += sA[tx, i] * sB[i]
#@cuprintln("tx \$tx, i \$i, res: \$(A[tx, i] * B[i])")
end
@inbounds C[tx] = Cvalue
#@cuprintln(C[tx])
end
end

return nothing
end
``````

Me:
But that's totally fine; about 4 ms per call is way faster than I expected.

Tim Besard:
great!

Me:
Thank you again
