[Also copied from Slack:]
Me:
Quick follow-up question: We have implemented the same kernel_matmul_fast function in Python using Numba+Cuda. When profiling with nvprof, we measured that the Julia+Cuda implementation is about 30 percent slower than the Numba+Cuda version on a Tesla V100 GPU (13ms for Julia vs. 10ms for Python averaged over 100 kernel calls and m=p=100, blocks=112). Are there any “easy” performance optimizations that we can use to speed up the Julia implementation to reach the performance of the Numba+Cuda implementation? Again our current Julia+Cuda implementation of the kernel function:
""" Compute C = A * B """
function kernel_matmul_fast(C, A, B, m, p)
  tx = threadIdx().x
  # Important: The second @cuDynamicSharedMem allocation needs an offset of sizeof(sA), as both use a single kernel-level buffer
  sA = @cuDynamicSharedMem(Float32, (m,p))
  sB = @cuDynamicSharedMem(Float32, p, sizeof(sA))
  # Initialize shared memory for A
  for j in 1:p
    sA[tx, j] = A[tx, j]
  end
  # Initialize shared memory for B
  if tx == 1
    for j in 1:p
        sB[j] = B[j]
    end
  end
  # Wait until all threads finish preloading
  sync_threads()
  for j in 1:2000
    Cvalue = 0.0f0
    if tx <= m
      for i = 1:p 
        Cvalue += sA[tx, i] * sB[i]
      end
      C[tx] = Cvalue
    end
  end
  
  return nothing
end
This is our equivalent Python Numba+Cuda code for the performance comparison (beeing about 30 percent faster than the above Julia+Cuda implementation):
@cuda.jit
def matmul_fast(A, B, C):
    """Perform square matrix multiplication of C = A * B
    """
    tx = cuda.threadIdx.x
    sA = cuda.shared.array(shape=(m,p), dtype=float32)
    sB = cuda.shared.array(shape=p, dtype=float32)
    for j in range(p):
        sA[tx,j] = A[tx,j]
    if tx == 0:
        for j in range(p):
            sB[j] = B[j]
    # Wait until all threads finish preloading
    cuda.syncthreads()
    for j in range(2*1000):
        Cvalue = 0.0
        if tx < m:
            for i in range(p):
                Cvalue += sA[tx, i] * sB[i]
            C[tx] = Cvalue
Tim Besard:
@inbounds
FWIW, you can use a type param for the shared memory type derived from the C, A and B args
other than that, if you’re compute bound you might lose some performance over Julia’s integers being 64-bit wide, whereas Numba may be using 32-bit integers (which can execute concurrently with floating-point operations)
but you’d need to profile for that
also make sure it’s actually the kernel that’s taking 30% longer, and not just the host code
generally, low-level kernels are as fast as CUDA C, so I’d be surprised if this difference can’t be worked around
Me:
Wow 3,98 ms using @inbounds 
==4011247== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  398.30ms       100  3.9830ms  2.7499ms  5.2974ms  julia_kernel_matmul_fast_2131(CuDeviceArray<Float32, int=1, int=1>, CuDeviceArray<Float32, int=2, int=1>, CuDeviceArray<Float32, int=1, int=1>, Int64, CuDeviceArray<Float32, int=1, int=1>)
                    0.00%  13.472us         4  3.3680us  1.7280us  6.8480us  [CUDA memcpy HtoD]
                    0.00%  3.1360us         1  3.1360us  3.1360us  3.1360us  [CUDA memcpy DtoH]
      API calls:   62.94%  400.00ms         1  400.00ms  400.00ms  400.00ms  cuDevicePrimaryCtxRetain
                   25.21%  160.24ms    479530     334ns     267ns  430.17us  cuStreamQuery
                    9.11%  57.922ms    479662     120ns      88ns  430.55us  cuCtxGetCurrent
                    0.96%  6.1138ms         3  2.0379ms  1.5370us  6.1054ms    
                    0.76%  4.8186ms         1  4.8186ms  4.8186ms  4.8186ms  cuModuleLoadDataEx
                    0.41%  2.6278ms       100  26.278us  4.7700us  1.9696ms  cuLaunchKernel
                    0.30%  1.8898ms         1  1.8898ms  1.8898ms  1.8898ms  cuMemHostAlloc
                    0.17%  1.0600ms         1  1.0600ms  1.0600ms  1.0600ms  cuModuleUnload
                    0.08%  490.29us         1  490.29us  490.29us  490.29us    
                    0.01%  74.922us         3  24.974us  8.3730us  45.609us  cuMemGetInfo
                    0.01%  73.957us         4  18.489us  4.7250us  34.344us  cuMemcpyHtoDAsync
                    0.01%  39.144us         1  39.144us  39.144us  39.144us  cuMemcpyDtoHAsync
                    0.01%  32.555us         1  32.555us  32.555us  32.555us  cuCtxSynchronize
                    0.00%  27.037us         1  27.037us  27.037us  27.037us  cuStreamDestroy
                    0.00%  26.756us         1  26.756us  26.756us  26.756us  cuStreamCreate
                    0.00%  26.085us         3  8.6950us  1.7840us  15.780us
                    0.00%  13.795us         5  2.7590us     222ns  10.191us  cuCtxPushCurrent
                    0.00%  11.778us         7  1.6820us     302ns  5.2360us  cuDeviceGetCount
                    0.00%  11.164us        12     930ns     148ns  5.9480us  cuDeviceGetAttribute
                    0.00%  9.5500us         4  2.3870us     353ns  4.9700us  cuPointerGetAttribute
                    0.00%  9.2110us        15     614ns      82ns  2.0580us  cuDriverGetVersion
                    0.00%  6.7300us         7     961ns     159ns  1.9960us  cuCtxSetCurrent
                    0.00%  3.3900us         3  1.1300us     807ns  1.4420us  cuDeviceGet
                    0.00%  3.0010us         9     333ns     138ns     584ns  cuCtxGetDevice
                    0.00%  2.4110us         5     482ns     193ns  1.0440us  cuCtxPopCurrent
                    0.00%  1.7210us         1  1.7210us  1.7210us  1.7210us  cuModuleGetFunction
                    0.00%     761ns         1     761ns     761ns     761ns  cuModuleGetGlobal
                    0.00%     595ns         1     595ns     595ns     595ns  cuMemHostGetDevicePointer
                    0.00%     468ns         1     468ns     468ns     468ns
I just added one @inbounds here:
 for j in 1:2000
    Cvalue = 0.0f0
    if tx <= m
      for i = 1:p
        @inbounds Cvalue += sA[tx, i] * sB[i]
      end
      C[tx] = Cvalue
    end
  end
Tim Besard:
the shared memory accesses are also bounds checked
Me:
Ok I have added some more inbounds, the perfomance remains roughly the same:
""" Compute C = A * B """
function kernel_matmul_fast(C, A, B, m, p)
  tx = threadIdx().x
  # Important: The second @cuDynamicSharedMem allocation needs an offset of sizeof(sA), as it uses a single kernel-level buffer
  sA = @cuDynamicSharedMem(Float32, (m,p))
  sB = @cuDynamicSharedMem(Float32, p, sizeof(sA))
  # Initialize shared memory for A
  for j in 1:p
     @inbounds sA[tx, j] = A[tx, j]
  end
  # Initialize shared memory for B
  if tx == 1
    for j in 1:p
       @inbounds sB[j] = B[j]
    end
  end
  # Wait until all threads finish preloading
  sync_threads()
  for j in 1:2000
    Cvalue = 0.0f0
    if tx <= m
      for i = 1:p
        @inbounds Cvalue += sA[tx, i] * sB[i]
        #@cuprintln("tx $tx, i $i, res: $(A[tx, i] * B[i])")
      end
      @inbounds C[tx] = Cvalue
      #@cuprintln(C[tx])
    end
  end
  return nothing
end
Me:
But that’s totally fine, about 4 ms per call is way faster than I expected
Tim Besard:
great! 
Me:
Thank you again 