KernelAbstractions is slower than CUDA


For a simple multiplication as in the example below, I see that the kernel written with KernelAbstractions is almost twice as slow as the one written with pure CUDA. Am I doing something wrong, or does KA indeed introduce such a big overhead?

using BenchmarkTools
using CUDA
using CUDAKernels
using KernelAbstractions

# Kernel Abstractions ----------------------------------------------------------
@kernel function mulcab_ka_kernel(C, A, B)
    I = @index(Global)
    C[I] = A[I] * B[I]
end

function mulcab_ka(device, C, A, B)
    kernel = mulcab_ka_kernel(device)
    event = kernel(C, A, B, ndrange=size(C))
    wait(event)
    return nothing
end

# CUDA -------------------------------------------------------------------------
function mulcab_cuda_kernel(C, A, B)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i = id:stride:length(C)
        C[i] = A[i] * B[i]
    end
    return nothing

function mulcab_cuda(C, A, B)
    N = length(C)
    ckernel = @cuda launch=false mulcab_cuda_kernel(C, A, B)
    config = launch_configuration(ckernel.fun)
    threads = min(N, config.threads)
    blocks = cld(N, threads)
    CUDA.@sync ckernel(C, A, B; threads=threads, blocks=blocks)
    return nothing
end

# Test -------------------------------------------------------------------------
A = CUDA.ones(1024, 1024) * 2
B = CUDA.ones(1024, 1024) * 3
C = CUDA.zeros(1024, 1024)

@btime mulcab_ka(CUDADevice(), $C, $A, $B)
@assert all(C .== 6)

@btime mulcab_cuda($C, $A, $B)
@assert all(C .== 6)

#   99.187 μs (96 allocations: 4.56 KiB)
#   52.817 μs (5 allocations: 304 bytes)

Somebody… anybody…

Did you profile the kernels? The documentation has quite a section on that: Profiling · CUDA.jl. I’d first try NSight Systems to compare simple stuff like the launch configuration or memory usage, but with NSight Compute you can very easily add one kernel as the baseline to compare the other to it.

Thank you. I will try to use it as well. But, I guess, I do not have enough knowledge to understand the output.

Meanwhile, is there something wrong with the simple benchmarking that I use? I follow the usual recommendations and use @btime together with CUDA.@sync to measure the run time in the case of CUDA. In the case of KernelAbstractions I rely on wait(event) as the synchronization mechanism.

Is it expected behaviour that KA is slower than CUDA, or should they both give the same result?

As far as I understand, under the hood KA creates some CUDA kernel. Is it possible somehow to see it (something similar to @macroexpand maybe)?

Use the @device_code_... macros.
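For instance, a minimal sketch of inspecting the device code (assuming the kernel and arrays from the original post are already defined):

```julia
# Build the KA kernel for the CUDA device, as in the original post.
kernel = mulcab_ka_kernel(CUDADevice())

# Print the LLVM IR generated for the device-side function:
CUDA.@device_code_llvm kernel(C, A, B, ndrange=size(C))

# Or the PTX assembly, which is closer to what actually runs on the GPU:
CUDA.@device_code_ptx kernel(C, A, B, ndrange=size(C))
```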

what is the default workgroup size when you don’t pass one in explicitly? have you tried benchmark comparisons with e.g. mulcab_ka_kernel(device,256)?

I repeated the benchmarks for the current version of KernelAbstractions and CUDA (increased the array sizes from 1024x1024 to 2048x2048 for higher precision) and found the following times:

  199.781 μs (71 allocations: 3.44 KiB)   # KernelAbstractions
  187.110 μs (23 allocations: 1.19 KiB)   # CUDA

That is, nowadays the times have become close to each other.
Following your advice with mulcab_ka_kernel(device, 256) I get

  173.701 μs (48 allocations: 2.58 KiB)   # KernelAbstractions
  189.318 μs (23 allocations: 1.19 KiB)   # CUDA

which means that KernelAbstractions even outperforms CUDA. If, in the original code for the CUDA kernel configuration, I replace blocks = cld(N, threads) with blocks = min(cld(N, threads), config.blocks), the times become equal.
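For illustration, here are the two block counts for a 2048x2048 array; the config.blocks value below is hypothetical, as the real one comes from launch_configuration and depends on the device:

```julia
N = 2048 * 2048           # number of elements
threads = 1024            # threads per block reported by launch_configuration
config_blocks = 128       # hypothetical occupancy limit from launch_configuration

blocks_uncapped = cld(N, threads)                      # 4096 blocks, one element per thread
blocks_capped   = min(cld(N, threads), config_blocks)  # 128 blocks; the grid-stride loop covers the rest
```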

what is the default workgroup size when you don’t pass one in explicitly?

This is an interesting question. First of all, I do not understand what the workgroup size is. Is it just the number of threads? Or something else? Another question, for which I did not find an answer in the KernelAbstractions documentation, is how to choose this workgroup size automatically. For CUDA kernels I can call launch_configuration, which will tell me the optimal number of threads and blocks. Is there a similar approach for KernelAbstractions?
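One CUDA-specific workaround, sketched here from the code earlier in the thread (and therefore not cross-platform), would be to reuse launch_configuration from CUDA.jl to pick the workgroup size for the KA kernel:

```julia
# Query the launch configuration of the equivalent pure-CUDA kernel...
ckernel = @cuda launch=false mulcab_cuda_kernel(C, A, B)
config = launch_configuration(ckernel.fun)

# ...and pass its thread count to the KA kernel constructor as the workgroupsize.
kernel = mulcab_ka_kernel(CUDADevice(), min(length(C), config.threads))
event = kernel(C, A, B, ndrange=size(C))
wait(event)
```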

workgroup is actually documented: “A workgroup is called a block in NVIDIA CUDA and designates a group of threads acting in parallel, preferably in lockstep.” so presumably workgroupsize is then the number of threads in the block.

there are references throughout the code to DynamicSize, but so far i have not been able to figure out if or how that chooses the workgroupsize automatically.

one at least needs a way to know the max threads per block on a given device, since exceeding it when manually specifying the workgroupsize crashes the launch. on Nvidia one can call CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK), but then your code is not cross-platform, which defeats the whole point of KernelAbstractions.
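a sketch of capping a requested workgroupsize at the device limit on Nvidia (the attribute query is the CUDA-specific part):

```julia
requested = 1024
max_threads = CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)

# never ask for more threads per block than the device supports, or the launch crashes
kernel = mulcab_ka_kernel(CUDADevice(), min(requested, max_threads))
```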

i found the code which sets the number of threads and blocks to use, and so far as i can tell it maximizes the threads per block. by doing so, it minimizes the number of blocks used.

for my test case and on my machine, this results in 1024 threads per block and only 16 of the 72 SMs being used. if instead i manually specify the workgroupsize to be 256, i use almost all of the SMs and the execution time is faster.
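the effect is easy to see with rough arithmetic (72 SMs as on my machine; the total thread count is just an example that yields 16 blocks at the maximal workgroupsize):

```julia
total_threads = 16 * 1024   # example ndrange
sms = 72

blocks_at_1024 = cld(total_threads, 1024)  # 16 blocks -> at most 16 of the 72 SMs busy
blocks_at_256  = cld(total_threads, 256)   # 64 blocks -> up to 64 of the 72 SMs busy
```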

there has been some thought given to changing the heuristic to maximize the number of blocks instead of the threads per block. i would wholly support this based on my limited benchmarking.