KernelAbstractions is slower than CUDA


For a simple elementwise multiplication, as in the example below, I see that the kernel written with KernelAbstractions is almost twice as slow as the one written with pure CUDA. Am I doing something wrong, or does KA indeed introduce such a big overhead?

using BenchmarkTools
using CUDA
using CUDAKernels
using KernelAbstractions

# Kernel Abstractions ----------------------------------------------------------
@kernel function mulcab_ka_kernel(C, A, B)
    I = @index(Global)
    C[I] = A[I] * B[I]
end

function mulcab_ka(device, C, A, B)
    kernel = mulcab_ka_kernel(device)
    event = kernel(C, A, B, ndrange=size(C))
    wait(event)  # block until the kernel has finished on the device
    return nothing
end

# CUDA -------------------------------------------------------------------------
function mulcab_cuda_kernel(C, A, B)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i = id:stride:length(C)
        C[i] = A[i] * B[i]
    end
    return nothing
end

function mulcab_cuda(C, A, B)
    N = length(C)
    ckernel = @cuda launch=false mulcab_cuda_kernel(C, A, B)
    config = launch_configuration(ckernel.fun)
    threads = min(N, config.threads)
    blocks = cld(N, threads)
    CUDA.@sync ckernel(C, A, B; threads=threads, blocks=blocks)
    return nothing
end

# Test -------------------------------------------------------------------------
A = CUDA.ones(1024, 1024) * 2
B = CUDA.ones(1024, 1024) * 3
C = CUDA.zeros(1024, 1024)

@btime mulcab_ka(CUDADevice(), $C, $A, $B)
@assert all(C .== 6)

@btime mulcab_cuda($C, $A, $B)
@assert all(C .== 6)

#   99.187 μs (96 allocations: 4.56 KiB)
#   52.817 μs (5 allocations: 304 bytes)

Somebody… anybody…

Did you profile the kernels? The documentation has quite a section on that: Profiling · CUDA.jl. I’d first try NSight Systems to compare simple things like the launch configuration or memory usage, but with NSight Compute you can very easily set one kernel as the baseline and compare the other against it.

Thank you. I will try to use it as well. But I guess I do not have enough knowledge to understand the output.

Meanwhile, is there something wrong with the simple benchmarking I use? I follow the recommended practice and use @btime together with CUDA.@sync to measure the run time in the CUDA case. In the KernelAbstractions case I rely on wait(event) as the synchronization mechanism.
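To be explicit, the two synchronization patterns I mean look roughly like this (a minimal self-contained sketch; it requires a working CUDA GPU):

```julia
using BenchmarkTools
using CUDA
using CUDAKernels
using KernelAbstractions

@kernel function mul_kernel(C, A, B)
    I = @index(Global)
    C[I] = A[I] * B[I]
end

A = CUDA.ones(1024, 1024)
B = CUDA.ones(1024, 1024)
C = CUDA.zeros(1024, 1024)

# KernelAbstractions: launching returns an event; wait(event) blocks
# until the kernel has actually finished on the device.
kern = mul_kernel(CUDADevice())
@btime wait(kern($C, $A, $B, ndrange=size($C)))

# CUDA.jl: CUDA.@sync blocks until all enqueued GPU work is done,
# so the full kernel execution is timed, not just the launch.
@btime CUDA.@sync ($C .= $A .* $B)
```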

Is it expected behaviour that KA is slower than CUDA, or should they both give the same result?

As far as I understand, under the hood KA creates some CUDA kernel. Is it possible somehow to see it (something similar to @macroexpand maybe)?

Use the @device_code_... macros.
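For instance, something along these lines should work (a sketch, assuming a working CUDA setup; CUDA.jl exports @device_code_llvm, @device_code_ptx, and @device_code_sass):

```julia
using CUDA
using CUDAKernels
using KernelAbstractions

@kernel function mul_kernel(C, A, B)
    I = @index(Global)
    C[I] = A[I] * B[I]
end

C = CUDA.zeros(16)
A = CUDA.ones(16)
B = CUDA.ones(16)

# Dump the LLVM IR (use @device_code_ptx for PTX instead) of every GPU
# kernel compiled while the wrapped expression runs -- including the
# CUDA kernel that KernelAbstractions generates under the hood.
CUDA.@device_code_llvm wait(mul_kernel(CUDADevice())(C, A, B, ndrange=size(C)))
```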