How to benchmark a function that uses KernelAbstractions kernels?

What is the correct way to benchmark a function that uses KernelAbstractions.jl (KA) kernels?

I noticed that KA/CUDA, i.e. KA with the CUDA.jl backend, requires an explicit KernelAbstractions.synchronize(dev) call.

Consider this

  @btime begin
    mykernel!($b)
    KernelAbstractions.synchronize($dev)
  end setup = (copyto!($b, $a0))
  @assert Array(b) == golden(a0)

where a0 is a host array, golden is the host function that performs the reference computation, b is a device array, and mykernel! is the device kernel attempting to do what golden does, but faster.
Is this the proper way to use BenchmarkTools.jl to measure time?

With KA/Metal, the synchronize call seems to add too much time, and without it the timings matched what I expected. I did not manage to demonstrate the need for it by breaking correctness when removing it :slight_smile:

For instance, this did not fail

using CUDA, KernelAbstractions

@kernel inbounds=true function addone!(a)
    i = @index(Global, Linear)
    a[i] += 1
end

function test(n)
    a = CuArray(rand(Float32, n, 1))
    c = copy(a) .+ 1          # the value a should have after the next kernel call
    dev = get_backend(a)

    kernel = addone!(dev, 1024)

    for i = 1:100000
        kernel(a, ndrange=n)  # asynchronous launch, no explicit synchronize
        if !all(a .== c)
            @warn "Failure!"
        end
        c .+= 1
    end
end

test(2^25)

The main idea was that device array c holds the value that device array a will only have once each kernel invocation has completed, so checking too early should fail. I ran it 100K times and saw no failure. Why?

Maybe the .== is also launching a kernel that makes sure everything is synchronized before and after it’s called? You could try writing your own reduction kernel for the !all(a .== c) check and see whether that changes things.
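
For what it’s worth, such a hand-written check might look like the sketch below (the check! kernel and the one-element flag array are hypothetical names, not something from the thread). It bypasses the mapreduce path, although reading the flag back to the host still synchronizes, and a kernel submitted to the same stream is in any case ordered after the kernel it checks.

@kernel inbounds=true function check!(flag, a, c)
    i = @index(Global, Linear)
    if a[i] != c[i]
        flag[1] = true    # every writer stores the same value
    end
end

# usage sketch, continuing from the test setup above:
# flag = CUDA.fill(false, 1)
# check!(dev, 1024)(flag, a, c, ndrange = length(a))
# Array(flag)[1] && @warn "Failure!"   # copying the flag back synchronizes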

You always need to synchronize, regardless of the back-end. Semantically, kernel launches are asynchronous, so without synchronization you’re benchmarking launch time instead of execution time. If this introduces performance problems with Metal.jl, please file an issue there.
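
To make that concrete, here is a minimal sketch along the lines of the benchmark from the first post (same names: mykernel!, b, a0, dev), comparing the two measurements:

using BenchmarkTools, KernelAbstractions

# times only the asynchronous launch
t_launch = @belapsed mykernel!($b) setup = (copyto!($b, $a0))

# times launch plus actual execution, which is usually what you want
t_exec = @belapsed begin
    mykernel!($b)
    KernelAbstractions.synchronize($dev)
end setup = (copyto!($b, $a0))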

With respect to your second example: all calls mapreduce, which automatically synchronizes. Generally, you won’t be able to observe the asynchronous nature of GPU operations when using array abstractions, as they ought to automatically synchronize when needed.
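
In other words, anything that has to produce a host-side value ends up waiting for the pending kernels, e.g. (a sketch reusing the names from test):

kernel(a, ndrange=n)   # asynchronous launch
ok = all(a .== c)      # reduces to a host Bool, so it waits for the work on a
s  = sum(a)            # likewise for any scalar reduction
h  = Array(a)          # and for copying data back to the host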

I guess if I were to run the checker on a different stream, it might observe the device array before the kernel has updated it.

Is there an abstraction of streams in KA and Metal like in CUDA? Sequences of operations (such as kernel launches, memory copies, and synchronization commands) submitted to the same stream execute in order.

How do I submit the !all custom kernel to a different stream in KA or in Metal?

Use a different Julia task.

CUDA.jl automatically synchronizes when switching tasks, so you won’t catch it.
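
For completeness, a sketch (reusing a, c, kernel, n from the test function above) of what the task-based attempt would look like under CUDA.jl, and why it still won’t show the race:

kernel(a, ndrange=n)               # launched on the main task's stream

# Each Julia task gets its own stream in CUDA.jl, so the check runs on a
# different stream...
t = Threads.@spawn !all(a .== c)

# ...but CUDA.jl synchronizes when an array last used on another task is
# accessed, so the check still sees the completed kernel.
fetch(t) && @warn "Failure!"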