How to benchmark a function that uses KernelAbstractions kernels?

What is the correct way to benchmark a function that uses KernelAbstractions.jl (KA) kernels?

I noticed that KA/CUDA, i.e. KA with the CUDA.jl backend, requires an explicit KernelAbstractions.synchronize(dev) call.

Consider this

  @btime begin
    mykernel!($b)
    KernelAbstractions.synchronize($dev)
  end setup = (copyto!($b, $a0))
  @assert Array(b) == golden(a0)

where a0 is a host array, golden is the host function that performs the reference computation, b is a device array, and mykernel! is the device kernel attempting to do what golden does, but faster.
Is this the proper way to use BenchmarkTools.jl to measure time?

With KA/Metal, the synchronize call seems to add too much time, and without it the timings matched what I expected. I did not manage to demonstrate the need for it by breaking correctness when removing it :slight_smile:

For instance, this did not fail

using CUDA, KernelAbstractions

@kernel inbounds=true function addone!(a)
    i = @index(Global, Linear)
    a[i] += 1
end

function test(n)
    a = CuArray(rand(Float32, n, 1))
    c = copy(a) .+ 1          # the value a should have after the next kernel call
    dev = get_backend(a)

    kernel = addone!(dev, 1024)

    for i = 1:100000
        kernel(a, ndrange=n)  # asynchronous launch, no explicit synchronize
        if !all(a .== c)
            @warn "Failure!"
        end
        c .+= 1
    end
end

test(2^25)

The main idea was that device array c holds the value that device array a will only have once each kernel invocation has completed, so checking too early should fail. I ran it 100K times and saw no failure. Why?

Maybe the .== is also launching a kernel that makes sure everything is synchronized before and after it’s called? You could try writing your own reduction kernel for the !all(a .== c) check and see whether that changes things.
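
For what it’s worth, such a hand-written check might look like the sketch below (the check! kernel and the one-element flag array are hypothetical names, not something from the thread). It bypasses the mapreduce path, although reading the flag back to the host still synchronizes, and a kernel submitted to the same stream is in any case ordered after the kernel it checks.

@kernel inbounds=true function check!(flag, a, c)
    i = @index(Global, Linear)
    if a[i] != c[i]
        flag[1] = true    # every writer stores the same value
    end
end

# usage sketch, continuing from the test setup above:
# flag = CUDA.fill(false, 1)
# check!(dev, 1024)(flag, a, c, ndrange = length(a))
# Array(flag)[1] && @warn "Failure!"   # copying the flag back synchronizes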

You always need to synchronize, regardless of the back-end. Semantically, kernel launches are asynchronous, so without synchronization you’re benchmarking launch time instead of execution time. If this introduces performance problems with Metal.jl, please file an issue there.
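
To make that concrete, here is a minimal sketch along the lines of the benchmark from the first post (same names: mykernel!, b, a0, dev), comparing the two measurements:

using BenchmarkTools, KernelAbstractions

# times only the asynchronous launch
t_launch = @belapsed mykernel!($b) setup = (copyto!($b, $a0))

# times launch plus actual execution, which is usually what you want
t_exec = @belapsed begin
    mykernel!($b)
    KernelAbstractions.synchronize($dev)
end setup = (copyto!($b, $a0))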

With respect to your second example: all calls mapreduce, which automatically synchronizes. Generally, you won’t be able to observe the asynchronous nature of GPU operations when using array abstractions, as they ought to automatically synchronize when needed.
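
In other words, anything that has to produce a host-side value ends up waiting for the pending kernels, e.g. (a sketch reusing the names from test):

kernel(a, ndrange=n)   # asynchronous launch
ok = all(a .== c)      # reduces to a host Bool, so it waits for the work on a
s  = sum(a)            # likewise for any scalar reduction
h  = Array(a)          # and for copying data back to the host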

I guess if I were to run the checker on a different stream, it might observe the device array before the kernel has updated it.

Is there an abstraction of streams in KA and Metal like in CUDA? Sequences of operations (such as kernel launches, memory copies, and synchronization commands) submitted to the same stream execute in order.

How do I submit the !all custom kernel to a different stream in KA or in Metal?

Use a different Julia task.

CUDA.jl automatically synchronizes when switching tasks, so you won’t catch it.
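
For completeness, a sketch (reusing a, c, kernel, n from the test function above) of what the task-based attempt would look like under CUDA.jl, and why it still won’t show the race:

kernel(a, ndrange=n)               # launched on the main task's stream

# Each Julia task gets its own stream in CUDA.jl, so the check runs on a
# different stream...
t = Threads.@spawn !all(a .== c)

# ...but CUDA.jl synchronizes when an array last used on another task is
# accessed, so the check still sees the completed kernel.
fetch(t) && @warn "Failure!"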