What is the correct way to benchmark a function that uses KernelAbstractions.jl (KA) kernels? I noticed that KA/CUDA, i.e. KA with the CUDA.jl backend, requires a `KernelAbstractions.synchronize(dev)` call before the timings are meaningful.
Consider this:

```julia
@btime begin
    mykernel!($b)
    KernelAbstractions.synchronize($dev)
end setup = (copyto!($b, $a0))
@assert Array(b) == golden(a0)
```
where `a0` is a host array, `golden` the host function that performs a computation, `b` a device array, and `mykernel!` the device code attempting to do what `golden` does, but faster.
Is this the proper way to use BenchmarkTools.jl
to measure time?
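For completeness, here is a self-contained sketch of that pattern on the CPU backend, so it runs without a GPU. `scale2!` and `golden` are placeholder names standing in for `mykernel!` and the reference function, and it assumes KernelAbstractions.jl and BenchmarkTools.jl are installed. `evals = 1` makes the setup re-copy `b` before every evaluation, which matters because the kernel mutates it:

```julia
using BenchmarkTools, KernelAbstractions

# Placeholder device kernel: doubles every element in place.
@kernel inbounds=true function scale2!(b)
    i = @index(Global, Linear)
    b[i] *= 2
end

golden(a) = 2 .* a            # host reference implementation

a0   = rand(Float32, 2^20)    # host input
b    = similar(a0)            # "device" array (a plain Array on the CPU backend)
dev  = get_backend(b)
kern = scale2!(dev, 256)

@btime begin
    $kern($b, ndrange = length($b))
    KernelAbstractions.synchronize($dev)   # wait for the kernel to finish
end setup = (copyto!($b, $a0)) evals = 1   # fresh input for every evaluation
@assert Array(b) == golden(a0)
```

The same skeleton should carry over to the CUDA.jl or Metal.jl backends by constructing `b` as a device array instead of with `similar(a0)`.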
With KA/Metal, the `KernelAbstractions.synchronize(dev)` call seems to add too much time, while without it the timings matched what I expected. I also did not manage to demonstrate that the call is needed by removing it and breaking correctness. For instance, this did not fail:
```julia
using CUDA, KernelAbstractions

@kernel inbounds=true function addone!(a)
    i = @index(Global, Linear)
    a[i] += 1
end

function test(n)
    a = CuArray(rand(Float32, n, 1))
    c = copy(a) .+ 1
    dev = get_backend(a)
    kernel = addone!(dev, 1024)
    for i = 1:100000
        kernel(a, ndrange=n)
        if !all(a .== c)
            @warn "Failure!"
        end
        c .+= 1
    end
end

test(2^25)
```
The main idea was that device array `c` holds the value that device array `a` will only have after each kernel invocation has completed. I ran it 100,000 times and saw no failure. Why?
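For comparison, the variant with the `synchronize` call restored looks like this. It is a sketch on the CPU backend (assuming KernelAbstractions.jl is installed), with `test_sync` as a hypothetical name and the sizes shrunk so it finishes quickly; it returns whether every check passed instead of only warning:

```julia
using KernelAbstractions

@kernel inbounds=true function addone!(a)
    i = @index(Global, Linear)
    a[i] += 1
end

function test_sync(n, iters)
    a = rand(Float32, n)
    c = copy(a) .+ 1
    dev = get_backend(a)
    kernel = addone!(dev, 64)
    ok = true
    for i = 1:iters
        kernel(a, ndrange = n)
        KernelAbstractions.synchronize(dev)  # guarantee the kernel has finished
        if !all(a .== c)
            @warn "Failure!"
            ok = false
        end
        c .+= 1
    end
    return ok
end

test_sync(2^10, 100)
```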