Different results with different macros for profiling

Hi everyone,

I’m relatively new to kernel programming with CUDA.jl and am currently implementing a kernel that computes sliding-window attention.

While profiling the code, I observed that the runtime varies depending on the macro used:

CUDA.@profile CUDA.@sync AttnGPU.sliding_window_attn!(
    attn.window_attn, Q, K, mask, idx_matrix,
    embed, seq_len, batch_size, w, kernels_window[1]
)

Total time: 346.58 µs

CUDA.@bprofile CUDA.@sync AttnGPU.sliding_window_attn!(
    attn.window_attn, Q, K, mask, idx_matrix,
    embed, seq_len, batch_size, w, kernels_window[1]
)

Time distribution: 59.37 µs ± 32.58 ( 44.32 ‥ 149.44)

time = fill(0.0, 1000)
for i in 1:1000
    time[i] = CUDA.@elapsed CUDA.@sync AttnGPU.sliding_window_attn!(
        attn.window_attn, Q, K, mask, idx_matrix,
        embed, seq_len, batch_size, w, kernels_window[1]
    )
end
println("Average time in μs: ", sum(time) * 1e6 / size(time, 1))

Mean time: 180.53 µs

I’m unsure whether these differences arise from an error on my part or from other factors at play. Additionally, I’m curious which method provides the most reliable estimate of the kernel’s execution time.

Thanks for your insights!

Hi,

CUDA.@profile will execute your code only once, so if you run a small code block, you will get a poor (high-variance) estimate of the expected run time. Additionally, I think there’s also some ‘warm-up’ involved (apart from compilation), meaning your code will run slower if it only runs once. You could get around this by profiling a loop that repeats your code, but at that point you’re essentially doing what CUDA.@bprofile does for you.

Without access to the AttnGPU.sliding_window_attn! implementation, I’ll use

using CUDA
x = CUDA.rand(Float32, 10^8)
y = similar(x)
y .= x .* 2

as an example. With CUDA.@profile y .= x .* 2, I get anything between 2 and 100 ms, but typically around 80 ms. By contrast,

CUDA.@profile begin
    for _ = 1:1000
        CUDA.@sync y .= x .* 2
    end
end

gives me fairly consistent values, e.g. 2.25 ms ± 4.02 ( 1.99 ‥ 86.06) device-side. Using CUDA.@bprofile y .= x .* 2, I get similar values, e.g. 2.17 ms ± 2.64 ( 1.99 ‥ 72.2).

(By the way, note that CUDA.@time, CUDA.@elapsed, CUDA.@profile and CUDA.@bprofile already synchronize(), so you don’t have to do this manually; you can check this yourself using e.g. @less CUDA.@elapsed y .= x .* 2 or using MacroTools; @macroexpand CUDA.@elapsed y .= x .* 2. In the loop example, however, we want to measure the time of the individual iterations, not of the entire loop, so we have to synchronise after each step, not just once at the end.)
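
For instance, a quick way to inspect this is Base’s @macroexpand (just a sketch; the exact expansion depends on your CUDA.jl version):

using CUDA

# expand the macro without running it; the expansion should show the
# event recording and synchronization that produce the elapsed time
@macroexpand CUDA.@elapsed y .= x .* 2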


I cannot replicate the difference with the manual CUDA.@elapsed loop:

using Statistics
mean([CUDA.@elapsed y .= x .* 2 for _ = 1:1000])

gives me around 2.2 ms. Also

using BenchmarkTools
@benchmark CUDA.@sync y .= x .* 2

gives me similar (total, device + host) times, e.g. Time (mean ± σ): 2.141 ms ± 55.957 μs.


I’m not sure what the exact cause of your problem would be then, but it seems likely to me that it is specific to your AttnGPU.sliding_window_attn!. Still, you might want to check whether you get consistent profiling results for something simple like y .= x .* 2, to rule out a larger problem somehow specific to your machine (been there :slight_smile: ).
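
Such a check could look something like the following (a sketch reusing the toy example from above; it just profiles the trivial broadcast a few times so you can compare the reported device-side totals):

using CUDA

x = CUDA.rand(Float32, 10^8)
y = similar(x)
y .= x .* 2                      # warm-up, so compilation doesn’t skew the first run

# profile the same trivial kernel a few times; the reported device-side
# totals should be roughly consistent across runs
for _ in 1:5
    display(CUDA.@profile y .= x .* 2)
end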

Hi eldee,

thanks for your help and your comments :grinning:.

I repeated the execution with CUDA.@profile around 100 times to check for consistency. The results for AttnGPU.sliding_window_attn! were all between 330 and 370 µs.
I will try profiling the code with a simpler function when I am back. Currently, I only have my notebook with me and it does not have a GPU.

Another possible explanation is the overhead imposed by profiling. It’s normally very small, but depending on your kernel’s characteristics it can matter (refer to the CUPTI docs for more details, e.g., 2. Usage — Cupti 12.6 documentation).

You can try the serial profiling mode instead by doing CUDA.@profile concurrent=false ...
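
Applied to your call from the first snippet, that would look something like this (CUDA.@profile already synchronizes, so the explicit CUDA.@sync can be dropped):

# serial (non-concurrent) profiling of the same kernel launch
CUDA.@profile concurrent=false AttnGPU.sliding_window_attn!(
    attn.window_attn, Q, K, mask, idx_matrix,
    embed, seq_len, batch_size, w, kernels_window[1]
)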