Hi everyone,
I’m relatively new to kernel programming with CUDA.jl and am currently implementing a kernel to compute attention using a sliding-window attention mechanism.
While profiling the code, I noticed that the reported runtime varies considerably depending on which macro I use:
```julia
CUDA.@profile CUDA.@sync AttnGPU.sliding_window_attn!(
    attn.window_attn, Q, K, mask, idx_matrix,
    embed, seq_len, batch_size, w, kernels_window[1]
)
```

Total time: 346.58 µs
```julia
CUDA.@bprofile CUDA.@sync AttnGPU.sliding_window_attn!(
    attn.window_attn, Q, K, mask, idx_matrix,
    embed, seq_len, batch_size, w, kernels_window[1]
)
```

Time distribution: 59.37 µs ± 32.58 (44.32 ‥ 149.44)
```julia
# Renamed `time` to `times` to avoid shadowing Base.time
times = fill(0.0, 1000)
for i in 1:1000
    times[i] = CUDA.@elapsed CUDA.@sync AttnGPU.sliding_window_attn!(
        attn.window_attn, Q, K, mask, idx_matrix,
        embed, seq_len, batch_size, w, kernels_window[1]
    )
end
println("Average time in μs: ", sum(times) * 1e6 / length(times))
```

Mean time: 180.53 µs
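For completeness, here is the variant of my timing loop I also tried, which performs one warm-up call first (on the assumption that the first launch may include kernel compilation and should be excluded) and reports mean ± standard deviation so the numbers are directly comparable to the `CUDA.@bprofile` distribution. `Statistics` is from the standard library; everything else is the same as above.

```julia
using CUDA, Statistics

# Warm-up: run the kernel once so any one-time compilation cost
# is not counted in the measurements below.
CUDA.@sync AttnGPU.sliding_window_attn!(
    attn.window_attn, Q, K, mask, idx_matrix,
    embed, seq_len, batch_size, w, kernels_window[1]
)

times = Vector{Float64}(undef, 1000)
for i in eachindex(times)
    times[i] = CUDA.@elapsed CUDA.@sync AttnGPU.sliding_window_attn!(
        attn.window_attn, Q, K, mask, idx_matrix,
        embed, seq_len, batch_size, w, kernels_window[1]
    )
end

# Report mean ± std in microseconds, matching the @bprofile format.
println("Time in μs: ", mean(times) * 1e6, " ± ", std(times) * 1e6)
```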
I’m unsure whether these differences stem from an error on my part or from other factors. I’m also curious which method gives the most reliable estimate of the kernel’s execution time.
Thanks for your insights!