CUDA.jl bimodal SpMM performance

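You’re forgetting to synchronize. GPU operations in CUDA.jl are asynchronous, so without CUDA.@sync the benchmark mostly times the launch of the operation rather than its execution. Use @benchmark CUDA.@sync mul!(...) instead; this probably explains the behavior. If not, try running repeatedly under Nsight Systems, wrapping each iteration in an NVTX range (i.e. @benchmark NVTX.@range "mul!" CUDA.@sync mul!(...)) so that slow and fast iterations can be told apart on the timeline.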
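For reference, here is a rough sketch of what the three variants could look like side by side. The matrix sizes, density, element type and the CSR format are my own assumptions for illustration, not taken from your setup, and I'm assuming the standalone NVTX.jl package for the range macro.

```julia
using CUDA, CUDA.CUSPARSE          # GPU arrays and sparse GPU matrix types
using SparseArrays, LinearAlgebra  # sprand and mul!
using BenchmarkTools
using NVTX                         # assumed: the standalone NVTX.jl package

# Hypothetical problem size, chosen only for illustration.
n, k = 10_000, 64

if CUDA.functional()
    A = CuSparseMatrixCSR(sprand(Float32, n, n, 0.001f0))  # sparse matrix on the GPU
    B = CUDA.rand(Float32, n, k)                           # dense right-hand side
    C = CUDA.zeros(Float32, n, k)                          # preallocated output

    # Unsynchronized: mul! returns once the work is queued, so this
    # does not reliably measure how long the multiplication takes.
    t_async = @benchmark mul!($C, $A, $B)

    # Synchronized: each sample waits for the GPU to finish.
    t_sync = @benchmark CUDA.@sync mul!($C, $A, $B)

    # Same, with every iteration wrapped in an NVTX range named "mul!"
    # so it shows up as a labeled block in an Nsight Systems timeline.
    t_nvtx = @benchmark NVTX.@range "mul!" CUDA.@sync mul!($C, $A, $B)

    display(t_async); display(t_sync)
end
```

To get the timeline, run the script under the profiler, e.g. nsys profile julia script.jl, and look at how the "mul!" ranges differ between fast and slow iterations.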
