CUDA.jl bimodal SpMM performance

I'm new to CUDA.jl and am running a few benchmarks to understand its performance on sparse matrices. I noticed that SpMM (sparse matrix times dense matrix) produces a bimodal timing histogram when benchmarked:

using CUDA
using CUDA.CUSPARSE
using BenchmarkTools
using SparseArrays
using LinearAlgebra: mul!

N, M = 10_000, 20_000
batch_size = 8
T = Float32
A = CuSparseMatrixCSR(sprand(T, N, M, 0.001))
rhs = CUDA.ones(T, (M, batch_size))
out = CUDA.zeros(T, (N, batch_size))
@benchmark mul!(out, A, rhs)

This yields:

BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):   6.392 μs …  12.858 ms  ┊ GC (min … max): 0.00% … 52.67%
 Time  (median):     21.825 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.199 μs ± 128.418 μs  ┊ GC (mean ± σ):  3.05% ±  0.53%

   ▄▁                                            ▂▃▄▅▅▆█▅▄▂    ▂
  ▄███▇▆▅▄▅▆▇▄▅▅▅▄▃▅▄▃▁▃▁▃▄▄▁▃▃▁▃▄▄▅▅▅▆▅▇▇▄▆▆▆▇▇███████████▆▆▆ █
  6.39 μs       Histogram: log(frequency) by time      23.8 μs <

 Memory estimate: 720 bytes, allocs estimate: 40.

The median and mean times are roughly 3x the minimum. Is this just a quirk of GPU memory layouts? Is there a way to get the faster performance more reliably?

If I set batch_size = 1, then the results are not as spread out:

BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.232 μs … 45.890 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.534 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.365 μs ±  2.662 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▆█▇▇▄▂▁▁ ▁                   ▁ ▂▁▁▁                     ▂
  ▄▃███████████▇█▇▇████████▇▇▆▆▆▆███████▇▆▆▅▄▅▄▄▂▅▃▄▃▄▄▅▂▄▄▄ █
  6.23 μs      Histogram: log(frequency) by time     20.3 μs <

 Memory estimate: 720 bytes, allocs estimate: 40.
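For reference, the batch_size = 1 case presumably just re-creates the dense buffers with a single column, along these lines:

batch_size = 1
rhs = CUDA.ones(T, (M, batch_size))   # single-column dense right-hand side
out = CUDA.zeros(T, (N, batch_size))  # matching output buffer
@benchmark mul!(out, A, rhs)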

My setup:

julia> CUDA.versioninfo()
CUDA toolkit 11.6, artifact installation
NVIDIA driver 470.103.1, for CUDA 11.4
CUDA driver 11.4

Libraries: 
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+470.103.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA RTX A6000 (sm_86, 47.173 GiB / 47.544 GiB available)

You’re forgetting to synchronize (use @benchmark CUDA.@sync mul!(out, A, rhs)); this probably explains the behavior. If not, try running it repeatedly under Nsight Systems, wrapping each iteration in an NVTX range (i.e. @benchmark NVTX.@range "mul!" CUDA.@sync mul!(out, A, rhs)).
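For concreteness, a minimal sketch of that synchronized benchmark, reusing out, A, and rhs from the snippet above (the NVTX variant assumes NVTX.@range is available, e.g. via the NVTX.jl package or the NVTX module bundled with CUDA.jl):

using BenchmarkTools, CUDA
using LinearAlgebra: mul!
import NVTX  # adjust to however NVTX is provided in your setup

# Block until the GPU has finished, so BenchmarkTools measures kernel
# execution rather than just the asynchronous launch overhead.
@benchmark CUDA.@sync mul!(out, A, rhs)

# When profiling under Nsight Systems (e.g. nsys profile julia script.jl),
# wrap each iteration in an NVTX range so individual mul! calls show up
# as labelled spans on the timeline.
@benchmark NVTX.@range "mul!" CUDA.@sync mul!(out, A, rhs)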
