New to CUDA.jl here; I'm running a few benchmarks to understand CUDA's performance on sparse matrices. I noticed that SpMM (sparse matrix times dense matrix) produces a bimodal timing histogram when benchmarked:
using CUDA
using CUDA.CUSPARSE
using BenchmarkTools
using SparseArrays
using LinearAlgebra: mul!
N, M = 10_000, 20_000
batch_size = 8
T = Float32
A = CuSparseMatrixCSR(sprand(T, N, M, 0.001))  # N×M sparse matrix, ~0.1% nonzeros, CSR format on the GPU
rhs = CUDA.ones(T, (M, batch_size))            # dense right-hand side with batch_size columns
out = CUDA.zeros(T, (N, batch_size))           # preallocated output
@benchmark mul!(out, A, rhs)
This yields:
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max): 6.392 μs … 12.858 ms ┊ GC (min … max): 0.00% … 52.67%
Time (median): 21.825 μs ┊ GC (median): 0.00%
Time (mean ± σ): 22.199 μs ± 128.418 μs ┊ GC (mean ± σ): 3.05% ± 0.53%
▄▁ ▂▃▄▅▅▆█▅▄▂ ▂
▄███▇▆▅▄▅▆▇▄▅▅▅▄▃▅▄▃▁▃▁▃▄▄▁▃▃▁▃▄▄▅▅▅▆▅▇▇▄▆▆▆▇▇███████████▆▆▆ █
6.39 μs Histogram: log(frequency) by time 23.8 μs <
Memory estimate: 720 bytes, allocs estimate: 40.
The median and mean times are roughly 3x the minimum. Is this just a quirk of GPU memory layout, or is there a way to get the faster performance more reliably?
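In case asynchronous kernel launches are part of the spread, a synchronized variant of the benchmark is what I'd try next (just a sketch; I'm assuming CUDA.@sync plus BenchmarkTools' $-interpolation is the right way to time the whole kernel):
@benchmark CUDA.@sync mul!($out, $A, $rhs)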
If I set batch_size = 1, then the results are not as spread out:
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
Range (min … max): 6.232 μs … 45.890 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.534 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.365 μs ± 2.662 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▆█▇▇▄▂▁▁ ▁ ▁ ▂▁▁▁ ▂
▄▃███████████▇█▇▇████████▇▇▆▆▆▆███████▇▆▆▅▄▅▄▄▂▅▃▄▃▄▄▅▂▄▄▄ █
6.23 μs Histogram: log(frequency) by time 20.3 μs <
Memory estimate: 720 bytes, allocs estimate: 40.
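For reference, the batch_size = 1 case just rebuilds the dense arrays with a single column (same A; the rhs1/out1 names here are only for illustration):
batch_size = 1
rhs1 = CUDA.ones(T, (M, batch_size))   # single right-hand-side column
out1 = CUDA.zeros(T, (N, batch_size))
@benchmark mul!(out1, A, rhs1)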
My setup:
julia> CUDA.versioninfo()
CUDA toolkit 11.6, artifact installation
NVIDIA driver 470.103.1, for CUDA 11.4
CUDA driver 11.4
Libraries:
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+470.103.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)
Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
1 device:
0: NVIDIA RTX A6000 (sm_86, 47.173 GiB / 47.544 GiB available)