Confusing performance of LinearAlgebra.mul! for Float64

Hi, I was benchmarking the performance of LinearAlgebra.mul! and got a result that I find quite confusing.

julia> using CUDA, LinearAlgebra, BenchmarkTools

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, icelake-server)
  Threads: 1 on 128 virtual cores
Environment:
  LD_LIBRARY_PATH = /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib:
  JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia

julia> CUDA.versioninfo()
CUDA runtime 12.3, artifact installation
CUDA driver 12.1
NVIDIA driver 530.30.2

CUDA libraries:
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+530.30.2

Julia packages:
- CUDA: 5.1.1
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

6 devices:
  0: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
  1: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
  2: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
  3: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
  4: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
  5: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)

julia> a = CUDA.rand(Float32, 4096, 4096);

julia> b = CUDA.rand(Float32, 4096, 4096);

julia> c = CUDA.zeros(Float32, 4096, 4096);

julia> @benchmark CUDA.@sync mul!($(c), $(a), $(b))
BenchmarkTools.Trial: 650 samples with 1 evaluation.
 Range (min … max):  7.261 ms …  7.919 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.658 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.693 ms ± 69.844 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    █▆ ▄    ▇  ▁    ▁
  ▅▄▅▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▆▁▁▁▁▁▄▁▁▁▁▁▁▁██▆█▇▄▇▇█▇▅█▁▆▄▄█▁▄▄█▁▁▄ ▇
  7.26 ms      Histogram: log(frequency) by time     7.91 ms <

 Memory estimate: 896 bytes, allocs estimate: 39.

julia> a = CUDA.rand(Float64, 4096, 4096);

julia> b = CUDA.rand(Float64, 4096, 4096);

julia> c = CUDA.zeros(Float64, 4096, 4096);

julia> @benchmark CUDA.@sync mul!($(c), $(a), $(b))
BenchmarkTools.Trial: 583 samples with 1 evaluation.
 Range (min … max):  7.380 ms …   9.339 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.568 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.588 ms ± 148.182 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                             █  ▃▆  ▃▄
  ▄▄▁▁▄▁▁▁▁▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▄▄▁▁█▇▆██▅▆██▄▁▁█▁▁▄ ▆
  7.38 ms      Histogram: log(frequency) by time      8.93 ms <

 Memory estimate: 896 bytes, allocs estimate: 39.

julia> 4096^3 * 2 / 8.588 / 1e9
16.00360427014439

Here I am using an A800 GPU, whose peak double-precision (FP64) performance is 9.7 TFLOPS, but the benchmark works out to about 16 TFLOPS, which is similar to the single-precision result. Why?
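
For reference, here is that arithmetic made explicit (gemm_tflops is just a throwaway name for this post, not a library function): an n×n times n×n GEMM performs 2n^3 floating-point operations, so dividing by the runtime in seconds and by 1e12 gives TFLOPS.

# Throwaway helper: effective TFLOPS of an n×n GEMM given its runtime in seconds.
# 2n^3 FLOPs: one multiply and one add for each of the n^3 accumulated terms.
gemm_tflops(n, t_seconds) = 2 * n^3 / t_seconds / 1e12

gemm_tflops(4096, 7.693e-3)   # Float32 mean time above: ≈ 17.9 TFLOPS
gemm_tflops(4096, 8.588e-3)   # Float64 mean time above: ≈ 16.0 TFLOPS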


Dividing by 1e9 gives GFLOPS; for TFLOPS you need to divide by 1e12.

Thanks, but the time here is in ms, so a factor of 1e3 cancels out.

Oops, you’re right.
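
Spelling out the cancellation, with the time converted to seconds (plain arithmetic, same numbers as above):

flops = 2 * 4096^3        # total FLOPs of the 4096×4096 GEMM
flops / 8.588 / 1e9       # time in ms: the extra 1e3 folds the 1e9 into an effective 1e12
flops / 8.588e-3 / 1e12   # time in s with 1e12: identical, ≈ 16.0 TFLOPS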

As discussed on Slack: the A800 has 3rd-generation Tensor Cores, which can do FP64 × FP64 = FP64 matrix multiplication at a peak of 19.5 TFLOPS, twice the 9.7 TFLOPS of the regular FP64 units.
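
One way to double-check that the Tensor Cores are responsible (a sketch, assuming CUDA.jl's documented math_mode! switch steers CUBLAS away from Tensor Core paths; I haven't timed this myself) is to rerun the Float64 benchmark under pedantic math, which should bring the rate back down toward the 9.7 TFLOPS of the regular FP64 units:

using CUDA, LinearAlgebra, BenchmarkTools

a = CUDA.rand(Float64, 4096, 4096);
b = CUDA.rand(Float64, 4096, 4096);
c = CUDA.zeros(Float64, 4096, 4096);

CUDA.math_mode!(CUDA.PEDANTIC_MATH)   # request strict math: CUBLAS should avoid Tensor Cores
@benchmark CUDA.@sync mul!($c, $a, $b)

CUDA.math_mode!(CUDA.DEFAULT_MATH)    # restore the default math mode afterwards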
