Confusing performance of LinearAlgebra.mul! for Float64

Hi, I was benchmarking the performance of LinearAlgebra.mul! and got something quite confusing.

julia> using CUDA, LinearAlgebra, BenchmarkTools

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 Γ— Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, icelake-server)
  Threads: 1 on 128 virtual cores
Environment:
  LD_LIBRARY_PATH = /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib:
  JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia

julia> CUDA.versioninfo()
CUDA runtime 12.3, artifact installation
CUDA driver 12.1
NVIDIA driver 530.30.2

CUDA libraries:
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+530.30.2

Julia packages:
- CUDA: 5.1.1
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

6 devices:
  0: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
  1: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
  2: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
  3: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
  4: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
  5: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)

julia> a = CUDA.rand(Float32, 4096, 4096);

julia> b = CUDA.rand(Float32, 4096, 4096);

julia> c = CUDA.zeros(Float32, 4096, 4096);

julia> @benchmark CUDA.@sync mul!($(c), $(a), $(b))
BenchmarkTools.Trial: 650 samples with 1 evaluation.
 Range (min … max):  7.261 ms …  7.919 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     7.658 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   7.693 ms Β± 69.844 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                    β–ˆβ–† β–„    β–‡  ▁    ▁
  β–…β–„β–…β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–†β–β–β–β–β–β–„β–β–β–β–β–β–β–β–ˆβ–ˆβ–†β–ˆβ–‡β–„β–‡β–‡β–ˆβ–‡β–…β–ˆβ–β–†β–„β–„β–ˆβ–β–„β–„β–ˆβ–β–β–„ β–‡
  7.26 ms      Histogram: log(frequency) by time     7.91 ms <

 Memory estimate: 896 bytes, allocs estimate: 39.

julia> a = CUDA.rand(Float64, 4096, 4096);

julia> b = CUDA.rand(Float64, 4096, 4096);

julia> c = CUDA.zeros(Float64, 4096, 4096);

julia> @benchmark CUDA.@sync mul!($(c), $(a), $(b))
BenchmarkTools.Trial: 583 samples with 1 evaluation.
 Range (min … max):  7.380 ms …   9.339 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     8.568 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   8.588 ms Β± 148.182 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                             β–ˆ  β–ƒβ–†  β–ƒβ–„
  β–„β–„β–β–β–„β–β–β–β–β–β–„β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–„β–„β–β–β–ˆβ–‡β–†β–ˆβ–ˆβ–…β–†β–ˆβ–ˆβ–„β–β–β–ˆβ–β–β–„ β–†
  7.38 ms      Histogram: log(frequency) by time      8.93 ms <

 Memory estimate: 896 bytes, allocs estimate: 39.

julia> 4096^3 * 2 / 8.588 / 1e9
16.00360427014439

Here I am using an A800 GPU with a published double-precision peak of 9.7 TFLOPS, but the benchmark works out to about 16 TFLOPS, which is close to what I measure for single precision. Why?
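For reference, here is the same throughput arithmetic with explicit units (a dense GEMM on n×n matrices performs 2n³ floating-point operations; the runtime is the mean from the Float64 benchmark above):

```julia
n    = 4096
flop = 2 * n^3          # total floating-point operations ≈ 1.37e11
t    = 8.588e-3         # mean runtime in seconds (8.588 ms)

tflops = flop / t / 1e12
# ≈ 16.0 TFLOPS
```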


1e9 is for GFLOPS, 1e12 is for TFLOPS.

Thanks, but the time here is in ms, so a factor of 1e3 cancels out and the result is already in TFLOPS.

Oops, you’re right.

As discussed on Slack: the A800 has 3rd-generation (Ampere) tensor cores, which support FP64 × FP64 → FP64 matrix multiply-accumulate. cuBLAS uses them for double-precision GEMM, so the achievable throughput is roughly double the 9.7 TFLOPS peak of the regular FP64 units.
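As a rough sanity check, assuming the A800 matches the A100's published peaks (9.7 TFLOPS FP64 on the regular CUDA cores, 19.5 TFLOPS FP64 via the tensor cores), the measured 16 TFLOPS is only plausible with the tensor-core path:

```julia
peak_fp64_cuda   = 9.7    # TFLOPS, regular FP64 units (published A100 spec)
peak_fp64_tensor = 19.5   # TFLOPS, FP64 tensor cores (published A100 spec)
achieved         = 16.0   # TFLOPS, measured above

achieved / peak_fp64_tensor  # ≈ 0.82, a plausible GEMM efficiency
achieved / peak_fp64_cuda    # ≈ 1.65, impossible without tensor cores
```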