Hi, I was benchmarking the performance of LinearAlgebra.mul!
and got something quite confusing.
julia> using CUDA, LinearAlgebra, BenchmarkTools
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 128 × Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, icelake-server)
Threads: 1 on 128 virtual cores
Environment:
LD_LIBRARY_PATH = /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/lib:
JULIA_PKG_SERVER = https://mirrors.bfsu.edu.cn/julia
julia> CUDA.versioninfo()
CUDA runtime 12.3, artifact installation
CUDA driver 12.1
NVIDIA driver 530.30.2
CUDA libraries:
- CUBLAS: 12.3.4
- CURAND: 10.3.4
- CUFFT: 11.0.12
- CUSOLVER: 11.5.4
- CUSPARSE: 12.2.0
- CUPTI: 21.0.0
- NVML: 12.0.0+530.30.2
Julia packages:
- CUDA: 5.1.1
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.1+0
Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7
6 devices:
0: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
1: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
2: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
3: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
4: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
5: NVIDIA A800 80GB PCIe (sm_80, 79.189 GiB / 80.000 GiB available)
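(Side note: with six devices visible, the benchmark simply runs on whichever GPU is currently active. If it matters for reproducing this, one can pin it explicitly; the choice of device 0 below is just an example:

CUDA.device!(0)   # select GPU 0 for subsequent allocations and kernels
CUDA.device()     # check which device is current
)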
julia> a = CUDA.rand(Float32, 4096, 4096);
julia> b = CUDA.rand(Float32, 4096, 4096);
julia> c = CUDA.zeros(Float32, 4096, 4096);
julia> @benchmark CUDA.@sync mul!($(c), $(a), $(b))
BenchmarkTools.Trial: 650 samples with 1 evaluation.
Range (min … max):  7.261 ms … 7.919 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     7.658 ms             ┊ GC (median):    0.00%
Time  (mean ± σ):   7.693 ms ± 69.844 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
[histogram omitted]  7.26 ms   Histogram: log(frequency) by time   7.91 ms <
Memory estimate: 896 bytes, allocs estimate: 39.
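For reference, the same back-of-the-envelope throughput estimate applied to this Float32 run (my own arithmetic from the median time above, not REPL output; the time is in ms, so dividing by 1e9 gives TFLOPS):

4096^3 * 2 / 7.658 / 1e9   # ≈ 17.9, i.e. roughly 17.9 TFLOPS in single precision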
julia> a = CUDA.rand(Float64, 4096, 4096);
julia> b = CUDA.rand(Float64, 4096, 4096);
julia> c = CUDA.zeros(Float64, 4096, 4096);
julia> @benchmark CUDA.@sync mul!($(c), $(a), $(b))
BenchmarkTools.Trial: 583 samples with 1 evaluation.
Range (min … max):  7.380 ms … 9.339 ms   ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     8.568 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   8.588 ms ± 148.182 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
[histogram omitted]  7.38 ms   Histogram: log(frequency) by time   8.93 ms <
Memory estimate: 896 bytes, allocs estimate: 39.
julia> 4096^3 * 2 / 8.588 / 1e9
16.00360427014439
Here I am using an A800 GPU, whose listed peak double-precision performance is 9.7 TFLOPS, yet the benchmark above works out to about 16 TFLOPS, which is close to the single-precision result. Why is that?
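For completeness, here is a small self-contained sketch of the measurement wrapped in a helper function (gemm_tflops is just a name I made up for this post, not a CUDA.jl API; it repeats the calculation above, except that @belapsed reports seconds, so the divisor becomes 1e12):

using CUDA, LinearAlgebra, BenchmarkTools

# Rough GEMM throughput estimate in TFLOPS for an n×n matrix of element type T.
# gemm_tflops is a hypothetical helper written only for illustration.
function gemm_tflops(::Type{T}, n::Integer) where {T}
    a = CUDA.rand(T, n, n)
    b = CUDA.rand(T, n, n)
    c = CUDA.zeros(T, n, n)
    t = @belapsed CUDA.@sync mul!($c, $a, $b)  # minimum time over samples, in seconds
    flops = 2 * n^3                            # one multiply and one add per inner-product term
    return flops / t / 1e12
end

# gemm_tflops(Float32, 4096)
# gemm_tflops(Float64, 4096)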