While running code that is mostly based on CUDA.jl and consists mainly of (sparse or dense) linear algebra operations, I noticed that it runs at almost the same speed on an RTX 2080Ti and a V100, and much faster on both than on an A100. This seems quite counter-intuitive to me.
To dig into the problem a bit, I tried a very simple benchmark:
a = CUDA.rand(1024, 1024,1024);
@benchmark CUDA.@sync $a .= sin.($a)
For A100, I get:
BenchmarkTools.Trial:
memory estimate: 576 bytes
allocs estimate: 24
--------------
minimum time: 105.454 ms (0.00% GC)
median time: 105.480 ms (0.00% GC)
mean time: 105.491 ms (0.00% GC)
maximum time: 105.541 ms (0.00% GC)
--------------
samples: 48
evals/sample: 1
For V100,
BenchmarkTools.Trial:
memory estimate: 576 bytes
allocs estimate: 24
--------------
minimum time: 21.097 ms (0.00% GC)
median time: 21.169 ms (0.00% GC)
mean time: 21.171 ms (0.00% GC)
maximum time: 23.847 ms (0.00% GC)
--------------
samples: 237
evals/sample: 1
and for RTX 2080Ti,
BenchmarkTools.Trial:
memory estimate: 576 bytes
allocs estimate: 24
--------------
minimum time: 20.237 ms (0.00% GC)
median time: 21.981 ms (0.00% GC)
mean time: 21.978 ms (0.00% GC)
maximum time: 24.720 ms (0.00% GC)
--------------
samples: 228
evals/sample: 1
Quite surprisingly, the A100 is the slowest by far. I would expect the A100 to be at least as fast as the V100 and RTX 2080Ti, even if not faster, since the benchmarked task does not take advantage of its new features.
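To put these timings in perspective, here is a rough sketch of the effective memory throughput they imply. It assumes that `CUDA.rand` produced `Float32` elements (its default) and that the broadcast reads and writes each element exactly once. `effective_bandwidth_gbs` is a hypothetical helper, not part of CUDA.jl or BenchmarkTools.

```julia
# Effective throughput of an in-place elementwise broadcast, in GB/s.
# Assumption: each element is read once and written once.
function effective_bandwidth_gbs(n_elements, seconds; elsize = sizeof(Float32))
    bytes_moved = 2 * elsize * n_elements  # one read + one write per element
    return bytes_moved / seconds / 1e9
end

n = 1024^3
# Median times reported above, converted to seconds.
medians = ["A100" => 105.480e-3, "V100" => 21.169e-3, "RTX 2080Ti" => 21.981e-3]

for (gpu, t) in medians
    println(rpad(gpu, 12), round(effective_bandwidth_gbs(n, t), digits = 1), " GB/s")
end
```

Under these assumptions the A100 run works out to roughly 80 GB/s, against roughly 400 GB/s for the other two cards, so the A100 is far from being limited by its memory bandwidth here.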
Looks strange to me too. If I remember correctly, the A100 can be configured in a split mode (MIG) where it appears as up to 7 distinct devices. Are you sure you have exclusive access to it?
Do you get the same results with 2D arrays (8192*8192)?
Thanks for the input! I didn’t know the A100 has a split mode. I am not a hundred percent sure, as I am using an HPC cluster maintained by my university, but what I do see is that my process can allocate up to the full GPU memory (~40G for A100), and the GPU utilization reported by nvidia-smi is 100%. The power draw is really low, though, always under 100W. In contrast, the RTX 2080Ti always runs at its full power of 250W. Could something be limiting power consumption?
For 8192*8192, I get on A100:
BenchmarkTools.Trial:
memory estimate: 576 bytes
allocs estimate: 24
--------------
minimum time: 4.529 ms (0.00% GC)
median time: 4.613 ms (0.00% GC)
mean time: 4.616 ms (0.00% GC)
maximum time: 5.667 ms (0.00% GC)
--------------
samples: 1076
evals/sample: 1
and for 32768^2, which has the same number of elements as 1024^3, I get
BenchmarkTools.Trial:
memory estimate: 576 bytes
allocs estimate: 24
--------------
minimum time: 70.220 ms (0.00% GC)
median time: 70.258 ms (0.00% GC)
mean time: 70.714 ms (0.00% GC)
maximum time: 91.338 ms (0.00% GC)
--------------
samples: 71
evals/sample: 1
and if I use a 1D array of the same length (1073741824), I now get
BenchmarkTools.Trial:
memory estimate: 576 bytes
allocs estimate: 24
--------------
minimum time: 37.500 ms (0.00% GC)
median time: 37.568 ms (0.00% GC)
mean time: 37.570 ms (0.00% GC)
maximum time: 37.606 ms (0.00% GC)
--------------
samples: 133
evals/sample: 1
It looks like the A100 pays a particularly heavy penalty for higher-dimensional arrays. For the RTX 2080Ti, it’s 17ms and 16ms for the 2D and 1D arrays.
Do you see a similar discrepancy among GPU timings when benchmarking
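Since the 1D layout is the fastest in these measurements, one possible workaround is to broadcast over a flattened view of the array. The sketch below uses a plain CPU `Array` so that it runs anywhere; it assumes that `reshape` on a `CuArray` shares memory with the original in the same way. `sin_flat!` is a hypothetical helper name.

```julia
# Apply sin in place through a 1D reshape of an n-D array.
function sin_flat!(a::AbstractArray)
    v = reshape(a, :)   # 1D array; shares memory with `a` for a dense Array
    v .= sin.(v)        # the broadcast now runs over a 1D shape
    return a
end

a = rand(Float32, 8, 8, 8)
expected = sin.(a)      # reference result computed on the 3D shape
sin_flat!(a)
@assert a == expected
@assert size(a) == (8, 8, 8)   # the original array keeps its shape
```

Whether this recovers the 1D timing on the A100 would need to be measured; it only avoids the n-D broadcast path and does not explain the penalty.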
@benchmark CUDA.@sync $a .= $a.*$a.*$a
for example ?
Did you check whether ECC memory is disabled on the A100 system? Do you have the same Julia and CUDA.jl versions installed across the different GPUs (if they are not on the same machine)?
For this benchmark, I also see the 3D array running about 3x slower than the 1D array on the A100 (131ms vs. 40ms median time), with both underperforming the RTX 2080Ti (25ms / 16ms for 3D and 1D).
They are on different machines but share a file system, so the Julia and CUDA.jl versions are the same.
Thank you for pointing that out. I found that ECC is enabled on the A100. What does this imply?
ECC memory maintains a memory system immune to single-bit errors. The checking mechanism will however impact performance and thus ECC is usually deactivated on scientific computing hardware unless ECC is critically needed.
The RTX card won’t have ECC memory. It would be interesting to see if the V100 card has ECC enabled as well. Also, you may ask the system admin of the cluster hosting the A100 to disable ECC on the A100 GPUs (root access and possibly a node reboot needed).
ECC memory (error-correcting code memory) is used to prevent spurious wrong results due to e.g. cosmic rays or random (extremely rare) bit flips due to faulty hardware. It comes at the cost of a small drop in performance and is usually only used in servers (though I’d like it in consumer-grade hardware as well, but the memory industry doesn’t like that since they can charge extra for ECC memory).
julia> using CUDA
julia> using BenchmarkTools
julia> a = CUDA.rand(1024, 1024,1024);
julia> @benchmark CUDA.@sync $a .= sin.($a)
BenchmarkTools.Trial:
memory estimate: 576 bytes
allocs estimate: 24
--------------
minimum time: 26.850 ms (0.00% GC)
median time: 28.100 ms (0.00% GC)
mean time: 27.944 ms (0.00% GC)
maximum time: 29.024 ms (0.00% GC)
--------------
samples: 179
evals/sample: 1
Well, I doubt ECC alone would cause this much of a slowdown. On old cards (Fermi architecture) it used to have quite an impact, but I have not assessed it again since then.
I have no further ideas about what could cause this slowdown on the A100. Many things could be sub-optimal (e.g. automatic grid/block splitting), but what is puzzling is that it occurs only on the A100. Maybe you could try more advanced profiling to see if you find further hints.
I did some more tests of broadcasting on n-D arrays.
Let me try to summarize what I see:
For the A100, V100, P100, K80 and M6000, an n-D array takes around n times as long as a 1D array with the same number of elements.
For the RTX 2080Ti, increasing the number of dimensions costs much less. A 4D array takes less than 2x the time of a 1D array.
I guess my question now is: why is the RTX 2080Ti so cool with CUDA.jl?
Edit:
For broadcasting, my feeling is that the code is not choosing the optimal number of grids/blocks/etc. for GPUs other than the RTX. On the A100 I was able to reduce the runtime by 30% compared to the built-in dot broadcast when I defined a kernel and chose a much larger number of blocks than launch_configuration recommends.
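A minimal sketch of that kind of experiment follows. It is not the kernel actually used for the 30% figure: `sin_kernel!`, `sin_custom!` and the `blocks_factor` parameter are hypothetical names, and the example only follows the usual CUDA.jl pattern of compiling with `@cuda launch=false` and querying `launch_configuration`. It needs a working GPU to do anything.

```julia
using CUDA

# Elementwise sin over linear indices. The grid-stride loop lets any
# number of blocks cover the whole array.
function sin_kernel!(a)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    i = index
    while i <= length(a)
        @inbounds a[i] = sin(a[i])
        i += stride
    end
    return nothing
end

# `blocks_factor` (hypothetical) scales the suggested block count.
function sin_custom!(a; blocks_factor = 1)
    kernel = @cuda launch=false sin_kernel!(a)
    config = launch_configuration(kernel.fun)   # occupancy-based suggestion
    threads = min(length(a), config.threads)
    blocks = min(cld(length(a), threads), config.blocks * blocks_factor)
    CUDA.@sync kernel(a; threads, blocks)
    return a
end

if CUDA.functional()
    a = CUDA.rand(256, 256, 256)        # smaller than the 1024^3 benchmark array
    sin_custom!(a)                      # suggested configuration
    sin_custom!(a; blocks_factor = 16)  # deliberately more blocks
end
```

Timing both calls with `@benchmark` on each GPU would show whether the suggested block count is the limiting factor.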
In addition, some linear algebra benchmarking is also showing RTX outperforming A100 on Julia:
using LinearAlgebra
A = CUDA.rand(1024*4);
B = CUDA.rand(1024*4, 1024*4);
C = CUDA.rand(1024*4);
@benchmark CUDA.@sync mul!($A, $B, $C)
This results in a median time of 137μs on the RTX and 131μs on the V100, but 159μs on the A100. I would have expected the A100 to be significantly faster in this case.
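For context, a rough cost model of this matrix-vector product, assuming `Float32` data and counting only the unavoidable traffic (read the matrix and input vector, write the output vector). `gemv_estimate` is a hypothetical helper and the numbers are estimates, not measurements.

```julia
# Rough cost model for y = B * x with an N×N Float32 matrix.
function gemv_estimate(N, seconds; elsize = sizeof(Float32))
    flops = 2 * N^2                  # one multiply and one add per matrix entry
    bytes = elsize * (N^2 + 2N)      # read B and x, write y
    return (gflops = flops / seconds / 1e9,
            gbs = bytes / seconds / 1e9,
            intensity = flops / bytes)   # FLOP per byte
end

N = 1024 * 4
# Median times reported above, in seconds.
for (gpu, t) in ["RTX" => 137e-6, "V100" => 131e-6, "A100" => 159e-6]
    est = gemv_estimate(N, t)
    println(rpad(gpu, 6), round(est.gbs, digits = 0), " GB/s, ",
            round(est.gflops, digits = 0), " GFLOP/s")
end
```

With about 0.5 FLOP per byte, the operation is limited by memory traffic rather than arithmetic, and under this model all three cards land at roughly 420 to 515 GB/s.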
A bad launch configuration is possible; we’re just re-using what CUDA tells us here. Maybe try running under Nsight Systems to see if there’s anything else going on? You can always post the qdrep files here for other people to look into. Just run something like:
using CUDA
using LinearAlgebra
A = CUDA.rand(1024*4);
B = CUDA.rand(1024*4, 1024*4);
C = CUDA.rand(1024*4);
mul!(A, B, C) # be sure to warm up
CUDA.@profile begin
for i in 1:3 # run multiple times, as the first run is sometimes slower
# use NVTX to clearly denote each run in the trace
NVTX.@range "mul!" CUDA.@sync mul!(A, B, C)
end
end
where filenames are LA for linear algebra and bc for broadcasting. What I see is that in most tests the V100 is fastest, followed by the RTX and then the A100, and in a few the RTX is almost as fast as the V100.
Code used for testing
For LA:
using CUDA
using SparseArrays
using LinearAlgebra
N = 4096
M = 1024
A = CUDA.rand(N);
B = CUDA.rand(N, N);
C = CUDA.rand(N);
D = CUDA.rand(N, M);
E = CUDA.rand(N, M);
sp = cu(sprand(N, N, 0.05))
# warm up
mul!(A, B, C)
mul!(E, B, D)
mul!(A, sp, C)
mul!(E, sp, D)
CUDA.@profile begin
for i in 1:3
NVTX.@range "mul! - MV" CUDA.@sync mul!(A, B, C)
end
for i in 1:3
NVTX.@range "mul! - MM" CUDA.@sync mul!(E, B, D)
end
for i in 1:3
NVTX.@range "mul! - spMV" CUDA.@sync mul!(A, sp, C)
end
for i in 1:3
NVTX.@range "mul! - spMM" CUDA.@sync mul!(E, sp, D)
end
end
For broadcasting:
using CUDA
using LinearAlgebra
N = 1024
A = CUDA.rand(N, N, N)
A1 = reshape(A, N^3)
A2 = reshape(A, N^2, N)
A3 = reshape(A, N, N, N)
#warm up
CUDA.@sync A1 .= sin.(A1)
CUDA.@sync A2 .= sin.(A2)
CUDA.@sync A3 .= sin.(A3)
CUDA.@profile begin
for i in 1:3
NVTX.@range "broadcast 1D" CUDA.@sync A1 .= sin.(A1)
end
for i in 1:3
NVTX.@range "broadcast 2D" CUDA.@sync A2 .= sin.(A2)
end
for i in 1:3
NVTX.@range "broadcast 3D" CUDA.@sync A3 .= sin.(A3)
end
end
Those traces just confirm your other measurements, and it doesn’t look like you’re doing anything wrong. That said, it’s good you included linear algebra benchmarks, because they show that CUBLAS and CUSPARSE also perform badly on your A100. So it’s unlikely that CUDA.jl is to blame here.
Did you confirm you’re not using a slice of the A100 (a MIG) instead of the entire device? Is the hardware properly wired/cooled? No device limits in place (e.g. with nvidia-smi it’s possible to set a power limit)? Or concurrent users?
I believe so, as the available GPU memory is the full 40G, and when my benchmark or other code is running, the GPU utilization goes from 0% to 100%. There were no other users on the same compute node when I ran the tests. Could there be anything else to check?
The max power limit shows 250W. Also, I recall seeing other users’ processes (not sure what they were running, though) draw slightly more power than I could get with Julia. My guess is that the A100 is not quite ready for linear algebra?
I have no idea how to check this, but I can ask the admin. Or is there a quick way to test the hypothesis?
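One quick check that needs no root access is to query the device state with nvidia-smi while the benchmark is running in another session. The sketch below only wraps the command in Julia. `GPU_FIELDS` and `gpu_status_cmd` are hypothetical names, and the choice of fields is an assumption about what is worth looking at (performance state, power draw vs. limit, current vs. maximum SM clock, ECC and MIG mode).

```julia
# Fields passed to `nvidia-smi --query-gpu`.
const GPU_FIELDS = ["name", "pstate", "power.draw", "power.limit",
                    "clocks.sm", "clocks.max.sm",
                    "ecc.mode.current", "mig.mode.current"]

# Build the command without running it.
gpu_status_cmd(fields = GPU_FIELDS) =
    `nvidia-smi --query-gpu=$(join(fields, ",")) --format=csv`

# Run it only where nvidia-smi is on the PATH.
if Sys.which("nvidia-smi") !== nothing
    run(ignorestatus(gpu_status_cmd()))
else
    println("nvidia-smi not found; command would be: ", gpu_status_cmd())
end
```

If the SM clock stays far below its maximum under load, or MIG mode is reported as enabled, that would be worth passing on to the admin.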