GPU slower than CPU for simple benchmarks

I recently upgraded the CPU on my desktop:

julia> versioninfo()
Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 9700X 8-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, generic)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

My GPU is a few years old, though.

julia> dev = AMDGPU.device()
┌────┬───────────────────────┬──────────┬───────────┬───────────┐
│ Id │                  Name │ GCN arch │ Wavefront │    Memory │
├────┼───────────────────────┼──────────┼───────────┼───────────┤
│  1 │ AMD Radeon RX 6600 XT │  gfx1030 │        32 │ 7.984 GiB │
└────┴───────────────────────┴──────────┴───────────┴───────────┘

Comparing the performance of the CPU and the GPU on a simple matrix-matrix multiply, the GPU is about 50% slower.

julia> A = rand(2^9, 2^9);
julia> @btime $A * $A;
  319.290 μs (2 allocations: 2.00 MiB)
julia> A_d = ROCArray(A);
julia> @btime begin
           $A_d * $A_d
           AMDGPU.synchronize()
       end
  473.160 μs (319 allocations: 7.50 KiB)
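
For scale, those timings translate into rough flop rates (back-of-the-envelope arithmetic only; 2N^3 is the standard flop count for a dense matrix-matrix multiply):

N = 2^9
flops = 2 * N^3             # ≈ 2.7e8 flops
flops / 319.290e-6 / 1e9    # CPU: ≈ 840 GFLOPS (Float64)
flops / 473.160e-6 / 1e9    # GPU: ≈ 570 GFLOPS (Float64)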

I tried another experiment, comparing times for an LU factorization.

using AMDGPU, BenchmarkTools
import LinearAlgebra: LAPACK
import AMDGPU: rocSOLVER

N = 2^10;
A = randn(N, N);
A_d = ROCArray(A);
ipiv_d = ROCArray(zeros(Int32, N));   # rocSOLVER wants a device-side Int32 pivot vector

# CPU: LAPACK allocates its own pivot vector
@btime begin
    A, ipiv, info = LAPACK.getrf!($A)
end

# GPU: synchronize so the timing includes the factorization itself
@btime begin
    A, ipiv, info = rocSOLVER.getrf!($A_d, $ipiv_d)
    AMDGPU.synchronize()
end

This time, the GPU took about 4.6x as long:

  2.006 ms (1 allocation: 8.12 KiB)
  9.178 ms (27 allocations: 960 bytes)
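
As a sanity check that the two routines compute the same factorization, one can compare the factors on a fresh matrix (a sketch using only the calls above; the pivot vectors are skipped since the CPU returns Int64 and the GPU uses Int32):

B = randn(N, N)
F_cpu, _, _ = LAPACK.getrf!(copy(B))
F_gpu, _, _ = rocSOLVER.getrf!(ROCArray(B), ROCArray(zeros(Int32, N)))
AMDGPU.synchronize()
maximum(abs.(Array(F_gpu) .- F_cpu))   # should be near machine precision if both pick the same pivots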

According to AMDGPU.versioninfo(), I have rocSOLVER 3.26.0 from

/opt/rocm-6.2.1/lib/librocsolver.so

Is this relative performance to be expected from the hardware I am using?

Consumer GPUs have very weak Float64 performance. You’ll see much better results with A = rand(Float32, 2^9, 2^9);
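
If you already have Float64 data in hand, a minimal sketch of down-converting it rather than regenerating it (the broadcast should also work directly on a ROCArray via GPU broadcasting):

A32 = Float32.(A)          # convert on the host
A32_d = ROCArray(A32)      # or convert on the device: Float32.(A_d)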


With two exceptions (both with unusually strong FP64 throughput):
AMD Radeon Pro VII: 6.5 TFLOPS (~400 EUR on eBay)
AMD Radeon VII: 3.52 TFLOPS (~200 EUR on eBay)

A complete list is available in a Reddit thread.

You may want to try removing the call to synchronize(), unless you are using specific tasks or running on streams with different priorities (see Profiling · AMDGPU.jl).

Thanks for these responses. The matrix-matrix multiply using Float32 shows very different results.

julia> A = randn(Float32, 2^9, 2^9);
julia> @btime $A * $A;
  148.233 μs (2 allocations: 1.00 MiB)
julia> A_d = ROCArray(A);
julia> @btime $A_d * $A_d;
  3.994 μs (12 allocations: 336 bytes)
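
One caveat here: 3.994 μs for a 512×512 Float32 multiply would imply roughly 67 TFLOPS, far beyond what this card can do, so without a synchronization point @btime is most likely measuring only the kernel launch. A synchronized variant for comparison (a sketch, same calls as earlier):

@btime begin
    $A_d * $A_d
    AMDGPU.synchronize()   # block until the multiply has actually finished
end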

However, LU factorization of a Float32 array is still faster on the CPU.

  1.303 ms (1 allocation: 8.12 KiB)
  5.078 ms (19 allocations: 544 bytes)

Removing AMDGPU.synchronize() did not make any difference to the runtime.

It could be that the problem size is too small to see the benefit of the GPU.
On my machine, with a bigger problem size, the GPU is ~2x faster:

  • CPU: 16 × AMD Ryzen 7 5800X 8-Core Processor
  • GPU: Radeon RX 7900 XTX (gfx1100)
julia> A = randn(Float32, 2^12, 2^12);

julia> ipiv_d = ROCArray(zeros(Int32, 2^12));

julia> Ad = ROCArray(A);

julia> @btime begin
           LinearAlgebra.LAPACK.getrf!($A);
       end
  100.515 ms (2 allocations: 32.05 KiB)

julia> @btime begin
           AMDGPU.rocSOLVER.getrf!($Ad, $ipiv_d)
           AMDGPU.synchronize()
       end
  46.101 ms (27 allocations: 960 bytes)
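
To locate the CPU/GPU crossover on a particular machine, something along these lines can help (a sketch; it assumes the same getrf! methods used above and takes fresh copies per evaluation so the factorization never re-runs on already-factored data):

using BenchmarkTools, LinearAlgebra, AMDGPU
import AMDGPU: rocSOLVER

for p in 10:13
    N = 2^p
    A = randn(Float32, N, N)
    A_d = ROCArray(A)
    ipiv_d = ROCArray(zeros(Int32, N))
    # fresh copies each evaluation, so getrf! never sees an already-factored matrix
    t_cpu = @belapsed LinearAlgebra.LAPACK.getrf!(B) setup=(B = copy($A)) evals=1
    t_gpu = @belapsed begin
        rocSOLVER.getrf!(B_d, $ipiv_d)
        AMDGPU.synchronize()
    end setup=(B_d = copy($A_d)) evals=1
    println("N = $N: CPU $(round(t_cpu * 1e3, digits = 1)) ms, GPU $(round(t_gpu * 1e3, digits = 1)) ms")
end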

His CPU should be over 2x faster than yours.

Gamers may have panned the 9000 series, but it should be around 2x faster per clock cycle than the 5000 and 7000 series for linear algebra: Zen 5 has full-width 512-bit FP datapaths, whereas Zen 4 double-pumps AVX-512 through 256-bit units and Zen 3 has no AVX-512 at all.


Elrod is correct. I ran the second experiment with larger matrix sizes (and Float32).

N = 2^12 = 4096
  34.094 ms (2 allocations: 32.05 KiB)
  55.552 ms (23 allocations: 768 bytes)

(The 9700X is actually more like 3 times faster than the 5800X here.) The GPU starts winning if I double N again.

N = 2^13 = 8192
  250.291 ms (2 allocations: 64.05 KiB)
  198.557 ms (23 allocations: 768 bytes)
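
Converting the N = 8192 timings into rough flop rates (arithmetic only; (2/3)N^3 is the leading-order flop count for LU with partial pivoting):

N = 2^13
flops = (2 / 3) * N^3           # ≈ 3.7e11 flops
flops / 250.291e-3 / 1e9        # CPU: ≈ 1460 GFLOPS (Float32)
flops / 198.557e-3 / 1e9        # GPU: ≈ 1850 GFLOPS (Float32)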

Interesting that the RX 7900 XTX is only about 20% faster than the RX 6600 XT for N = 4096. According to TechPowerUp, the FP32 TFLOPS for these GPUs are 81.1 and 35.17, respectively.
