I recently updated the CPU on my desktop:
julia> versioninfo()
Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 9700X 8-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, generic)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
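One note on the "Threads: 1 default" line above: the dense CPU kernels I time below go through OpenBLAS, which runs its own thread pool independently of Julia's --threads setting. In case it matters for interpreting the CPU numbers, the BLAS thread count can be checked (or pinned) like this:
using LinearAlgebra

# BLAS threading is configured separately from Julia's thread count.
BLAS.get_num_threads()       # threads used by the CPU matmul / LU timings below
# BLAS.set_num_threads(8)    # optionally pin it explicitly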
My GPU is a few years old, though.
julia> using AMDGPU
julia> dev = AMDGPU.device()
┌────┬───────────────────────┬──────────┬───────────┬───────────┐
│ Id │                  Name │ GCN arch │ Wavefront │    Memory │
├────┼───────────────────────┼──────────┼───────────┼───────────┤
│  1 │ AMD Radeon RX 6600 XT │  gfx1030 │        32 │ 7.984 GiB │
└────┴───────────────────────┴──────────┴───────────┴───────────┘
For a simple matrix-matrix multiply, the GPU turns out to be about 50% slower than the CPU.
julia> using BenchmarkTools
julia> A = rand(2^9, 2^9);
julia> @btime $A * $A;
319.290 μs (2 allocations: 2.00 MiB)
julia> A_d = ROCArray(A);
julia> @btime begin
           $A_d * $A_d;
           AMDGPU.synchronize()
       end
473.160 μs (319 allocations: 7.50 KiB)
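In case it is useful, this is the kind of size sweep I was planning to run next, to see whether the gap closes for larger matrices. bench_matmul is just a throwaway helper of my own; the only AMDGPU.jl calls it relies on are the ones already used above (ROCArray and AMDGPU.synchronize).
using AMDGPU, BenchmarkTools

# Throwaway helper: time an n×n Float64 matrix-matrix multiply on CPU and GPU.
function bench_matmul(n)
    A   = rand(n, n)
    A_d = ROCArray(A)
    t_cpu = @belapsed $A * $A
    t_gpu = @belapsed begin
        $A_d * $A_d
        AMDGPU.synchronize()   # wait for the kernel before the timer stops
    end
    return (; n, t_cpu, t_gpu, ratio = t_gpu / t_cpu)
end

for n in (2^9, 2^10, 2^11, 2^12)
    println(bench_matmul(n))
end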
I tried another experiment, comparing times for an LU factorization.
import LinearAlgebra: LAPACK
import AMDGPU: rocSOLVER
N = 2^10;
A = randn(N, N);
A_d = ROCArray(A);
ipiv = zeros(Int64, N);
ipiv_d = ROCArray(zeros(Int32, N));
@btime begin
    A, ipiv, info = LAPACK.getrf!($A);
end
@btime begin
    A, ipiv, info = rocSOLVER.getrf!($A_d, $ipiv_d)
    AMDGPU.synchronize()
end
This time, the GPU took about 4.6 times as long as the CPU:
2.006 ms (1 allocation: 8.12 KiB)
9.178 ms (27 allocations: 960 bytes)
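One thing I was not sure about when timing this: getrf! factorizes in place, so every sample after the first re-factorizes an already-factorized matrix. A variant that hands each sample a fresh copy would look like this (setup and evals are standard BenchmarkTools keywords; the variables are the ones defined in the snippet above):
# Assumes A, ipiv_d, LAPACK, rocSOLVER from the LU snippet above.
# evals=1 reruns the setup before every evaluation, so each getrf! sees a fresh matrix.
@btime LAPACK.getrf!(B) setup=(B = copy($A)) evals=1;

@btime (rocSOLVER.getrf!(B_d, $ipiv_d); AMDGPU.synchronize()) setup=(B_d = ROCArray($A)) evals=1;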
According to AMDGPU.versioninfo(), I have rocSOLVER 3.26.0 from /opt/rocm-6.2.1/lib/librocsolver.so.
Is this relative performance to be expected from the hardware I am using?