CUDA matmul performance

Hello everyone, I have recently been trying to GPUify my code.
I first tried replacing my CPU Arrays with CuArrays as recommended; however, that resulted in slower program execution.

So I’m trying to dig a bit deeper and see where the GPU is not performing as well.
One bottleneck I have run into is just a simple matrix multiplication.

# Testing matmul

using CUDA

N = 1_000_000
dim = 2

A = randn(Float32, dim, dim)
B = randn(Float32, dim, N)
C = zeros(Float32, dim, N)

AC = CUDA.randn(Float32, dim, dim)
BC = CUDA.randn(Float32, dim, N)
CC = CUDA.zeros(Float32, dim, N)

function cpu_matmul(C, A, B)
    C = A * B
end

@time cpu_matmul(C, A, B)

function gpu_matmul(CC, AC, BC)
    CC = AC * BC
end

CUDA.@time gpu_matmul(CC, AC, BC)

gives me the following output

  0.005610 seconds (1.19 k allocations: 7.691 MiB)
  0.017618 seconds (1.34 k CPU allocations: 68.410 KiB) (1 GPU allocation: 7.629 MiB, 0.03% gc time)

Based on this, I feel like I am doing something incorrectly. Does anyone have suggestions on increasing the GPU matmul performance?

Whenever possible, it’s best to use BenchmarkTools.jl, which helps avoid many common benchmarking pitfalls. Using the syntax suggested in the profiling section of the CUDA.jl manual, and using the in-place mul! function instead of plain assignment (as you’re doing), I see the following timings:

julia> using BenchmarkTools

julia> @btime mul!(C, A, B)
  913.100 μs (0 allocations: 0 bytes)
2×1000000 Array{Float32,2}:
 -1.08557    2.99683  -0.275914  …   0.107457   0.599684    0.0584937
  0.704835  -1.84877   0.201621     -1.49509   -0.0140289  -3.25898

julia> @btime CUDA.@sync mul!(CC, AC, BC)
  422.600 μs (14 allocations: 400 bytes)
2×1000000 CuArray{Float32,2}:
 -3.29135  -1.43226   -1.27146   -1.66526   …  0.812811    3.1783   -0.49792
  2.50186   0.514047   0.654348   0.797224     0.0042752  -1.52651   0.182584

To expand on in-place vs. re-binding assignment, look at the value of C before and after cpu_matmul:

julia> C = zeros(Float32, dim, N)
2×1000000 Array{Float32,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0

julia> cpu_matmul(C, A, B)
2×1000000 Array{Float32,2}:
 -1.08557    2.99683  -0.275914  …   0.107457   0.599684    0.0584937
  0.704835  -1.84877   0.201621     -1.49509   -0.0140289  -3.25898

julia> C
2×1000000 Array{Float32,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0

It’s unchanged! That’s because instead of replacing the values inside that array, the assignment inside the function rebound the local name C to the new array produced by A * B, which was then returned. You can use broadcasted assignment (.=) to replace the values in an array, but it’s better yet to use the well-optimized, built-in mul!.

julia> function cpu_matmul!(C, A, B)
           C .= A * B
       end
cpu_matmul! (generic function with 1 method)

julia> @btime cpu_matmul!(C, A, B)
  2.118 ms (2 allocations: 7.63 MiB)
2×1000000 Array{Float32,2}:
 -1.08557    2.99683  -0.275914  …   0.107457   0.599684    0.0584937
  0.704835  -1.84877   0.201621     -1.49509   -0.0140289  -3.25898

(see? much slower than mul!)
The FAQ section of the manual mentions this behavior.


Thanks for replying @stillyslalom!

Thanks for pointing me towards in-place functions (Linear Algebra · The Julia Language). This is definitely something I need to leverage in my code (whether CPU or GPU).

As for the comparison on the GPU, I am still not getting the performance I would like to see:
Here is my updated script with your suggestions as well as the output on my machine:

using CUDA
using BenchmarkTools
using LinearAlgebra

N = 1_000_000
dim = 2

A = randn(Float32, dim, dim)
B = randn(Float32, dim, N)
C = zeros(Float32, dim, N)

AC = CUDA.randn(Float32, dim, dim)
BC = CUDA.randn(Float32, dim, N)
CC = CUDA.zeros(Float32, dim, N)

function cpu_matmul1(C, A, B)
    C .= A * B
end

function cpu_matmul2(C, A, B)
    mul!(C, A, B)
end

function gpu_matmul1(CC, AC, BC)
    CC .= AC * BC
end

function gpu_matmul2(CC, AC, BC)
    mul!(CC, AC, BC)
end

@btime cpu_matmul1(C, A, B)
@btime cpu_matmul2(C, A, B)

@btime CUDA.@sync gpu_matmul1(CC, AC, BC)
@btime CUDA.@sync gpu_matmul2(CC, AC, BC)

Output:

  2.486 ms (2 allocations: 7.63 MiB)
  1.819 ms (0 allocations: 0 bytes)
  13.942 ms (48 allocations: 1.22 KiB)
  12.947 ms (15 allocations: 416 bytes)

What are the specs of your CPU & GPU?

Separately, is it possible to re-work your code to use Cᵀ = Bᵀ Aᵀ? Matrix multiplication is sensitive to memory layout, and you can get some additional speed gains here.

julia> Aᵀ = collect(A'); Bᵀ = collect(B'); Cᵀ = collect(C');

julia> ACᵀ = CUDA.rand(Float32, dim, dim); BCᵀ = CUDA.randn(Float32, N, dim); CCᵀ = CUDA.zeros(Float32, N, dim);

julia> @btime mul!(Cᵀ, Bᵀ, Aᵀ);
  615.200 μs (0 allocations: 0 bytes)

julia> @btime @CUDA.sync mul!(CCᵀ, BCᵀ, ACᵀ);
  141.400 μs (14 allocations: 400 bytes)

Yeah, I thought it might be a mismatch between my CPU and GPU.
CPU: Intel i7-7500U (2 cores) @ 2.7 GHz
(Intel Core i7-7500U Processor, 4M Cache, up to 3.50 GHz, Product Specifications)
GPU: NVIDIA GeForce 940MX, 2 GB VRAM
(NVIDIA GeForce 940MX Specs | TechPowerUp GPU Database)

Yeah, I could try using transposes in my code to speed things up.

Here are my results with transposes:

using CUDA
using BenchmarkTools
using LinearAlgebra

N = 1_000_000
dim = 2

A = randn(Float32, dim, dim)
B = randn(Float32, dim, N)
C = zeros(Float32, dim, N)

Aᵀ = collect(A'); Bᵀ = collect(B'); Cᵀ = collect(C');

AC = CUDA.randn(Float32, dim, dim)
BC = CUDA.randn(Float32, dim, N)
CC = CUDA.zeros(Float32, dim, N)

ACᵀ = CUDA.rand(Float32, dim, dim); BCᵀ = CUDA.randn(Float32, N, dim); CCᵀ = CUDA.zeros(Float32, N, dim);

@btime mul!(C, A, B)
@btime mul!(Cᵀ, Bᵀ, Aᵀ)

@btime CUDA.@sync mul!(CC, AC, BC)
@btime CUDA.@sync mul!(CCᵀ, BCᵀ, ACᵀ)

Output

  1.843 ms (0 allocations: 0 bytes)
  8.006 ms (0 allocations: 0 bytes)
  12.993 ms (15 allocations: 416 bytes)
  1.793 ms (15 allocations: 416 bytes)

I didn’t include the reshaping operation to get the result back into the layout I want, since I assume that operation is relatively quick.
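
For reference, the cost of that reshaping step could be checked with something like the following (a rough sketch; it assumes permutedims! dispatches to a GPU-backed method for CuArrays):

@btime permutedims!(C, Cᵀ, (2, 1))               # copy Cᵀ back into the original dim × N layout
@btime CUDA.@sync permutedims!(CC, CCᵀ, (2, 1))  # same copy on the GPU (assumes a CuArray method for permutedims!)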

It is interesting to note that the transposed layout is worse on the CPU, while on the GPU it really speeds up the computation.

Edit: Actually, sometimes the performance for the CPU transpose is better:

  1.864 ms (0 allocations: 0 bytes)
  1.175 ms (0 allocations: 0 bytes)
  13.057 ms (15 allocations: 416 bytes)
  1.809 ms (15 allocations: 416 bytes)

I think you’ll struggle to find many GPU vs. CPU victories when comparing an older mid-tier mobile graphics card with limited CUDA support to an i7 of the same generation. There may still be some advantage to using your GPU if you have independent, CPU-intensive calculations that can be performed simultaneously, but unless your code takes hours to run, it’s probably better to just stick to the CPU until you have access to better hardware.


My experience with a GT 940M is that even mid-range desktop CPUs with AVX2 can run circles around it for BLAS-heavy workloads. The GT 940MX should be beefier and has GDDR5 instead of DDR3, but that’s probably not enough to make up the performance delta. That said, it’s surprising how close the results are to an ultra-low-voltage mobile CPU; I’m not sure if that speaks well for Intel or poorly for Nvidia!

Thanks for the help @stillyslalom and the insight @ToucheSir

If I get access to a better GPU I’ll try the timing again and see how the performance matches up. Nonetheless learned some good things from this discussion. Thanks!

You may also want to look at LoopVectorization (via Tullio), which does some sophisticated things with loop unrolling, reordering, and padding that can help with high-aspect-ratio matmul without needing to transpose.

julia> using Tullio, LoopVectorization, BenchmarkTools

julia> tulmul!(C, A, B) = @tullio C[i,k] = A[i,j] * B[j,k]
tulmul! (generic function with 1 method)

julia> @btime tulmul!($C, $A, $B)
  667.100 μs (276 allocations: 20.02 KiB)

julia> @btime mul!(Cᵀ, Bᵀ, Aᵀ);
  615.200 μs (0 allocations: 0 bytes)

What does the relative performance look like if the matrices aren’t so high-aspect-ratio? (Last week I was working on GPU-ifying some code which was bottlenecked by square matrix multiplication, and found big speed increases from moving to CUDA.)

My laptop’s on battery at the moment, so I can’t quote any accurate benchmark numbers, but I saw an ~8x improvement on my machine for CuBLAS versus OpenBLAS for N = 1024 square matmul when I tested earlier today. That’s entirely dependent on your CPU/GPU setup, though! I didn’t test LoopVectorization (/Tullio), but it’s typically weaker than OpenBLAS for large square matmul. Instruction cost modeling & loop unrolling & padding are all really useful for oddly-shaped matrix operations, but once you’re working with matrices bigger than L1/L2(/L3?) cache, memory traffic becomes a bottleneck, and LoopVectorization doesn’t handle that just yet. I think there are plans to incorporate a block-major storage format to ameliorate that bottleneck.
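
For anyone who wants to reproduce that square-matmul comparison on their own hardware, a minimal sketch along these lines should do (the numbers will depend entirely on the CPU/GPU in question):

using CUDA, LinearAlgebra, BenchmarkTools

N = 1024
A = randn(Float32, N, N); B = randn(Float32, N, N); C = zeros(Float32, N, N)
AC = CuArray(A); BC = CuArray(B); CC = CUDA.zeros(Float32, N, N)

@btime mul!($C, $A, $B)                # in-place square matmul on the CPU (OpenBLAS)
@btime CUDA.@sync mul!($CC, $AC, $BC)  # same on the GPU (cuBLAS), synced so the whole operation is timed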
