Why is CUDA so slow on y = x*w + w0?

sylvaticus · March 26, 2024, 1:41pm

Hi, I am taking my first steps in GPU programming, using the array interface of CUDA.jl.

I am wondering why CUDA is so much slower than the CPU on this matrix multiplication:

julia> using CUDA, BenchmarkTools

julia> a = 100;
julia> b = 30;
julia> x  = rand(Float32, 1,a);
julia> w  = rand(Float32, a,b);
julia> w0 = rand(Float32, 1,b);
julia> x_g   = CuArray(x);
julia> w_g   = CuArray(w);
julia> w0_g  = CuArray(w0);

julia> function dense(x,w,w0)
           return x*w + w0
       end
dense (generic function with 1 method)
julia> function dense2(x,w,w0)
           x_g   = CuArray(x)
           w_g   = CuArray(w)
           w0_g  = CuArray(w0)
           return dense(x_g,w_g,w0_g) |> Array
       end
dense2 (generic function with 1 method)

julia> @benchmark dense($x,$w,$w0)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.246 μs …  2.441 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.333 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.329 μs ± 79.473 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █          ▂▁ ▁                                            
  ██▇▅▄▄▃▂▂▁▁▃████▇▆▄▄▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.25 μs        Histogram: frequency by time        1.63 μs <

 Memory estimate: 352 bytes, allocs estimate: 2.

julia> @benchmark dense2($x,$w,$w0)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.877 μs … 603.290 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     48.382 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.611 μs ±   5.741 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▂▁▁▂▂▅▆▇█▆▆▃▁                                       
  ▁▁▁▁▂▃▅▇███████████████▆▅▄▃▃▃▂▂▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  45.9 μs         Histogram: frequency by time         54.5 μs <

 Memory estimate: 2.00 KiB, allocs estimate: 56.

julia> @benchmark dense($x_g,$w_g,$w0_g)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.090 μs … 594.093 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.153 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.308 μs ±   5.798 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▂▂▄▅▆▇███▆▆▅▄▃▃▁                                  
  ▁▁▁▁▁▂▂▂▃▅▅▇████████████████▇█▇▆▅▅▅▄▄▄▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▄
  16.1 μs         Histogram: frequency by time         19.1 μs <

 Memory estimate: 1.08 KiB, allocs estimate: 34.

I understand that the second benchmark is slow because of the conversion from CPU to GPU arrays (and back), but why is the third one also slow?

maleadt · March 28, 2024, 10:43am

Your inputs are tiny, and the time to compute is dwarfed by the time to launch the kernels (which is multiple μs per kernel).
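
One way to check the launch-overhead explanation is to keep the weights fixed and grow the number of rows in x, so that each kernel launch does more work. The following is only a minimal sketch, not something from the thread: the batch sizes are hypothetical, `dense` is rewritten with a broadcasted `.+` so that w0 can be added to every row, and `CUDA.@sync` is added so that the timing waits for the GPU to finish. How the timings compare will depend on your hardware.

```julia
using CUDA, BenchmarkTools

# Same layer as in the question, but with a broadcasted add so that
# x may hold many rows (a batch) instead of a single one.
dense(x, w, w0) = x * w .+ w0

a, b = 100, 30
w  = rand(Float32, a, b)
w0 = rand(Float32, 1, b)
w_g, w0_g = CuArray(w), CuArray(w0)

# Hypothetical batch sizes, chosen only for illustration.
for n in (1, 1_000, 100_000)
    x   = rand(Float32, n, a)
    x_g = CuArray(x)            # transfer once, outside the timed region

    t_cpu = @belapsed dense($x, $w, $w0)
    # CUDA.@sync blocks until the GPU work has completed, so the
    # measurement covers the computation and not just the launch.
    t_gpu = @belapsed CUDA.@sync dense($x_g, $w_g, $w0_g)

    println("n = $n: CPU $(t_cpu) s, GPU $(t_gpu) s")
end
```

If the explanation holds, the GPU time should stay close to the fixed launch cost for small n, while the CPU time grows with n, so the GPU only becomes competitive once the batch is large enough.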
