Why is CUDA so slow on y = x*w + w0?

Hi, I am taking my first steps in GPU programming, using the array interface of CUDA.jl.

I am wondering why CUDA is so much slower than the CPU on a matrix multiplication:

julia> using CUDA, BenchmarkTools

julia> a = 100;
julia> b = 30;
julia> x  = rand(Float32, 1,a);
julia> w  = rand(Float32, a,b);
julia> w0 = rand(Float32, 1,b);
julia> x_g   = CuArray(x);
julia> w_g   = CuArray(w);
julia> w0_g  = CuArray(w0);

julia> function dense(x,w,w0)
           return x*w + w0
       end
dense (generic function with 1 method)
julia> function dense2(x,w,w0)
           x_g   = CuArray(x)
           w_g   = CuArray(w)
           w0_g  = CuArray(w0)
           return dense(x_g,w_g,w0_g) |> Array
       end
dense2 (generic function with 1 method)

julia> @benchmark dense($x,$w,$w0)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.246 μs …  2.441 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.333 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.329 μs ± 79.473 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █          ▂▁ ▁                                            
  ██▇▅▄▄▃▂▂▁▁▃████▇▆▄▄▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.25 μs        Histogram: frequency by time        1.63 μs <

 Memory estimate: 352 bytes, allocs estimate: 2.

julia> @benchmark dense2($x,$w,$w0)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.877 μs … 603.290 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     48.382 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.611 μs ±   5.741 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▂▁▁▂▂▅▆▇█▆▆▃▁                                       
  ▁▁▁▁▂▃▅▇███████████████▆▅▄▃▃▃▂▂▂▂▂▂▂▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  45.9 μs         Histogram: frequency by time         54.5 μs <

 Memory estimate: 2.00 KiB, allocs estimate: 56.

julia> @benchmark dense($x_g,$w_g,$w0_g)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.090 μs … 594.093 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.153 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.308 μs ±   5.798 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▂▂▄▅▆▇███▆▆▅▄▃▃▁                                  
  ▁▁▁▁▁▂▂▂▃▅▅▇████████████████▇█▇▆▅▅▅▄▄▄▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▄
  16.1 μs         Histogram: frequency by time         19.1 μs <

 Memory estimate: 1.08 KiB, allocs estimate: 34.

I understand that the second benchmark is slow because of the conversion from CPU to GPU arrays, but why is the third one also slow?

Your inputs are tiny, so the time spent computing is dwarfed by the time it takes to launch the kernels (several μs per kernel).
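
To see this on your own machine, here is a rough sketch that runs the same layer on a whole batch of inputs at once, so that the fixed launch cost is spread over more work. The names `dense_batched` and `compare` and the batch sizes are made up for illustration. Note that it uses `.+` instead of `+` so the 1×b bias broadcasts over every row of the batch, and that it wraps the GPU call in `CUDA.@sync`, since GPU operations are launched asynchronously and the timing would otherwise not wait for them to finish.

```julia
using CUDA, BenchmarkTools

# Same layer as in the question, but `.+` broadcasts the 1×b bias over all rows,
# so x may hold a whole batch of inputs (one input per row).
dense_batched(x, w, w0) = x * w .+ w0

a, b = 100, 30
w  = rand(Float32, a, b)
w0 = rand(Float32, 1, b)

# Time one batch of n inputs on the CPU and, if a GPU is usable, on the GPU.
function compare(n)
    x = rand(Float32, n, a)
    t_cpu = @belapsed dense_batched($x, $w, $w0)
    t_gpu = NaN  # stays NaN when no functional GPU is available
    if CUDA.functional()
        x_g, w_g, w0_g = CuArray(x), CuArray(w), CuArray(w0)
        # GPU calls return before the work is done; CUDA.@sync waits for it.
        t_gpu = @belapsed CUDA.@sync dense_batched($x_g, $w_g, $w0_g)
    end
    # @belapsed returns seconds; convert to microseconds.
    return (n = n, cpu_μs = 1e6 * t_cpu, gpu_μs = 1e6 * t_gpu)
end

# Batch sizes are arbitrary, picked only for illustration.
for n in (1, 1_000, 100_000)
    println(compare(n))
end
```

If launch overhead is what dominates, the GPU time should barely move between the small batch sizes while the CPU time grows with `n`. Where the two cross over depends on your hardware, so I would not assume a particular number.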
