# Why is CUDA so slow on y = x*w + w0?

Hi, I am taking my first steps in GPU programming, using the array interface of CUDA.jl.

I am wondering why CUDA is so much slower than the CPU on matrix multiplication:

``````
julia> using CUDA, BenchmarkTools

julia> a = 100;
julia> b = 30;
julia> x  = rand(Float32, 1,a);
julia> w  = rand(Float32, a,b);
julia> w0 = rand(Float32, 1,b);
julia> x_g   = CuArray(x);
julia> w_g   = CuArray(w);
julia> w0_g  = CuArray(w0);

julia> function dense(x,w,w0)
           return x*w + w0
       end
dense (generic function with 1 method)

julia> function dense2(x,w,w0)
           x_g   = CuArray(x)
           w_g   = CuArray(w)
           w0_g  = CuArray(w0)
           return dense(x_g,w_g,w0_g) |> Array
       end
dense2 (generic function with 1 method)

julia> @benchmark dense($x,$w,$w0)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.246 μs …  2.441 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.333 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.329 μs ± 79.473 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 1.25 μs        Histogram: frequency by time        1.63 μs <

Memory estimate: 352 bytes, allocs estimate: 2.

julia> @benchmark dense2($x,$w,$w0)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  45.877 μs … 603.290 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     48.382 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   48.611 μs ±   5.741 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 45.9 μs         Histogram: frequency by time         54.5 μs <

Memory estimate: 2.00 KiB, allocs estimate: 56.

julia> @benchmark dense($x_g,$w_g,$w0_g)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.090 μs … 594.093 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.153 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.308 μs ±   5.798 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 16.1 μs         Histogram: frequency by time         19.1 μs <

Memory estimate: 1.08 KiB, allocs estimate: 34.
``````

I understand that the second benchmark is slow because of the conversion from CPU to GPU arrays, but why is the third one also slow?

Your inputs are tiny, so the time to compute is dwarfed by the time it takes to launch the kernels (which is several μs per kernel).
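
One way to check this is to repeat the measurement with more rows in `x`, so that each kernel launch has more work to amortize its fixed cost. The sketch below is only an illustration under a few assumptions: `dense_batched` and `compare` are hypothetical helper names, the batch sizes are arbitrary, and `.+` replaces `+` so that the `1×b` bias broadcasts over a batch. It also wraps the GPU call in `CUDA.@sync`, since GPU operations in CUDA.jl run asynchronously and a benchmark without synchronization may not capture the full computation.

```julia
using CUDA, BenchmarkTools

# Same computation as `dense` in the question; `.+` broadcasts the 1×b bias
# over all rows, so `x` may hold a batch of n inputs (hypothetical helper).
dense_batched(x, w, w0) = x * w .+ w0

# Time one batch size on CPU and GPU (hypothetical helper).
# @belapsed reports the minimum time in seconds; it is converted to microseconds here.
function compare(n; a = 100, b = 30)
    x, w, w0 = rand(Float32, n, a), rand(Float32, a, b), rand(Float32, 1, b)
    x_g, w_g, w0_g = CuArray(x), CuArray(w), CuArray(w0)
    t_cpu = @belapsed dense_batched($x, $w, $w0)
    # CUDA.@sync blocks until the GPU has finished, so the whole computation is timed
    t_gpu = @belapsed CUDA.@sync dense_batched($x_g, $w_g, $w0_g)
    return (n = n, cpu_us = 1e6 * t_cpu, gpu_us = 1e6 * t_gpu)
end

# Only run the comparison when a usable GPU is present
if CUDA.functional()
    for n in (1, 100, 10_000)   # arbitrary batch sizes, chosen for illustration
        println(compare(n))
    end
end
```

If the launch overhead is the explanation, the GPU time should stay roughly flat for small `n` while the CPU time grows with the amount of work. Where the two cross over, if they do, depends on the hardware.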
