Hi, I am taking my first steps in GPU programming, using the array interface of CUDA.jl.
I am wondering why the GPU is so much slower than the CPU on this matrix multiplication:
julia> using CUDA, BenchmarkTools
julia> a = 100;
julia> b = 30;
julia> x = rand(Float32, 1,a);
julia> w = rand(Float32, a,b);
julia> w0 = rand(Float32, 1,b);
julia> x_g = CuArray(x);
julia> w_g = CuArray(w);
julia> w0_g = CuArray(w0);
julia> function dense(x,w,w0)
           return x*w + w0
       end
dense (generic function with 1 method)
julia> function dense2(x,w,w0)
           x_g = CuArray(x)
           w_g = CuArray(w)
           w0_g = CuArray(w0)
           return dense(x_g,w_g,w0_g) |> Array
       end
dense2 (generic function with 1 method)
julia> @benchmark dense($x,$w,$w0)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.246 μs … 2.441 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.333 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.329 μs ± 79.473 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
1.25 μs Histogram: frequency by time 1.63 μs <
Memory estimate: 352 bytes, allocs estimate: 2.
julia> @benchmark dense2($x,$w,$w0)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 45.877 μs … 603.290 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 48.382 μs ┊ GC (median): 0.00%
Time (mean ± σ): 48.611 μs ± 5.741 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
45.9 μs Histogram: frequency by time 54.5 μs <
Memory estimate: 2.00 KiB, allocs estimate: 56.
julia> @benchmark dense($x_g,$w_g,$w0_g)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.090 μs … 594.093 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 17.153 μs ┊ GC (median): 0.00%
Time (mean ± σ): 17.308 μs ± 5.798 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
16.1 μs Histogram: frequency by time 19.1 μs <
Memory estimate: 1.08 KiB, allocs estimate: 34.
I understand that the second benchmark is slow because of the conversion from CPU to GPU arrays, but why is the third one also slow?