I ran into some performance differences when moving data from the CPU to the GPU. The results are listed in the table below:
| Definition of `x` | `cpu(x)` | `gpu(x)` | `CuArray{Float32}(x)` |
|---|---|---|---|
| `a = rand(Float32, 100, 100, 32);` | 62.947 ns (2 allocations: 368 bytes) | 220.253 μs (12 allocations: 1.22 MiB) | 325.540 μs (12 allocations: 2.44 MiB) |
| `inds = rand(1:32, 32); b = @view a[:, :, inds];` | 1.816 μs (12 allocations: 2.30 KiB) | 311.065 μs (31 allocations: 1.22 MiB) | 762.167 μs (12 allocations: 2.44 MiB) |
| `c = reshape(b, 100, 100, 1, 32);` | 3.110 μs (15 allocations: 2.45 KiB) | 316.978 μs (34 allocations: 1.22 MiB) | 2.144 ms (14 allocations: 2.44 MiB) |
My questions are:
- Why is `cpu(c)` much slower than `cpu(b)`? I thought they both copy the same data, so there shouldn't be such a large difference.
- Why is `CuArray{Float32}(a)` slower than `gpu(a)`?
- Why is `gpu(c)` similar to `gpu(b)`, while `CuArray{Float32}(c)` is much slower than `CuArray{Float32}(b)`?
By the way, `gpu(c)` triggers an error:

    gpu(c)
    ArgumentError: invalid index: 23.0f0 of type Float32

Where should I report this issue? (In which repo?)
Code to reproduce the results in the table above:

    using CuArrays, BenchmarkTools, Flux
    a = rand(Float32, 100, 100, 32);
    inds = rand(1:32, 32);
    b = @view a[:, :, inds];
    c = reshape(b, 100, 100, 1, 32);
    for x in [a, b, c]
        @btime cpu($x);
        @btime gpu($x);
        @btime CuArray{Float32}($x);
    end
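
To make the premise behind my first question concrete, here is a minimal CPU-only sketch that uses only Base Julia (no CuArrays or Flux). It checks that `a`, `b`, and `c` hold the same values and differ only in their wrapper types. The names `b_dense` and `c_dense` are introduced here purely for illustration, and the sketch does not explain the timings by itself.

```julia
# CPU-only sketch: needs only Base, no GPU packages.
a = rand(Float32, 100, 100, 32)
inds = rand(1:32, 32)
b = @view a[:, :, inds]
c = reshape(b, 100, 100, 1, 32)

# Inspect the wrapper types: a is a plain Array, b is a SubArray over a,
# and c is a lazy reshape wrapped around that SubArray.
println(typeof(a))
println(typeof(b))
println(typeof(c))

# Materialize the lazy wrappers into contiguous arrays (illustrative names).
b_dense = collect(b)   # Array{Float32,3}
c_dense = collect(c)   # Array{Float32,4}

# The same values are present in every representation.
@assert b_dense == a[:, :, inds]
@assert vec(c_dense) == vec(b_dense)
```

One possible follow-up, which I have not timed, would be to run the three benchmarks on `b_dense` and `c_dense` to see whether the gaps come from the wrapper types rather than from the data.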