Performance differences when transferring (SubArray, ReshapedArray) arrays to the GPU

I encountered some performance differences when moving data from the CPU to the GPU. The results are listed in the table below:

| method | `cpu(x)` | `gpu(x)` | `CuArray{Float32}(x)` |
|---|---|---|---|
| `a = rand(Float32, 100, 100, 32);` | 62.947 ns (2 allocations: 368 bytes) | 220.253 μs (12 allocations: 1.22 MiB) | 325.540 μs (12 allocations: 2.44 MiB) |
| `inds = rand(1:32, 32); b = @view a[:, :, inds];` | 1.816 μs (12 allocations: 2.30 KiB) | 311.065 μs (31 allocations: 1.22 MiB) | 762.167 μs (12 allocations: 2.44 MiB) |
| `c = reshape(b, 100, 100, 1, 32);` | 3.110 μs (15 allocations: 2.45 KiB) | 316.978 μs (34 allocations: 1.22 MiB) | 2.144 ms (14 allocations: 2.44 MiB) |

My questions are:

  1. Why is `cpu(c)` so much slower than `cpu(b)`?
    I thought they both copy the same data, so shouldn't the timings be similar?
  2. Why is `CuArray{Float32}(a)` slower than `gpu(a)`?
  3. Why is `gpu(c)` similar to `gpu(b)`, while `CuArray{Float32}(c)` is much slower than `CuArray{Float32}(b)`?

By the way, `gpu(c)` will trigger an error:

```
ArgumentError: invalid index: 23.0f0 of type Float32
```
Where should I report this issue? (In which repo?)

Code to reproduce the results in the table above:

```julia
using CuArrays, BenchmarkTools, Flux

a = rand(Float32, 100, 100, 32);
inds = rand(1:32, 32);
b = @view a[:, :, inds];
c = reshape(b, 100, 100, 1, 32);

for x in [a, b, c]
    @btime cpu($x);
    @btime gpu($x);
    @btime CuArray{Float32}($x);
end
```
Only displaying it does; you actually get a proper object back. You can report this at the CuArrays repo.
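To see that the transfer itself succeeds and only printing fails, you can separate the two steps. A minimal sketch (assuming a working CuArrays/Flux GPU setup; the commented-out `display` line is what reproduces the error):

```julia
using CuArrays, Flux

a = rand(Float32, 100, 100, 32)
b = @view a[:, :, rand(1:32, 32)]
c = reshape(b, 100, 100, 1, 32)

xc = gpu(c)    # the transfer succeeds and returns a proper object
typeof(xc)     # inspecting the type works fine
# display(xc)  # only printing the elements triggers the ArgumentError
```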

FWIW, `Flux.gpu` is essentially `cu`. The difference between `cu` and calling the `CuArray` constructor is that the former *adapts* the object, while the latter always creates a plain `CuArray`. The difference is important:

```julia
julia> typeof(b)

julia> typeof(cu(b))

julia> typeof(cu(CuArray{Float32}(b)))
```
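The distinction can be sketched as follows (an illustration assuming a working CuArrays setup; the exact type parameters printed depend on your package versions):

```julia
using CuArrays

a = rand(Float32, 100, 100, 32)
b = @view a[:, :, rand(1:32, 32)]

# `cu` adapts: it converts the underlying parent Array to a CuArray and
# rewraps it, so the result is still a SubArray, now backed by GPU memory.
bv = cu(b)                 # SubArray wrapping a CuArray

# The constructor materializes: it copies the indexed elements into a
# fresh, contiguous CuArray, dropping the view wrapper entirely.
bc = CuArray{Float32}(b)   # plain CuArray{Float32,3}
```

Because the constructor has to gather the elements selected by the view before copying, wrapped types like `ReshapedArray`-of-`SubArray` can hit much slower paths than a contiguous `Array`.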

Now I see. Thanks for your swift reply.