The performance difference of transferring (SubArray, ReshapedArray) Array to GPU

findmyway · November 20, 2019, 8:29am

I ran into some performance differences when moving data from the CPU to the GPU. The results are listed in the table below:

methods	cpu(x)	gpu(x)	CuArray{Float32}(x)
a = rand(Float32, 100, 100, 32);	62.947 ns (2 allocations: 368 bytes)	220.253 μs (12 allocations: 1.22 MiB)	325.540 μs (12 allocations: 2.44 MiB)
inds = rand(1:32, 32); b = @view a[:, :, inds];	1.816 μs (12 allocations: 2.30 KiB)	311.065 μs (31 allocations: 1.22 MiB)	762.167 μs (12 allocations: 2.44 MiB)
c = reshape(b, 100, 100, 1, 32);	3.110 μs (15 allocations: 2.45 KiB)	316.978 μs (34 allocations: 1.22 MiB)	2.144 ms (14 allocations: 2.44 MiB)

My questions are:

Why is cpu(c) much slower than cpu(b)? I thought they both copy the same data, so there shouldn't be such a difference.
Why is CuArray{Float32}(a) slower than gpu(a)?
Why is gpu(c) similar to gpu(b), while CuArray{Float32}(c) is much slower than CuArray{Float32}(b)?

By the way, gpu(c) will trigger an error:

gpu(c)

ArgumentError: invalid index: 23.0f0 of type Float32

Where should I report this issue? (In which repo?)

Code to reproduce the result in the table above:

using CuArrays, BenchmarkTools, Flux
a = rand(Float32, 100, 100, 32);
inds = rand(1:32, 32)
b = @view a[:, :, inds];
c = reshape(b, 100, 100, 1, 32);

for x in [a, b, c]
    @btime cpu($x);
    @btime gpu($x);
    @btime CuArray{Float32}($x);
end

maleadt · November 20, 2019, 8:44am

Only displaying the result triggers the error; you actually get a proper object. You can report this at the CuArrays repo.

FWIW, Flux.gpu == CuArrays.cu. The difference between cu and calling the CuArray constructor is that the former adapts the object, converting the underlying storage while keeping wrappers such as SubArray, while the latter always materializes a plain CuArray. The difference is important:

julia> typeof(b)
SubArray{Float32,3,Array{Float32,3},Tuple{Base.Slice{Base.OneTo{Int64}},Base.Slice{Base.OneTo{Int64}},Array{Int64,1}},false}

julia> typeof(cu(b))
SubArray{Float32,3,CuArray{Float32,3,Nothing},Tuple{Base.Slice{Base.OneTo{Int64}},Base.Slice{Base.OneTo{Int64}},CuArray{Float32,1,Nothing}},false}

julia> typeof(cu(CuArray{Float32}(b)))
CuArray{Float32,3,Nothing}
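
To make the distinction concrete without a GPU, here is a rough CPU-only analogy using only Base. It is a sketch, not how cu or Adapt.jl is implemented: `rewrap` is a hypothetical helper written for this illustration, with `copy` standing in for the host-to-device conversion of the parent array. The last part shows that indexing with a Float32 raises an ArgumentError of the same kind as the one reported above, which seems consistent with the index vector showing up as CuArray{Float32,1} in typeof(cu(b)).

```julia
# CPU-only analogy: no GPU and no packages beyond Base are needed.
a = rand(Float32, 100, 100, 32)
inds = rand(1:32, 32)
b = @view a[:, :, inds]

# Materializing, analogous to CuArray{Float32}(b):
# the selected slices are gathered into a fresh dense array.
dense = Array{Float32}(b)

# Wrapper-preserving, analogous in spirit to cu(b):
# convert only the parent array, then rebuild the view around it.
# `rewrap` is a hypothetical helper, not part of Base, CuArrays, or Adapt.
rewrap(f, v::SubArray) = view(f(parent(v)), parentindices(v)...)
wrapped = rewrap(copy, b)

# Both hold the same values, but only `dense` is a plain Array.
same_values = dense == wrapped

# Indexing with a floating-point value is rejected by Base.
err = try
    a[1.0f0]
catch e
    e
end
```

Here `dense` is an Array{Float32,3}, while `wrapped` is still a SubArray whose parent keeps the full 100×100×32 shape. That is why the two conversion paths can differ in both cost and resulting type.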

findmyway · November 20, 2019, 8:50am

Now I see. Thanks for your swift reply.

https://github.com/JuliaGPU/CuArrays.jl/issues/506
