CPU/GPU data transfer speed

I’m having trouble pushing data from CPU to GPU memory at a decent rate.

With 512 MB of Float32 data

julia> using CuArrays

julia> data = rand(Float32, 134217728);

everything is nice and fast once the data is on the GPU (Linux, Nvidia V100):

julia> data_on_gpu = cu(data);

julia> @time sum(data_on_gpu);
  0.002941 seconds (75 allocations: 2.578 KiB)

However, when the data transfer is taken into account, things become very slow:

julia> @time sum(cu(data));
  0.198462 seconds (87 allocations: 256.003 MiB)

So I'm only getting a throughput of about 2.5 GB/s.
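(Aside: that figure follows directly from the numbers above; a quick sanity check, dividing the transfer size by the measured time:)

```julia
julia> bytes = 134217728 * sizeof(Float32)  # 512 MiB of Float32
536870912

julia> Base.format_bytes(bytes / 0.198462) * "/s"  # size / measured @time
"2.519 GiB/s"
```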

Simply copying data to the GPU and back also seems very slow, just 0.65 GB/s:

julia> @time Array(cu(data));
  0.754210 seconds (24 allocations: 1.000 GiB)

I wonder if I’m doing something wrong, here? I’d be very glad for some advice.

I’m not sure that what you are timing classifies as throughput rather than latency. What happens if you do a lot of those transfers in a loop (best to put it in a function to avoid the global-scope performance issues)? Also, maybe try using @btime from BenchmarkTools.
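For example, a minimal sketch of what I mean (the function name and repetition count are arbitrary):

```julia
using CuArrays, BenchmarkTools

# Wrapping the transfers in a function avoids benchmarking
# against untyped global variables.
function transfers(data, n)
    for _ in 1:n
        cu(data)  # host-to-device copy (allocates a new CuArray each time)
    end
end

data = rand(Float32, 134217728);
@btime transfers($data, 10)
```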

Hm, but a latency of 0.2 to 0.7 seconds?

Regarding @btime: The @time numbers given were taken after warm-up, of course. But here goes:

julia> @btime sum(cu(data));
  460.202 ms (81 allocations: 512.00 MiB)

julia> @btime Array(cu(data));
  814.640 ms (15 allocations: 1.00 GiB)

What about the penalty for global scope? You can interpolate into @btime with $: @btime sum(cu($data))
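To see what the interpolation does, here is a CPU-only sketch (timings will differ per machine, so I've left them out):

```julia
julia> using BenchmarkTools

julia> x = rand(Float32, 10^6);

julia> @btime sum(x);   # x is an untyped global: dispatch overhead on every call

julia> @btime sum($x);  # $x splices the value in, so only sum itself is timed
```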

You’re copying data to the GPU and back, so that’s 1.4 GB/s?

Anyway, this isn’t what GPUs are good at. Depending on your system and configuration you could get copy speeds of around 5 to 10 GB/s or so, but such transfers are only worth it if you’re doing a sufficient amount of work on the device. Latency is another story; we could use asynchronous copies to hide most of it (available in CUDAdrv, but not employed by CuArrays yet).

What kind of throughput do you expect on that system?

Looks like there’s some performance being left on the table though.

Copy using CuArray constructor, including time to allocate:

julia> using CuArrays, BenchmarkTools

julia> data = rand(Float32, 134217728);

julia> time = @belapsed CuArray($data);

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"1.103 GiB/s"

Leaving out the allocation:

julia> gpu_data = CuArray(data);

julia> time = @belapsed copyto!($gpu_data, $data)
0.238991723

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"2.092 GiB/s"

Using the underlying APIs:

julia> using CUDAdrv

julia> gpu = Mem.alloc(Mem.Device, sizeof(data))
CUDAdrv.Mem.DeviceBuffer(CuPtr{Nothing}(0x00007f65a6000000), 536870912, CuContext(Ptr{Nothing} @0x000000000232c140, true, true))

julia> gpu_ptr = convert(CuPtr{Float32}, gpu)
CuPtr{Float32}(0x00007f65a6000000)

julia> time = @belapsed unsafe_copyto!($gpu_ptr, $(pointer(data)), 134217728)
0.050821662

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"9.838 GiB/s"

And using pinned host memory (this one can’t be the default):

julia> cpu = Mem.alloc(Mem.Host, 134217728*sizeof(Float32))
CUDAdrv.Mem.HostBuffer(Ptr{Nothing} @0x00007f65c6000000, 536870912, CuContext(Ptr{Nothing} @0x000000000232c140, true, true), false)

julia> cpu_ptr = convert(Ptr{Float32}, cpu)
Ptr{Float32} @0x00007f65c6000000

julia> time = @belapsed unsafe_copyto!($gpu_ptr, $cpu_ptr, 134217728)
0.040853038

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"12.239 GiB/s"

I’ll open an issue.


What kind of throughput do you expect on that system?

This would be for a signal processing (DSP) application, so it’s I/O-heavy.

Depending on your system and configuration you could get the copy speed around 5 to 10 GB/s

Yes, I had kinda hoped for something between 10 and 20 GB/s (I think 20 is the PCIe limit?).
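(If it's a PCIe card, the link is PCIe 3.0 x16, so the theoretical peak is a bit under 16 GB/s per direction rather than 20; back-of-the-envelope:)

```julia
# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding, 16 lanes, 8 bits per byte
8 * (128/130) * 16 / 8  # ≈ 15.75 GB/s theoretical peak, per direction
```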

Looks like there’s some performance being left on the table though.

12.239 GiB/s
I’ll open an issue.

Oh, nice! Thanks a lot for looking into this, and for the advice!

Using the underlying APIs: “9.838 GiB/s”
And using pinned host memory (this one can’t be the default): “12.239 GiB/s”

9.838 GiB/s would make me perfectly happy. :slight_smile: You think CuArrays could reach that, with defaults, in principle? That would be awesome.

I think that should be possible. I’ll have a look when I have some time.


With https://github.com/JuliaGPU/GPUArrays.jl/pull/224 and https://github.com/JuliaGPU/CuArrays.jl/pull/530 applied:

Including the allocation:

julia> using BenchmarkTools, CuArrays, CUDAdrv

julia> data = rand(Float32, 134217728);

julia> time = @belapsed CuArray($data)
0.050474946

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"9.906 GiB/s"

Without the allocation:

julia> gpu_data = CuArray(data);

julia> time = @belapsed unsafe_copyto!($(pointer(gpu_data)), $(pointer(data)), $(length(data)))
0.049811851

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"10.038 GiB/s"

Pinned memory:

julia> cpu_data = Mem.alloc(Mem.Host, sizeof(data));

julia> time = @belapsed unsafe_copyto!($(pointer(gpu_data)), $(convert(typeof(pointer(data)), cpu_data)), $(length(data)))
0.041195172

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"12.137 GiB/s"

This might actually have a decent impact on overall training performance :heart_eyes:

Wow, thanks!