CPU/GPU data transfer speed

I’m having trouble pushing data from CPU to GPU memory at a decent rate.

With 512 MB of Float32 data

julia> using CuArrays

julia> data = rand(Float32, 134217728);

everything is nice and fast once the data is on the GPU (Linux, Nvidia V100):

julia> data_on_gpu = cu(data);

julia> @time sum(data_on_gpu);
  0.002941 seconds (75 allocations: 2.578 KiB)

However, when the data transfer is taken into account, things become very slow:

julia> @time sum(cu(data));
  0.198462 seconds (87 allocations: 256.003 MiB)

So I'm only getting a throughput of about 2.5 GB/s.
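(Aside: that figure follows directly from the numbers above; a quick sanity check, dividing the transfer size by the measured time:)

```julia
julia> bytes = 134217728 * sizeof(Float32)  # 512 MiB of Float32
536870912

julia> Base.format_bytes(bytes / 0.198462) * "/s"  # size / measured @time
"2.519 GiB/s"
```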

Simply copying data to the GPU and back also seems very slow, just 0.65 GB/s:

julia> @time Array(cu(data));
  0.754210 seconds (24 allocations: 1.000 GiB)

I wonder if I’m doing something wrong, here? I’d be very glad for some advice.

I’m not sure that what you are timing classifies as throughput rather than latency. What happens if you do a lot of those transfers in a loop (best to put it in a function to avoid the global-scope performance issues)? Also, maybe try using @btime from BenchmarkTools.
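For example, a minimal sketch of what I mean (the function name and repetition count are arbitrary):

```julia
using CuArrays, BenchmarkTools

# Wrapping the transfers in a function avoids benchmarking
# against untyped global variables.
function transfers(data, n)
    for _ in 1:n
        cu(data)  # host-to-device copy (allocates a new CuArray each time)
    end
end

data = rand(Float32, 134217728);
@btime transfers($data, 10)
```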

Hm, but a latency of 0.2 to 0.7 seconds?

Regarding @btime: The @time numbers given were taken after warm-up, of course. But here goes:

julia> @btime sum(cu(data));
  460.202 ms (81 allocations: 512.00 MiB)

julia> @btime Array(cu(data));
  814.640 ms (15 allocations: 1.00 GiB)

What about the penalty for global scope? You can interpolate into @btime with $: @btime sum(cu($data))
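To see what the interpolation does, here is a CPU-only sketch (timings will differ per machine, so I've left them out):

```julia
julia> using BenchmarkTools

julia> x = rand(Float32, 10^6);

julia> @btime sum(x);   # x is an untyped global: dispatch overhead on every call

julia> @btime sum($x);  # $x splices the value in, so only sum itself is timed
```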

You’re copying data to the GPU and back, so that’s 1.4 GB/s?

Anyway, this isn’t what GPUs are good at. Depending on your system and configuration you could get copy speeds of around 5 to 10 GB/s or so, but such transfers are only worth it if you’re doing a sufficient amount of work on the device. Latency is another story; we could use asynchronous copies to hide most of it (available in CUDAdrv, but not employed by CuArrays yet).

What kind of throughput do you expect on that system?

Looks like there’s some performance being left on the table though.

Copy using CuArray constructor, including time to allocate:

julia> using CuArrays, BenchmarkTools

julia> data = rand(Float32, 134217728);

julia> time = @belapsed CuArray($data);

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"1.103 GiB/s"

Leaving out the allocation:

julia> gpu_data = CuArray(data);

julia> time = @belapsed copyto!($gpu_data, $data)
0.238991723

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"2.092 GiB/s"

Using the underlying APIs:

julia> using CUDAdrv

julia> gpu = Mem.alloc(Mem.Device, sizeof(data))
CUDAdrv.Mem.DeviceBuffer(CuPtr{Nothing}(0x00007f65a6000000), 536870912, CuContext(Ptr{Nothing} @0x000000000232c140, true, true))

julia> gpu_ptr = convert(CuPtr{Float32}, gpu)
CuPtr{Float32}(0x00007f65a6000000)

julia> time = @belapsed unsafe_copyto!($gpu_ptr, $(pointer(data)), 134217728)
0.050821662

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"9.838 GiB/s"

And using pinned host memory (this one can’t be the default):

julia> cpu = Mem.alloc(Mem.Host, 134217728*sizeof(Float32))
CUDAdrv.Mem.HostBuffer(Ptr{Nothing} @0x00007f65c6000000, 536870912, CuContext(Ptr{Nothing} @0x000000000232c140, true, true), false)

julia> cpu_ptr = convert(Ptr{Float32}, cpu)
Ptr{Float32} @0x00007f65c6000000

julia> time = @belapsed unsafe_copyto!($gpu_ptr, $cpu_ptr, 134217728)
0.040853038

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"12.239 GiB/s"

I’ll open an issue.


What kind of throughput do you expect on that system?

This would be for a signal processing (DSP) application, so it’s I/O-heavy.

Depending on your system and configuration you could get the copy speed around 5 to 10 GB/s

Yes, I had kinda hoped for something between 10 and 20 GB/s (I think 20 is the PCIe limit?).
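(If it's a PCIe card, the link is PCIe 3.0 x16, so the theoretical peak is a bit under 16 GB/s per direction rather than 20; back-of-the-envelope:)

```julia
# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding, 16 lanes, 8 bits per byte
8 * (128/130) * 16 / 8  # ≈ 15.75 GB/s theoretical peak, per direction
```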

Looks like there’s some performance being left on the table though.

12.239 GiB/s
I’ll open an issue.

Oh, nice! Thanks a lot for looking into this, and for the advice!

Using the underlying APIs: “9.838 GiB/s”
And using pinned host memory (this one can’t be the default): “12.239 GiB/s”

9.838 GiB/s would make me perfectly happy. :slight_smile: You think CuArrays could reach that, with defaults, in principle? That would be awesome.

I think that should be possible. I’ll have a look when I have some time.


With https://github.com/JuliaGPU/GPUArrays.jl/pull/224 and https://github.com/JuliaGPU/CuArrays.jl/pull/530 applied:

Including the allocation:

julia> using BenchmarkTools, CuArrays, CUDAdrv

julia> data = rand(Float32, 134217728);

julia> time = @belapsed CuArray($data)
0.050474946

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"9.906 GiB/s"

Without the allocation:

julia> gpu_data = CuArray(data);

julia> time = @belapsed unsafe_copyto!($(pointer(gpu_data)), $(pointer(data)), $(length(data)))
0.049811851

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"10.038 GiB/s"

Pinned memory:

julia> cpu_data = Mem.alloc(Mem.Host, sizeof(data));

julia> time = @belapsed unsafe_copyto!($(pointer(gpu_data)), $(convert(typeof(pointer(data)), cpu_data)), $(length(data)))
0.041195172

julia> Base.format_bytes(sizeof(data) / time) * "/s"
"12.137 GiB/s"

This might actually have a decent impact on overall training performance :heart_eyes:

Wow, thanks!