I’m having trouble pushing data from CPU to GPU memory at a decent rate.
With 512 MB of Float32 data
julia> using CuArrays
julia> data = rand(Float32, 134217728);
everything is nice and fast once the data is on the GPU (Linux, Nvidia V100):
julia> data_on_gpu = cu(data);
julia> @time sum(data_on_gpu);
0.002941 seconds (75 allocations: 2.578 KiB)
However, when the data transfer is taken into account, things become very slow:
julia> @time sum(cu(data));
0.198462 seconds (87 allocations: 256.003 MiB)
That works out to a throughput of only about 2.5 GB/s.
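To see how much of that time is the host-to-device copy itself, the upload can be timed on its own. This is only a minimal sketch: it assumes the installed CuArrays version provides the `CuArrays.@sync` macro, and the name `gpu_buf` is made up for illustration.

```julia
using CuArrays

data = rand(Float32, 134217728)

# Allocate the device buffer once, so allocation is not part of the measurement.
gpu_buf = CuArray{Float32}(undef, length(data))

# Warm-up run: the first call also pays for compilation and CUDA initialisation.
# Assumption: CuArrays.@sync exists in this CuArrays version and waits for the GPU.
CuArrays.@sync copyto!(gpu_buf, data)

# Time only the host-to-device copy into the pre-allocated buffer.
t = @elapsed begin
    CuArrays.@sync copyto!(gpu_buf, data)
end

println("host-to-device: ", round(sizeof(data) / 2^30 / t, digits = 2), " GiB/s")
```

If this number is much higher than the one from `@time sum(cu(data))`, the difference would point at allocation or first-call overhead rather than the copy itself.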
Simply copying the data to the GPU and back also seems very slow, roughly 0.65 GB/s if the 512 MB is counted once (it crosses the bus twice):
julia> @time Array(cu(data));
0.754210 seconds (24 allocations: 1.000 GiB)
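For reference, this is roughly how I arrive at those figures. It is a small sketch that counts the 512 MiB payload once per measurement; the helper `gib_per_s` is just a name made up for illustration.

```julia
# Throughput implied by the timings above, counting the payload once per measurement.
n = 134217728                    # number of Float32 elements
nbytes = n * sizeof(Float32)     # 536870912 bytes = 512 MiB

# Hypothetical helper: bytes and seconds in, GiB/s out.
gib_per_s(bytes, seconds) = bytes / 2^30 / seconds

upload_and_sum = gib_per_s(nbytes, 0.198462)   # @time sum(cu(data))
round_trip     = gib_per_s(nbytes, 0.754210)   # @time Array(cu(data))

println(round(upload_and_sum, digits = 2), " GiB/s")
println(round(round_trip, digits = 2), " GiB/s")
```

Counting both directions of the round trip would double the second figure, which is still well below what I expected.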
Am I doing something wrong here? I’d be very glad for some advice.