CUDAnative/CuArrays: performance regression for memcopy code

samo · April 17, 2019, 5:54pm

Hi all,

I just updated CUDAnative, CUDAdrv, CuArrays and GPUArrays to their master versions. The following test of sustained device-to-device memory copy performance now shows a regression from about 559 GB/s (with CUDAnative v0.9.1 and without CuArrays [1]) to only about 121 GB/s (with CuArrays#master, now using CuArrays.CuArray for the device arrays [2]). I ran the tests on an NVIDIA Tesla P100; 559 GB/s is very good performance and matches what a corresponding CUDA code achieves. Here is the test:

using CUDAdrv, CUDAnative

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return nothing
end

nx = 128*1024^2
nt = 10000
warmup = 10
A = zeros(nx);
B = rand(nx);
A = CuArray(A);
B = CuArray(B);
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)

for it = 1:nt+warmup
    if (it == warmup+1) global t0 = time() end
    @cuda blocks=nblocks threads=nthreads memcopy!(A, B);
end
time_s = time() - t0;

narrays = 2
GBs = (nt-warmup)/time_s/1024^3*nx*sizeof(Float64)*narrays;
println("time: $time_s; GB/s: $GBs")

Here is the output of a run with the old environment with CUDAnative v0.9.1 [1]:

> ~/julia/julia-1.0.2/bin/julia memcopy.jl 
time: 35.73286414146423; GB/s: 559.1491328794802
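
As a sanity check on the reported figure, the throughput formula from the script can be replayed on the CPU with the measured time plugged in. This is only a sketch of the arithmetic: the names bytes_per_launch and GBs_nt are introduced here for illustration and are not part of the original script.

```julia
# Parameters copied from the benchmark script above.
nx      = 128*1024^2      # elements per array
nt      = 10000           # value of nt in the script
warmup  = 10              # untimed warm-up launches
narrays = 2               # one array read, one array written

# Bytes moved per launch: 2 arrays * 128 Mi elements * 8 bytes = 2 GiB.
bytes_per_launch = nx * sizeof(Float64) * narrays

# Wall time reported for the CUDAnative v0.9.1 run.
time_s = 35.73286414146423

# The script's formula; dividing by 1024^3 gives binary units (GiB/s).
GBs = (nt - warmup) / time_s / 1024^3 * bytes_per_launch

# Same formula, counting all nt launches issued after t0 is set
# (GBs_nt is a hypothetical name used only in this sketch).
GBs_nt = nt / time_s / 1024^3 * bytes_per_launch

println("script formula: $GBs; counting nt launches: $GBs_nt")
```

The timed part of the loop covers iterations warmup+1 through nt+warmup, which is nt launches, so the (nt-warmup) factor undercounts by roughly 0.1%. That is negligible next to the regression discussed here.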

Here is the output of a run with the new environment with CUDAnative#master, which now also requires CuArrays.CuArray [2]:

> ~/julia/julia-1.1.0/bin/julia memcopy_new.jl 
time: 165.3068549633026; GB/s: 120.86613107747692

Note that the only difference between the two scripts, memcopy.jl and memcopy_new.jl, is the addition of CuArrays to the using statement:

> diff memcopy.jl memcopy_new.jl
1c1
< using CUDAdrv, CUDAnative
---
> using CUDAdrv, CUDAnative, CuArrays

I would imagine that the observed performance regression is due to the fact that this code now requires CuArray from CuArrays, while before it used CuArray from CUDAdrv. Can you tell me how to adapt the code to get back the performance that we expect?

Thank you very much!

Sam

[1] > ~/julia/julia-1.0.2/bin/julia
(v1.0) pkg> status
Status ~/.julia/environments/v1.0/Project.toml
[c5f51814] CUDAdrv v0.8.6
[be33ccc6] CUDAnative v0.9.1

[2] > ~/julia/julia-1.1.0/bin/julia
(v1.1) pkg> status
Status ~/.julia/environments/v1.1/Project.toml
[c5f51814] CUDAdrv v3.0.0 #master (https://github.com/JuliaGPU/CUDAdrv.jl.git)
[be33ccc6] CUDAnative v2.1.0 #master (https://github.com/JuliaGPU/CUDAnative.jl.git)
[3a865a2d] CuArrays v1.0.2 #master (https://github.com/JuliaGPU/CuArrays.jl.git)
[0c68f7d7] GPUArrays v0.7.0 #master (https://github.com/JuliaGPU/GPUArrays.jl.git)

maleadt · April 17, 2019, 6:56pm

Use @inbounds: CUDAnative’s CuDeviceArray now supports bounds checking, so this is unrelated to CUDAdrv.CuArray vs. CuArrays.jl. Also, use the @device_code_... macros (e.g. @device_code_llvm, cf. @code_llvm) to inspect the generated code; the difference with and without @inbounds is huge.
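
A minimal sketch of what this could look like for the kernel from the original post, assuming the CUDAnative#master / CuArrays#master environment listed in [2]. The kernel name memcopy_inbounds! and the smaller array size are illustrative choices, not taken from the thread.

```julia
using CUDAdrv, CUDAnative, CuArrays

# Same kernel as in the original post, with bounds checking disabled
# for the array accesses (memcopy_inbounds! is a hypothetical name).
function memcopy_inbounds!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    @inbounds A[ix] = B[ix]
    return nothing
end

# Assumption: a small problem size, enough to inspect the generated code.
# nx is a multiple of nthreads, so every thread index falls inside 1:nx
# and skipping the bounds check is safe.
nx = 1024^2
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)
A = CuArray(zeros(nx))
B = CuArray(rand(nx))

# Launch the kernel and print the LLVM IR generated for it.
@device_code_llvm @cuda blocks=nblocks threads=nthreads memcopy_inbounds!(A, B)
```

Running the same call on the version of the kernel without @inbounds lets you compare the two listings side by side.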

samo · April 18, 2019, 7:56am

Thanks @maleadt! That was it! I added @inbounds in the kernel and now I get the expected performance:

> ~/julia/julia-1.1.0/bin/julia memcopy_new_inbounds.jl 
time: 35.70586609840393; GB/s: 559.5719186571731

Thanks a lot also for the hint about @device_code_*! The code obtained with @device_code_llvm when using @inbounds is about one third the size of the code obtained without it.

maleadt · April 18, 2019, 7:58am

I really need to spend some days on writing documentation… Do note that with @device_code_ptx, most of the code comes from the runtime libraries that are linked in; the actual kernel function is only slightly larger, but with some costly branches in the hot path.
