CUDAnative/CuArrays: performance regression for memcopy code

Hi all,

I just updated CUDAnative, CUDAdrv, CuArrays and GPUArrays to their master versions. The following test of sustained device-to-device memory copy performance now shows a regression from about 559 GB/s (with CUDAnative v0.9.1 and without CuArrays [1]) to only about 121 GB/s (with CuArrays#master, now using CuArrays.CuArray for the device arrays [2]). I ran the tests on an NVIDIA Tesla P100; 559 GB/s is very good performance and matches what a corresponding CUDA code achieves. Here is the test:

using CUDAdrv, CUDAnative

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return nothing
end

nx = 128*1024^2
nt = 10000
warmup = 10
A = zeros(nx);
B = rand(nx);
A = CuArray(A);
B = CuArray(B);
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)

for it = 1:nt+warmup
    if (it == warmup+1) global t0 = time() end
    @cuda blocks=nblocks threads=nthreads memcopy!(A, B);
end
time_s = time() - t0;

narrays = 2
GBs = (nt-warmup)/time_s/1024^3*nx*sizeof(Float64)*narrays;
println("time: $time_s; GB/s: $GBs")

Here is the output of a run with the old environment with CUDAnative v0.9.1 [1]:

> ~/julia/julia-1.0.2/bin/julia memcopy.jl 
time: 35.73286414146423; GB/s: 559.1491328794802
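
The reported figure can be reproduced from the timing alone. The sketch below is only an illustration: the function name effective_throughput and its keyword arguments are not part of the original script. Each launch reads one array and writes one array of 128*1024^2 Float64 values, i.e. 2 GiB of traffic, and the division by 1024^3 means the unit is strictly GiB/s. It follows the script's convention of counting nt - warmup launches, although the timed part of the loop covers nt launches; the difference is about 0.1%.

```julia
# Effective memory throughput in GiB/s for `iters` kernel launches, each of
# which touches `narrays` arrays of `nx` elements of type `T`.
# (Illustrative helper, not part of the original memcopy.jl.)
function effective_throughput(nx, iters, time_s; T=Float64, narrays=2)
    bytes_per_iter = nx * sizeof(T) * narrays   # one read + one write per element
    return iters * bytes_per_iter / time_s / 1024^3
end

nx = 128*1024^2
nt = 10000
warmup = 10
time_s = 35.73286414146423                      # time reported in the run above

GBs = effective_throughput(nx, nt - warmup, time_s)
println("GB/s: $GBs")
```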

Here is the output of a run with the new environment with CUDAnative#master, requiring now also CuArrays.CuArray [2]:

> ~/julia/julia-1.1.0/bin/julia memcopy_new.jl 
time: 165.3068549633026; GB/s: 120.86613107747692

Note that the only difference between the two scripts, memcopy.jl and memcopy_new.jl, is the addition of CuArrays to the using statement:

> diff memcopy.jl memcopy_new.jl
< using CUDAdrv, CUDAnative
> using CUDAdrv, CUDAnative, CuArrays

I imagine the observed regression arises because this code now requires CuArray from CuArrays, whereas before it used CuArray from CUDAnative. Can you tell me how to adapt the code to get back the performance we expect?

Thank you very much!


[1] > ~/julia/julia-1.0.2/bin/julia
(v1.0) pkg> status
Status ~/.julia/environments/v1.0/Project.toml
[c5f51814] CUDAdrv v0.8.6
[be33ccc6] CUDAnative v0.9.1

[2] > ~/julia/julia-1.1.0/bin/julia
(v1.1) pkg> status
Status ~/.julia/environments/v1.1/Project.toml
[c5f51814] CUDAdrv v3.0.0 #master (
[be33ccc6] CUDAnative v2.1.0 #master (
[3a865a2d] CuArrays v1.0.2 #master (
[0c68f7d7] GPUArrays v0.7.0 #master (

Use @inbounds: CUDAnative's CuDeviceArray now supports bounds checking (so this is unrelated to CUDAdrv.CuArray vs CuArrays.jl). Also, use the @device_code_... macros (e.g. @device_code_llvm, cf. @code_llvm) to inspect the generated code; the difference with and without @inbounds is huge.

Thanks @maleadt! That was it! I added @inbounds in the kernel and now I get the expected performance:

> ~/julia/julia-1.1.0/bin/julia memcopy_new_inbounds.jl 
time: 35.70586609840393; GB/s: 559.5719186571731
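
For reference, a minimal sketch of what the kernel in memcopy_new_inbounds.jl might look like, assuming @inbounds was placed directly on the indexing line and assuming the package versions listed in [2]. The array size is reduced here only to make a quick correctness check practical; it is not the benchmark configuration.

```julia
using CUDAdrv, CUDAnative, CuArrays

# Same kernel as in the original post, with @inbounds on the indexing
# expression so that no bounds check is emitted in the device code.
# Skipping the check is only safe because the launch below starts exactly
# length(A) threads.
function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    @inbounds A[ix] = B[ix]
    return nothing
end

nx = 1024^2                        # small size, chosen only for a quick check
A = CuArray(zeros(nx))
B = CuArray(rand(nx))
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)   # nx is a multiple of nthreads here

@cuda blocks=nblocks threads=nthreads memcopy!(A, B)

# Array(...) copies the data back to the host, which also waits for the kernel.
copied = Array(A) == Array(B)
println("copy correct: $copied")
```

If nx were not a multiple of nthreads, the last block would index past the end of the arrays, so an explicit guard such as `if ix <= length(A)` would be needed before dropping the bounds check.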

Thanks a lot also for the hints with @device_code_*! The code obtained with @device_code_llvm when using @inbounds is about one third the size of the code obtained without it.

I really need to spend some days on writing documentation… 🙂 Do note that with @device_code_ptx, most of the code comes from the runtime libraries that are linked in; the actual kernel function is only slightly larger without @inbounds, but it has some costly branches in the hot path.
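
The comparison discussed above can be sketched as follows, assuming the package versions listed in [2]. The two kernel names are chosen here for illustration only. Each kernel is inspected on its first launch, which is when it gets compiled, and the generated LLVM IR is printed to standard output.

```julia
using CUDAdrv, CUDAnative, CuArrays

# Kernel as in the original post: the indexing is bounds-checked.
function memcopy_checked!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return nothing
end

# Same kernel with the bounds check disabled.
function memcopy_inbounds!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    @inbounds A[ix] = B[ix]
    return nothing
end

nx = 1024                          # one block of 1024 threads is enough here
A = CuArray(zeros(nx))
B = CuArray(rand(nx))

# Print the LLVM IR generated for each variant while launching it.
@device_code_llvm @cuda threads=nx memcopy_checked!(A, B)
@device_code_llvm @cuda threads=nx memcopy_inbounds!(A, B)

# Both launches performed the copy; verify the result on the host.
copied = Array(A) == Array(B)
```

Replacing @device_code_llvm with @device_code_ptx in the same position shows the PTX instead; as noted above, much of that output comes from the linked runtime libraries rather than from the kernel itself.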