CUDAnative: register host memory for pinned memory access

@maleadt, I am sorry that it was not fully clear. My idea was that the function register would do the equivalent of a call to cudaHostRegister(..., cudaHostRegisterMapped) and, in addition, directly give back a device pointer, i.e. also include the equivalent of cudaHostGetDevicePointer(...) (compare with the CUDA C code in the 2nd paragraph of the topic description).
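Spelled out, the mapping I had in mind between those two CUDA C calls and the CUDAdrv.jl Mem API is roughly the following (just a sketch with a dummy host array; the complete code is further down):

using CUDAdrv

A = zeros(1024)  # some existing host buffer

# equivalent of cudaHostRegister(ptr, bytes, cudaHostRegisterMapped):
# pin the host memory and map it into the device address space
A_buf = Mem.register(Mem.Host, pointer(A), sizeof(A), Mem.HOSTREGISTER_DEVICEMAP)

# equivalent of cudaHostGetDevicePointer(...): obtain a device pointer
# through which a kernel can access the pinned host buffer directly
A_gpuptr = convert(CuPtr{Float64}, A_buf)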
Moreover, my aim has been to do the equivalent of the CUDA C code in [1] (also in the topic description), which makes it possible to test the sustained performance of pinned memory access from a GPU kernel. In other words, my objective has been to pin an existing host buffer, map it to device memory, get a device pointer, and use this pointer in a GPU kernel to do DMA of the host buffer. Thanks to your help I could do it now, and the following Julia code therefore does the same as the CUDA C code in [1] in the topic description:

using CUDAdrv, CUDAnative, CuArrays

function copy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    @inbounds A[ix] = B[ix]
    return nothing
end

function register(A)
    A_buf = Mem.register(Mem.Host, pointer(A), sizeof(A), Mem.HOSTREGISTER_DEVICEMAP)
    A_gpuptr = convert(CuPtr{Float64}, A_buf)
    return unsafe_wrap(CuArray, A_gpuptr, size(A));
end

warmup = 3
nx = 512*512*1024; #1024^2  512*512*1024
nt = 10
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)
A = zeros(nx);
B = rand(nx);
A_d = register(A);
B = CuArray(B);

# Copy from host to device.
for it = 1:nt+warmup
    if (it == warmup+1) global t0 = time() end
    @cuda blocks=nblocks threads=nthreads copy!(B, A_d);
    CUDAdrv.synchronize();
end
time_s = time() - t0;  
ntransfers = 1  #Number of host-device transfers per iteration
GBs = 1.0/1024^3*nt*nx*sizeof(Float64)*ntransfers/time_s;
println("h2d: time: $time_s; GB/s: $GBs")

# Copy from device to host.
for it = 1:nt+warmup
    if (it == warmup+1) global t0 = time() end
    @cuda blocks=nblocks threads=nthreads copy!(A_d, B);
    CUDAdrv.synchronize();
end
time_s = time() - t0;  
ntransfers = 1  #Number of host-device transfers per iteration
GBs = 1.0/1024^3*nt*nx*sizeof(Float64)*ntransfers/time_s;
println("d2h: time: $time_s; GB/s: $GBs")

Here are some example runs showing the performance obtained:

  1. CUDA C:
> ../cu/a.out 
h2d: time: 1.7783; GB/s: 11.2470
d2h: time: 1.7629; GB/s: 11.3448
> ../cu/a.out 
h2d: time: 1.7776; GB/s: 11.2511
d2h: time: 1.7627; GB/s: 11.3461
> ../cu/a.out 
h2d: time: 1.7783; GB/s: 11.2466
d2h: time: 1.7628; GB/s: 11.3458
  2. Julia:
> jul -O3 h_d_transfers.jl
h2d: time: 2.4673500061035156; GB/s: 8.105862545048632
d2h: time: 1.753148078918457; GB/s: 11.408049462848737
> jul -O3 h_d_transfers.jl
h2d: time: 2.4986469745635986; GB/s: 8.004332025932996
d2h: time: 1.7535121440887451; GB/s: 11.405680917250494
> jul -O3 h_d_transfers.jl
h2d: time: 2.4407520294189453; GB/s: 8.194195788402673
d2h: time: 1.7531590461730957; GB/s: 11.40797809739923

Note that all the parameters were the same for CUDA C and Julia. We can observe that the device-to-host transfer speed (d2h) is a tiny bit better with Julia than with CUDA C. However, the host-to-device transfer speed (h2d) with Julia is significantly lower (about 27% less) than with CUDA C. Moreover, the Julia h2d experiments show a much higher variation in performance than the other experiments (Julia d2h, CUDA C h2d and CUDA C d2h). Can you tell me why the Julia h2d experiments achieve significantly lower performance and show a higher variation? Do you know how to fix this?
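In case it helps to narrow this down, here is a minimal sketch of how the pure kernel time could be measured with CUDA events (assuming CUDAdrv.jl's CuEvent, record, synchronize and elapsed, and reusing the copy! kernel and the arrays from the code above); this should show whether the difference comes from the mapped-memory accesses themselves or from launch/synchronization overhead on the Julia side:

# sketch: measure only the kernel execution time with CUDA events
function time_kernel(A, B, nblocks, nthreads)
    start_evt = CuEvent()
    stop_evt  = CuEvent()
    record(start_evt)
    @cuda blocks=nblocks threads=nthreads copy!(A, B)
    record(stop_evt)
    synchronize(stop_evt)
    return elapsed(start_evt, stop_evt)  # time between the two events
end

t_kernel = time_kernel(B, A_d, nblocks, nthreads)  # h2d case
println("pure kernel time (h2d): $t_kernel")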

Thanks!!

Sam
