CUDAnative: register host memory for pinned memory access

maleadt · April 19, 2019, 7:36pm

The register call in my example is cudaHostRegister, and the conversion to a CuPtr performs the cudaHostGetDevicePointer, so that should be enough for you to implement your application. I don’t have time to look into your CUDA C code myself; you should use the CUDA profiler to figure out what’s wrong. It has an API trace mode that prints all API calls, so you can see if there are mismatches.

Also, AFAIK although cudaHostRegister gives you page-locked memory, using cudaHostGetDevicePointer for zero-copy memory will not yield high performance: it will make the GPU access host memory directly. You probably want an async memcpy to get DMA transfers.

samo · April 19, 2019, 7:47pm

I wrote the register function by copying the code from your example exactly, and it works, so that is no longer the issue. The question now is just why the performance with Julia is not as expected for h2d transfers. Could anyone else give advice?

Thanks,

Sam

maleadt · April 19, 2019, 9:09pm

I gave you some pointers; please analyze your code yourself with e.g. nvprof. There are no straightforward answers at this point anymore, as the API interactions are mostly identical. Case in point: on my system the Julia version is faster. Just use the existing memcpy, which should be much faster; why would you even want to implement this yourself?
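
For readers following along, here is a minimal sketch of what "use the existing memcpy" could look like. It assumes CUDA.jl (the successor to CUDAdrv/CUDAnative/CuArrays, which is not what the 2019 posts used), and the names `gib_per_s` and `bench_h2d` are made up for illustration. It times `copyto!` from a plain host `Array` into a `CuArray` instead of launching a zero-copy kernel. The host array here is ordinary pageable memory; pinning it is not shown.

```julia
using CUDA  # assumption: CUDA.jl rather than the CUDAdrv/CuArrays packages of 2019

# Throughput in GiB/s for `nbytes` moved in `seconds` (hypothetical helper).
gib_per_s(nbytes, seconds) = nbytes / 1024^3 / seconds

# Time `nt` host-to-device copies of `A` using the library's own memcpy
# (`copyto!`) rather than a hand-written copy kernel over mapped host memory.
function bench_h2d(A::Array{Float64}; nt = 10, warmup = 3)
    B_d = CuArray{Float64}(undef, size(A))
    t0 = time()
    for it in 1:(nt + warmup)
        it == warmup + 1 && (t0 = time())  # restart the clock after warmup
        copyto!(B_d, A)                    # host -> device copy
        CUDA.synchronize()                 # wait until the copy has finished
    end
    return gib_per_s(nt * sizeof(A), time() - t0)
end

# Only run the benchmark when a working GPU setup is available.
if CUDA.functional()
    A = rand(1024^2)
    println("h2d via copyto!: ", bench_h2d(A), " GiB/s")
end
```

Comparing this number against the kernel-based figure from the benchmark in this thread should show whether the zero-copy path is the bottleneck.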

samo · April 21, 2019, 10:42am

@maleadt, first of all, I would like to thank you for the amazing work that you are doing with CUDAdrv / CUDAnative / CuArrays. I truly believe that with packages like yours Julia can enable a new era of supercomputing, where the “two language problem” can be solved and prototype and production code can become one and the same. I am fully aware that developing and supporting these packages at the same time is a herculean task. So, when I ask other people for advice, it is in no way to express unhappiness with your support (which is BTW incredibly fast and efficient!), but rather to try to get other people involved in order to lower the load on you. Thanks again for everything. I will see if I can figure out something with nvprof.

maleadt · April 22, 2019, 6:53am

Thanks. My comment wasn’t ill-intended, I just meant to say that for such a specific problem you probably can’t rely on other people to know what’s up (without them actually profiling the code). So it would be good to do a little digging first and report that here.

_micro · September 2, 2021, 8:08am

samo:

function copy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    @inbounds A[ix] = B[ix]
    return nothing
end

function register(A)
    A_buf = Mem.register(Mem.Host, pointer(A), sizeof(A), Mem.HOSTREGISTER_DEVICEMAP)
    A_gpuptr = convert(CuPtr{Float64}, A_buf)
    return unsafe_wrap(CuArray, A_gpuptr, size(A));
end

warmup = 3
nx = 512*512*1024; #1024^2  512*512*1024
nt = 10
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)
A = zeros(nx);
B = rand(nx);
A_d = register(A);
B = CuArray(B);

# Copy from host to device.
for it = 1:nt+warmup
    if (it == warmup+1) global t0 = time() end
    @cuda blocks=nblocks threads=nthreads copy!(B, A_d);
    CUDAdrv.synchronize();
end
time_s = time() - t0;  
ntransfers = 1  #Number of host-device transfers per iteration
GBs = 1.0/1024^3*nt*nx*sizeof(Float64)*ntransfers/time_s;
println("h2d: time: $time_s; GB/s: $GBs")

# Copy from device to host.
for it = 1:nt+warmup
    if (it == warmup+1) global t0 = time() end
    @cuda blocks=nblocks threads=nthreads copy!(A_d, B);
    CUDAdrv.synchronize();
end
time_s = time() - t0;  
ntransfers = 1  #Number of host-device transfers per iteration
GBs = 1.0/1024^3*nt*nx*sizeof(Float64)*ntransfers/time_s;
println("d2h: time: $time_s; GB/s: $GBs")

When I try to run this example on Julia 1.6.2 using CUDA.jl > 3.4.0, I get the error message: “ERROR: Could not identify the buffer type; are you passing a valid CUDA pointer to unsafe_wrap?”. In earlier versions it runs fine after replacing CUDAdrv with CUDA. What am I doing wrong?

samo · September 3, 2021, 8:42am

@_micro: I see that you opened a CUDA.jl bug report after the reply from @maleadt on Slack, so I am linking the issue here:
https://github.com/JuliaGPU/CUDA.jl/issues/1125
