CUDAnative: register host memory for pinned memory access

Hi all,

how do I register host memory with CUDAnative in order to enable pinned memory access directly from a GPU kernel?

In other words, I wonder how in Julia one can do the equivalent of the following CUDA code:

    float *hbuf_d, *hbuf_h;
    int nx = 512*512*1024;

    // Allocate a host buffer, register it, get a device pointer from it.
    hbuf_h = (float*)malloc((size_t)nx*sizeof(float));
    cudaHostRegister((float*)hbuf_h, (size_t)nx*sizeof(float), cudaHostRegisterMapped);
    cudaHostGetDevicePointer((float**)&hbuf_d, (float*)hbuf_h, 0);

This is a snippet from a CUDA code to test the sustained performance of pinned memory access from a GPU kernel (the results are very similar to those of the NVIDIA ‘bandwidthTest’); I add it at the end of this post [1].

Thank you very much!


[1] CUDA code to test the sustained performance of pinned memory access from a GPU kernel:

#include <sys/time.h>
#include <stdio.h>

double get_time(){
    struct timeval  tp;
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );

void cuda_finalize(){
    cudaError_t ce = cudaGetLastError();  
    if(ce != cudaSuccess){ 
        printf("ERROR ON GPU: %s\n", cudaGetErrorString(ce));

#define DAT double
//#define DAT float

__global__ void copy_h2d(DAT* hbuf, DAT* dbuf){
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    dbuf[ix] = hbuf[ix];

__global__ void copy_d2h(DAT* hbuf, DAT* dbuf){
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    hbuf[ix] = dbuf[ix];

int main(int argc, char *argv[]){
    int warmup = 3;
    int nx     = 512*512*1024;     // 512*512*1024*8B = 2GB
    int nt     = 10;               // 10*2GB = 20GB (total amount of data transfer)
    double t0, time_s, GBs;
    dim3 block, grid;
    DAT *hbuf_d, *hbuf_h, *dbuf_d;
    block.x = 1024;  grid.x = nx/block.x;
    block.y =    1;  grid.y =    1; 
    block.z =    1;  grid.z =    1;


    // Allocate host buffer, register it, get a device pointer from it.
    hbuf_h = (DAT*)malloc((size_t)nx*sizeof(DAT));
    cudaHostRegister((DAT*)hbuf_h, (size_t)nx*sizeof(DAT), cudaHostRegisterMapped); 
    cudaHostGetDevicePointer((DAT**)&hbuf_d, (DAT*)hbuf_h, 0);

    // Allocate device buffer.

    // Initialize the host buffer.
    for(int ix=0; ix<nx; ix++){
        hbuf_h[ix] = 1.0;

    // Copy from host to device.
    for (int it=0; it<nt+warmup; it++){
        if (it==warmup) t0 = get_time();
        copy_h2d<<<grid,block>>>(hbuf_d, dbuf_d); cudaDeviceSynchronize();
    time_s = get_time() - t0;
    GBs    = 1.0/1024/1024/1024*nt*nx*sizeof(DAT)/time_s;
    printf("h2d: time: %.4f GB/s: %.4f\n", time_s, GBs);

    // Copy from device to host.
    for (int it=0; it<nt+warmup; it++){
        if (it==warmup) t0 = get_time();
        copy_d2h<<<grid,block>>>(hbuf_d, dbuf_d); cudaDeviceSynchronize();
    time_s = get_time() - t0;
    GBs    = 1.0/1024/1024/1024*nt*nx*sizeof(DAT)/time_s;
    printf("d2h: time: %.4f; GB/s: %.4f\n", time_s, GBs);

    free(hbuf_h); cudaFree(dbuf_d);

Working on it, see

Thanks, that is perfect!
It looks like the PR is nearly ready to be merged. Can you give an estimate in how much time you will be able to merge it? I am asking as I urgently need this feature and if it still took many weeks, then I would need to search for a workaround meanwhile…
Thanks again!

You can always check out that specific branch using Pkg, and even put that in your manifest to have Pkg.instantiate check it out automatically. That said, I might have some time to look at it today and merge it in the master branch of CUDAdrv, but an actual release would take some more time to make sure dependent packages are updated / tagged / … (which could take a while since the package ecosystem is currently being migrated). Won’t take “many weeks” though.

1 Like

Great, thanks @maleadt!

I merged the PRs; be sure to use GPUArrays/CuArrays/CUDAnative/CUDAdrv from master if you want to try it out.

That was fast! Thanks!!

@maleadt: I could install the new version from the master. However, now it looks to me like at the moment it is not possible to create arrays on the device as simple as before. In the past I used to do:

using CUDAdrv, CUDAnative
nx = 128*1024^2
A = zeros(nx);
A = CuArray(A);

Can you tell me how I am supposed to do it now?

And do you have an example how the pinned memory access can be done best?



PS: the installed packages are:

(v1.0) pkg> status
    Status `~/.julia/environments/v1.0/Project.toml`
  [c5f51814] CUDAdrv v2.1.0 #master (
  [be33ccc6] CUDAnative v2.1.0 #master (

That is most definitely still possible. You should use the CuArrays.jl package though. Please post some info on the actual issue, without it it’s hard to help.

Pinned memory example is in the PR.

I would like to use only CUDAdrv and CUDAnative as I don’t need the functionality of the package CuArrays. Here is the example with the error message I get:

julia> using CUDAdrv, CUDAnative

julia> nx = 128*1024^2

julia> A = zeros(nx);

julia> A = CuArray(A);
ERROR: UndefVarError: CuArray not defined
 [1] top-level scope at none:0

This is no surprise when looking at the exported symbols:

julia> names(CUDAnative)
53-element Array{Symbol,1}:

However, also the following fails:

julia> A = CUDAnative.CuArray(A);
ERROR: UndefVarError: CuArray not defined
 [1] getproperty(::Module, ::Symbol) at ./sysimg.jl:13
 [2] top-level scope at none:0

Does the function CuArray not exist anymore in CUDAnative.jl?

No. It used to exist in CUDAdrv, but we removed that functionality in favor of CuArrays. You can only allocate raw buffers with CUDAdrv now.

I will miss that probably :frowning: . Would you mind to give a quick example of how the above allocation would best be done in CUDAnative?

Does this mean that CUDAnative should typically be used together with CuArrays? (or do they conflict in any sense?)

You can’t do that in CUDAnative either, that package is for writing Julia kernels that executes on the GPU.

With CuArrays.jl, your code just works like you posted it. Why don’t you want to use that package? All CUDA packages (CUDAapi, CUDAdrv, CUDAnative, CuArrays) are compatible and each take care of certain parts of the stack, and that’s why it didn’t make sense to keep another array type in CUDAdrv.jl while there was already CuArrays.jl.

I mean, it is literally just importing CuArrays.jl and you get the familiar functionality back, just better and with much more features:

julia> using CUDAdrv, CUDAnative
[ Info: Recompiling stale cache file /home/tbesard/Julia/depot/compiled/v1.1/CUDAnative/4Zu2W.ji for CUDAnative [be33ccc6-a3ff-5ff2-a52e-74243cff1e17]

julia> nx = 128*1024^2

julia> A = zeros(nx);

julia> A = CuArray(A);
ERROR: UndefVarError: CuArray not defined
 [1] top-level scope at none:0

julia> using CuArrays
[ Info: Recompiling stale cache file /home/tbesard/Julia/depot/compiled/v1.1/CuArrays/7YFE0.ji for CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]

julia> A = CuArray(A);


OK, I see; thanks for the explanations. If these packages work seamlessly together, then that’s perfect.

They’re supposed to. Don’t expect it to be the case when you work with master branches though, in which case you might need to use the master branch of dependent packages as well.

Unfortunately, I get the following error when using the master version of CuArrays:

> julia
julia> using CuArrays
[ Info: Recompiling stale cache file /users/omlins/.julia/compiled/v1.0/CuArrays/7YFE0.ji for CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
Internal error: encountered unexpected error in runtime:
BoundsError(a=Array{Core.Compiler.BasicBlock, (32,)}[
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=1, last=7), preds=Array{Int64, (1,)}[32], succs=Array{Int64, (1,)}[2]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=8, last=14), preds=Array{Int64, (1,)}[1], succs=Array{Int64, (2,)}[5, 3]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=15, last=20), preds=Array{Int64, (1,)}[2], succs=Array{Int64, (1,)}[4]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=21, last=21), preds=Array{Int64, (1,)}[3], succs=Array{Int64, (1,)}[7]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=22, last=28), preds=Array{Int64, (1,)}[2], succs=Array{Int64, (1,)}[6]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=29, last=29), preds=Array{Int64, (1,)}[5], succs=Array{Int64, (1,)}[7]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=30, last=47), preds=Array{Int64, (2,)}[4, 6], succs=Array{Int64, (2,)}[9, 8]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=48, last=49), preds=Array{Int64, (1,)}[7], succs=Array{Int64, (0,)}[]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=50, last=50), preds=Array{Int64, (1,)}[7], succs=Array{Int64, (1,)}[10]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=51, last=52), preds=Array{Int64, (1,)}[9], succs=Array{Int64, (1,)}[11]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=53, last=53), preds=Array{Int64, (1,)}[10], succs=Array{Int64, (1,)}[12]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=54, last=54), preds=Array{Int64, (1,)}[11], succs=Array{Int64, (1,)}[13]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=55, last=56), preds=Array{Int64, (1,)}[12], succs=Array{Int64, (1,)}[14]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=57, last=65), preds=Array{Int64, (1,)}[13], succs=Array{Int64, (1,)}[15]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=66, last=71), preds=Array{Int64, (1,)}[14], succs=Array{Int64, (2,)}[17, 16]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=72, last=74), preds=Array{Int64, (1,)}[15], succs=Array{Int64, (0,)}[]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=75, last=82), preds=Array{Int64, (1,)}[15], succs=Array{Int64, (2,)}[19, 18]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=83, last=84), preds=Array{Int64, (1,)}[17], succs=Array{Int64, (0,)}[]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=85, last=85), preds=Array{Int64, (1,)}[17], succs=Array{Int64, (1,)}[20]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=86, last=87), preds=Array{Int64, (1,)}[19], succs=Array{Int64, (1,)}[21]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=88, last=88), preds=Array{Int64, (1,)}[20], succs=Array{Int64, (1,)}[22]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=89, last=89), preds=Array{Int64, (1,)}[21], succs=Array{Int64, (1,)}[23]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=90, last=90), preds=Array{Int64, (1,)}[22], succs=Array{Int64, (1,)}[24]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=91, last=92), preds=Array{Int64, (1,)}[23], succs=Array{Int64, (1,)}[25]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=93, last=93), preds=Array{Int64, (1,)}[24], succs=Array{Int64, (1,)}[26]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=94, last=94), preds=Array{Int64, (1,)}[25], succs=Array{Int64, (1,)}[27]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=95, last=95), preds=Array{Int64, (1,)}[26], succs=Array{Int64, (2,)}[29, 28]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=96, last=98), preds=Array{Int64, (1,)}[27], succs=Array{Int64, (0,)}[]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=99, last=102), preds=Array{Int64, (1,)}[27], succs=Array{Int64, (2,)}[31, 30]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=103, last=103), preds=Array{Int64, (1,)}[29], succs=Array{Int64, (1,)}[32]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=104, last=104), preds=Array{Int64, (1,)}[29], succs=Array{Int64, (1,)}[32]),
  Core.Compiler.BasicBlock(stmts=Core.Compiler.StmtRange(first=105, last=107), preds=Array{Int64, (2,)}[30, 31], succs=Array{Int64, (1,)}[1])], i=(0,))
rec_backtrace at /buildworker/worker/package_linux64/build/src/stackwalk.c:94
record_backtrace at /buildworker/worker/package_linux64/build/src/task.c:246
jl_throw at /buildworker/worker/package_linux64/build/src/task.c:577
jl_bounds_error_ints at /buildworker/worker/package_linux64/build/src/rtutils.c:187
getindex at ./array.jl:731
jfptr_getindex_1592.clone_1 at /users/omlins/julia/julia-1.0.2/lib/julia/ (unknown line)
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
replace_code_newstyle! at ./compiler/ssair/legacy.jl:80
optimize at ./compiler/optimize.jl:212
typeinf at ./compiler/typeinfer.jl:35
typeinf_ext at ./compiler/typeinfer.jl:567
typeinf_ext at ./compiler/typeinfer.jl:604
jfptr_typeinf_ext_1.clone_1 at /users/omlins/julia/julia-1.0.2/lib/julia/ (unknown line)
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1537 [inlined]
jl_apply_with_saved_exception_state at /buildworker/worker/package_linux64/build/src/rtutils.c:257
jl_type_infer at /buildworker/worker/package_linux64/build/src/gf.c:275
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1786 [inlined]
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1830
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1537 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:268
unknown function (ip: 0xffffffffffffffff)

Note that I had freshly installed the master version of all the GPU related packages:

(v1.0) pkg> status
  [c5f51814] CUDAdrv v2.1.0 #master (
  [be33ccc6] CUDAnative v2.1.0 #master (
  [3a865a2d] CuArrays v1.0.2 #master (
  [0c68f7d7] GPUArrays v0.7.0 #master (

PS: as it is about the master and not a release version, I put this issue also here (the aim is trying out pinned memory access)…

That’s a mostly harmless issue with julia 1.0, I recommend upgrading to 1.1 if possible. Nonetheless, it shouldn’t affect the functionality of the GPU stack (as far as I’ve seen).

@maleadt: to my understanding (after looking through the merged PRs in CUDAdrv, CUDAnative and CuArrays), doing pinned memory access requires currently using directly CUDAdrv; i.e it cannot yet be done on a higher level using CuArrays or CUDAnative. Is that correct?

Could you advice on how the pinned memory access could enter into high-level programming, i.e. how one could hack the missing high-level function register to enable something very approximately along these lines:

function memcopy_d2h!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return nothing

nx = 1024^2
A = zeros(nx);
B = rand(nx);

A_d = register(A); # register and get device pointer (or new kind of object for registered host arrays)
B   = CuArray(B);

nthreads = 1024
nblocks = ceil(Int, nx/nthreads)
@cuda blocks=nblocks threads=nthreads memcopy_d2h!(A_d, B);

I highly appreciate any help you can give. Thanks!

Your use of register is confusing, do you want pinned memory and an async memcpy, or do you want to register an existing host pointer and map it into device space?

Here’s an example of the former:

julia> A = zeros(nx);

julia> A_cpuptr = pointer(A)
Ptr{Float64} @0x00007f360f7ff040

julia> A_buf = Mem.register(Mem.Host, A_cpuptr, sizeof(A), Mem.HOSTREGISTER_DEVICEMAP)
CUDAdrv.Mem.HostBuffer(Ptr{Nothing} @0x00007f360f7ff040, 8388608, CuContext(Ptr{Nothing} @0x000000000255dc70, false, true), true)

julia> A_gpuptr = convert(CuPtr{Float64}, A_buf)

julia> A_d = unsafe_wrap(CuArray, A_gpuptr, size(A));

# proof the devicemap works

julia> A[1] = 42

julia> A_d[1]

A_d is now a device array bound to a CPU memory allocation. Accessing that memory from the GPU is pretty expensive though, since it incurs PCIE reads.

@maleadt, I am sorry that it was not fully clear. My idea was that the function register would do the equivalent to a call to cudaHostRegister(..., cudaHostRegisterMapped) and in addition also directly give back a device pointer, i.e. include the equivivalent of cudaHostGetDevicePointer(...) (compare with the CUDA C code in the 2nd paragraph of the topic description).
Moreover, my aim has been to do the equivivalent of the CUDA C code in [1] (also in the topic description), which allows to test the sustained performance of pinned memory access from a GPU kernel. In other words, my objective has been to pin an existing host buffer, map it to device memory, get a device pointer and use this pointer in a GPU kernel to do DMA of the host buffer. Thanks to your help I could do it now and, thus, the following Julia code does the same as the CUDA C code in [1] in the topic description:

using CUDAdrv, CUDAnative, CuArrays

function copy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    @inbounds A[ix] = B[ix]
    return nothing

function register(A)
    A_buf = Mem.register(Mem.Host, pointer(A), sizeof(A), Mem.HOSTREGISTER_DEVICEMAP)
    A_gpuptr = convert(CuPtr{Float64}, A_buf)
    return unsafe_wrap(CuArray, A_gpuptr, size(A));

warmup = 3
nx = 512*512*1024; #1024^2  512*512*1024
nt = 10
nthreads = 1024
nblocks = ceil(Int, nx/nthreads)
A = zeros(nx);
B = rand(nx);
A_d = register(A);
B = CuArray(B);

# Copy from host to device.
for it = 1:nt+warmup
    if (it == warmup+1) global t0 = time() end
    @cuda blocks=nblocks threads=nthreads copy!(B, A_d);
time_s = time() - t0;  
ntransfers = 1  #Number of host-device transfers per iteration
GBs = 1.0/1024^3*nt*nx*sizeof(Float64)*ntransfers/time_s;
println("h2d: time: $time_s; GB/s: $GBs")

# Copy from device to host.
for it = 1:nt+warmup
    if (it == warmup+1) global t0 = time() end
    @cuda blocks=nblocks threads=nthreads copy!(A_d, B);
time_s = time() - t0;  
ntransfers = 1  #Number of host-device transfers per iteration
GBs = 1.0/1024^3*nt*nx*sizeof(Float64)*ntransfers/time_s;
println("d2h: time: $time_s; GB/s: $GBs")

Now here are some example runs showing the obtained performance:

  1. CUDA C:
> ../cu/a.out 
h2d: time: 1.7783; GB/s: 11.2470
d2h: time: 1.7629; GB/s: 11.3448
> ../cu/a.out 
h2d: time: 1.7776; GB/s: 11.2511
d2h: time: 1.7627; GB/s: 11.3461
> ../cu/a.out 
h2d: time: 1.7783; GB/s: 11.2466
d2h: time: 1.7628; GB/s: 11.3458
  1. Julia:
> jul -O3 h_d_transfers.jl
h2d: time: 2.4673500061035156; GB/s: 8.105862545048632
d2h: time: 1.753148078918457; GB/s: 11.408049462848737
> jul -O3 h_d_transfers.jl
h2d: time: 2.4986469745635986; GB/s: 8.004332025932996
d2h: time: 1.7535121440887451; GB/s: 11.405680917250494
> jul -O3 h_d_transfers.jl
h2d: time: 2.4407520294189453; GB/s: 8.194195788402673
d2h: time: 1.7531590461730957; GB/s: 11.40797809739923

Note that all the parameters were the same for CUDA C and Julia. We can observe that the device to host transfer speed (d2h) is a tiny bit better with Julia than with CUDA C. However, the host to device transfer speed (h2d) with Julia is significantly less good (27% less) than with CUDA C. Moreover, there is a much higher performance variation in the Julia h2d experiments, than in the other experiments (Julia d2h, CUDA C h2d and CUDA C d2h). Can you tell me why the Julia h2d experiments achieve a significantly lower performance and show a higher performance variation? Do you know how to fix this?