CuArrays/CUDAnative/CUDAdrv equivalent to cudaMemcpy3DAsync?

What is the CuArrays/CUDAnative/CUDAdrv equivalent to cudaMemcpy3DAsync?

More precisely, I am looking for how to do the equivalent of the following CUDA C code, which copies an XZ plane of a 3-D device array to a contiguous 1-D host array:

int main(){
  const size_t nx = 512;
  const size_t ny = 512;
  const size_t nz = 512;
  float *d_src, *h_dst;
  cudaMalloc(&d_src, nx*ny*nz*sizeof(float));
  cudaHostAlloc(&h_dst, nx*nz*sizeof(float), cudaHostAllocDefault);
  cudaMemset(d_src, 0, nx*ny*nz*sizeof(float));
  memset(h_dst, 1, nx*nz*sizeof(float));

  cudaMemcpy3DParms cpy_params = {0};
  // Source: the full nx*ny*nz device array, viewed as pitched rows of nx floats.
  cpy_params.srcPtr = make_cudaPitchedPtr(d_src, nx*sizeof(float), nx, ny);
  // Destination: the contiguous nx*nz host array (one row per z-slice).
  cpy_params.dstPtr = make_cudaPitchedPtr(h_dst, nx*sizeof(float), nx, 1);
  // Copy the XZ plane at y=1: one row of nx floats for each of the nz slices.
  cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 1, (size_t) 0);
  cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.extent = make_cudaExtent(nx*sizeof(float), 1, nz);
  cpy_params.kind   = cudaMemcpyDeviceToHost;

  for (int i=0; i<100; i++){
    cudaMemcpy3DAsync(&cpy_params);
  }

  cudaDeviceSynchronize();
}

Thanks!!

Last time I looked at this, it seemed that there is little benefit to using those APIs over just using memcpyAsync. Currently we don’t have these exposed, since you also need to use cudaMalloc3D.

They are currently not wrapped in CUDAdrv.jl, so if you want to play with these APIs, the first thing to do is to wrap them there and then use them in the right places in CuArrays.

Thanks @vchuravy.

Last time I looked at this, it seemed that there is little benefit to using those APIs over just using memcpyAsync.

OK, so if I understand right, you suggest multiple calls to memcpyAsync to achieve what a single call to cudaMemcpy3DAsync in CUDA C would do. I can well believe that this gives similar performance. Could you indicate which function in CuArrays/CUDAnative/CUDAdrv you are concretely referring to, or give a minimal example?

Currently we don’t have these exposed since you also need to use cudaMalloc3D.

Just for the record: it might be common or recommended to use cudaMalloc3D in conjunction with cudaMemcpy3DAsync; however, the above example is a working use case without cudaMalloc3D (using just make_cudaPitchedPtr to create the necessary pitched pointer structs from plain pointers), achieving a device-to-host transfer throughput of about 12 GB/s in my case (16 PCIe 3.0 lanes, NVIDIA® Tesla® P100). Then again, I can well believe that with multiple memcpyAsync calls one can get the same performance, as you suggest.
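
For completeness, a minimal sketch of what the cudaMalloc3D variant would roughly look like (untested; d_src_p and alloc_extent are just illustrative names; cudaMalloc3D returns a possibly row-padded cudaPitchedPtr that can be used directly as srcPtr):

  // Hypothetical variant of the allocation above using cudaMalloc3D (sketch):
  cudaPitchedPtr d_src_p;
  cudaExtent alloc_extent = make_cudaExtent(nx*sizeof(float), ny, nz);
  cudaMalloc3D(&d_src_p, alloc_extent);  // rows may be padded to d_src_p.pitch bytes
  cpy_params.srcPtr = d_src_p;           // used directly as the copy source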

@vchuravy, I couldn’t find any higher-level way to do an asynchronous memory copy than with CUDAdrv.Mem.copy!. Assuming this is the only way, can anyone please tell me how to specify a pointer offset? I can copy data from a device array to a host array (see the example at the end of the post), but I cannot find out how to do a copy that starts from an element other than the first.

Could anyone please give me some hints? I am really stuck here and could not find any example when searching through the source code of CUDAdrv and CUDAnative and through related git PRs… Maybe @maleadt, you could help out?

Thanks!

$> julia
julia> using CUDAdrv, CUDAnative, CuArrays

julia> function register(A)
          # page-lock (pin) the host array and map it into the device address space
          A_buf = Mem.register(Mem.Host, pointer(A), sizeof(A), Mem.HOSTREGISTER_DEVICEMAP)
          A_gpuptr = convert(CuPtr{Float64}, A_buf)
          # wrap the mapped memory as a CuArray, without copying
          return unsafe_wrap(CuArray, A_gpuptr, size(A))
      end
register (generic function with 1 method)

julia> nx=2; ny=2; nz=2;

julia> A = zeros(ny,nz);

julia> B = rand(nx,ny,nz);

julia> A_d = register(A);

julia> B = CuArray(B);

julia> B
2×2×2 CuArray{Float64,3}:
[:, :, 1] =
0.262772  0.854401
0.715771  0.750954

[:, :, 2] =
0.91808   0.379343
0.861054  0.6118  

julia> A_d
2×2 CuArray{Float64,2}:
0.0  0.0
0.0  0.0

julia> Mem.copy!(A_d.buf, B.buf, 3*sizeof(Float64), async=true, stream=CuStream(CUDAdrv.STREAM_NON_BLOCKING));

julia> A_d
2×2 CuArray{Float64,2}:
0.262772  0.854401
0.715771  0.0     

A bit clunky, but you can always do:

julia> B
2×2×2 CuArray{Float64,3}:
[:, :, 1] =
 0.982322  0.865064
 0.439428  0.134357

[:, :, 2] =
 0.668101  0.291983
 0.845549  0.892549

...

julia> buf′ = similar(B.buf, pointer(B.buf) + 4*sizeof(Float64), sizeof(B.buf) - 4*sizeof(Float64))
CUDAdrv.Mem.DeviceBuffer(CuPtr{Nothing}(0x0000000b03bc0020), 32, CuContext(Ptr{Nothing} @0x00000000027467c0, false, true))

julia> Mem.copy!(A_d.buf, buf′, 3*sizeof(Float64), async=true, stream=CuStream(CUDAdrv.STREAM_NON_BLOCKING));

julia> A_d
2×2 CuArray{Float64,2}:
 0.668101  0.291983
 0.845549  0.0

Concerning the 3D memcpy, keep us posted wrt. the performance. Admittedly, I’ve never really had the need to use any of these API calls, which is why they probably aren’t wrapped. But if there’s really a performance difference, we should, at the least, expose them in CUDAdrv.jl for manual use.

Thanks @maleadt!
Sure, I will let you know in case I get significantly better performance with the 3-D memcopy than with the 1-D memcopy. Yet, I doubt it, as the 3-D memcopy doesn’t actually perform well for very strided copies…

@maleadt, I was probably too quick to say that we can get the same performance with multiple calls to 1-D memcopy as with a 3-D memcopy. I changed the CUDA C code from the topic description to do the same device-to-host copy of an XZ plane of a 3-D array with 512 calls to 1-D memcopy instead (when I issued all of them on the same stream, the result was about the same):

int main(){
  const size_t nx = 512;
  const size_t ny = 512;
  const size_t nz = 512;
  float *d_src, *h_dst;
  cudaStream_t streams[nz];
  cudaMalloc(&d_src, nx*ny*nz*sizeof(float));
  cudaHostAlloc(&h_dst, nx*nz*sizeof(float), cudaHostAllocDefault);
  cudaMemset(d_src, 0, nx*ny*nz*sizeof(float));
  memset(h_dst, 1, nx*nz*sizeof(float));
  for (int iz=0; iz<nz; iz++){
    cudaStreamCreateWithFlags(&streams[iz], cudaStreamNonBlocking);
  }
  for (int i=0; i<100; i++){
    for (int iz=0; iz<nz; iz++){
      cudaMemcpyAsync(&h_dst[iz*nx], &d_src[iz*nx*ny], nx*sizeof(float), cudaMemcpyDeviceToHost, streams[iz]);
    }
  }

  cudaDeviceSynchronize();
}
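
The timing code is omitted above; a minimal sketch of how the 100 iterations could be timed with CUDA events around either copy loop (t0, t1 and ms are illustrative names; this is not necessarily the exact measurement setup I used):

  // Sketch: time the copy loop with CUDA events (not part of the code above).
  cudaEvent_t t0, t1;
  float ms;
  cudaEventCreate(&t0);
  cudaEventCreate(&t1);
  cudaEventRecord(t0);
  for (int i=0; i<100; i++){
    for (int iz=0; iz<nz; iz++){
      cudaMemcpyAsync(&h_dst[iz*nx], &d_src[iz*nx*ny], nx*sizeof(float), cudaMemcpyDeviceToHost, streams[iz]);
    }
  }
  cudaDeviceSynchronize();            // wait for all streams before stopping the timer
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);
  cudaEventElapsedTime(&ms, t0, t1);  // elapsed milliseconds for the 100 iterations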

It is about 8 times slower: it reaches only about 1.5 GB/s[1] (67 ms for 100 iterations, each moving nx*nz*sizeof(float) ≈ 1.05 MB), while the version with 3-D copies reaches 13 GB/s (8 ms for 100 iterations).

With the Julia code that I did first, I got about the same throughput as with this CUDA C code using 1-D copies (1.5 GB/s).

Could you therefore add the 3-D memcopy to CUDAnative/CUDAdrv?
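
For reference, the driver-API counterpart that CUDAdrv.jl would presumably need to wrap looks roughly as follows: an untested sketch of the same XZ-plane copy via cuMemcpy3DAsync and the CUDA_MEMCPY3D struct, with the values taken from the runtime example in the topic description (p is just an illustrative name):

  CUDA_MEMCPY3D p = {0};
  p.srcMemoryType = CU_MEMORYTYPE_DEVICE;
  p.srcDevice     = (CUdeviceptr) d_src;
  p.srcPitch      = nx*sizeof(float);
  p.srcHeight     = ny;
  p.srcY          = 1;                   // start at the y=1 plane
  p.dstMemoryType = CU_MEMORYTYPE_HOST;
  p.dstHost       = h_dst;
  p.dstPitch      = nx*sizeof(float);
  p.dstHeight     = 1;
  p.WidthInBytes  = nx*sizeof(float);
  p.Height        = 1;
  p.Depth         = nz;
  cuMemcpy3DAsync(&p, 0);                // 0 = default stream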

Thanks!

[1] Giga here means 1e9, as in the NVIDIA bandwidth test.

Your 1D memcpy example segfaults here.

Please open an issue on CUDAdrv.jl to track the implementation.

I am sorry, I did that in a hurry on Friday evening: I mixed up the strides of the source and destination; it should be the opposite. I.e., instead of

cudaMemcpyAsync(&h_dst[iz*nx*ny], &d_src[iz*nx], nx*sizeof(float), cudaMemcpyDeviceToHost, streams[iz]);

it should be:

cudaMemcpyAsync(&h_dst[iz*nx], &d_src[iz*nx*ny], nx*sizeof(float), cudaMemcpyDeviceToHost, streams[iz]);

I changed it in the code above.

For some unknown reason it did not segfault on my GPU (I ran it again just now and it still did not segfault)…

I will open an issue on CUDAdrv, thanks!

I created a CUDAdrv issue for it.

(I posted this to the wrong thread by mistake, so I am copying it here together with @maleadt’s reply.)
@maleadt, is there a quick way to do a 3-D memcopy in Julia now? As you know, I would like to use it to further improve the halo update of our real-world application before our talk at JuliaCon next week…
Thanks!

@maleadt’s reply:
Have a look at https://github.com/JuliaGPU/CUDAdrv.jl/issues/149#issuecomment-511338109