CuArrays/CUDAnative/CUDAdrv equivalent to cudaMemcpy3DAsync?

What is the CuArrays/CUDAnative/CUDAdrv equivalent to cudaMemcpy3DAsync?

More precisely, I am looking for how to do the equivalent of the following CUDA C code, which copies an XZ plane of a 3-D device array to a contiguous 1-D host array:

int main(){
  const size_t nx = 512;
  const size_t ny = 512;
  const size_t nz = 512;
  float *d_src, *h_dst;
  cudaMalloc(&d_src, nx*ny*nz*sizeof(float));
  cudaHostAlloc(&h_dst, nx*nz*sizeof(float), cudaHostAllocDefault);
  cudaMemset(d_src, 0, nx*ny*nz*sizeof(float));
  memset(h_dst, 1, nx*nz*sizeof(float));

  cudaMemcpy3DParms cpy_params = {0};
  // Source: the full nx*ny*nz device array, viewed as pitched rows of nx floats.
  cpy_params.srcPtr = make_cudaPitchedPtr(d_src, nx*sizeof(float), nx, ny);
  // Destination: the contiguous nx*nz host array (one row per z-slice).
  cpy_params.dstPtr = make_cudaPitchedPtr(h_dst, nx*sizeof(float), nx, 1);
  // Copy the XZ plane at y=1: one row of nx floats for each of the nz slices.
  cpy_params.srcPos = make_cudaPos((size_t) 0, (size_t) 1, (size_t) 0);
  cpy_params.dstPos = make_cudaPos((size_t) 0, (size_t) 0, (size_t) 0);
  cpy_params.extent = make_cudaExtent(nx*sizeof(float), 1, nz);
  cpy_params.kind   = cudaMemcpyDeviceToHost;

  for (int i=0; i<100; i++){
    cudaMemcpy3DAsync(&cpy_params);
  }

  cudaDeviceSynchronize();
}

Thanks!!

Last time I looked at this, it seemed that there is little benefit to using those APIs over just using memcpyAsync. Currently we don’t have these exposed, since you also need to use cudaMalloc3D.

They are currently not wrapped in CUDAdrv.jl, so if you want to play with these APIs, the first thing to do is to wrap them there and then use them in the right places in CuArrays.

Thanks @vchuravy.

Last time I looked at this, it seemed that there is little benefit to using those APIs over just using memcpyAsync.

OK, so if I understand right, you suggest multiple calls to memcpyAsync to achieve what a single call to cudaMemcpy3DAsync in CUDA C would do. I can well believe that this gives similar performance. Could you indicate which function in CuArrays/CUDAnative/CUDAdrv you are concretely referring to, or give a minimal example?

Currently we don’t have these exposed since you also need to use cudaMalloc3D.

Just for the record: it might be common or recommended to use cudaMalloc3D in conjunction with cudaMemcpy3DAsync; however, the above example is a working use case without cudaMalloc3D (using just make_cudaPitchedPtr to create the necessary pitched pointer structs from plain pointers), achieving a device-to-host transfer throughput of about 12 GB/s in my case (16 PCIe 3.0 lanes, NVIDIA® Tesla® P100). Then again, I can well believe that with multiple memcpyAsync calls one can get the same performance, as you suggest.
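
For completeness, a minimal sketch of what the cudaMalloc3D variant would roughly look like (untested; d_src_p and alloc_extent are just illustrative names; cudaMalloc3D returns a possibly row-padded cudaPitchedPtr that can be used directly as srcPtr):

  // Hypothetical variant of the allocation above using cudaMalloc3D (sketch):
  cudaPitchedPtr d_src_p;
  cudaExtent alloc_extent = make_cudaExtent(nx*sizeof(float), ny, nz);
  cudaMalloc3D(&d_src_p, alloc_extent);  // rows may be padded to d_src_p.pitch bytes
  cpy_params.srcPtr = d_src_p;           // used directly as the copy source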

@vchuravy, I couldn’t find any higher-level way to do an asynchronous memory copy than with CUDAdrv.Mem.copy!. Assuming this is the only way, can anyone please tell me how to specify a pointer offset? I can copy data from a device array to a host array (see the example at the end of the post), but I cannot find out how to do a copy that starts from an element other than the first.

Could anyone please give me some hints? I am really stuck here and could not find any example when searching through the source code of CUDAdrv and CUDAnative and through related git PRs… Maybe @maleadt, you could help out?

Thanks!

$> julia
julia> using CUDAdrv, CUDAnative, CuArrays

julia> function register(A)
          # page-lock (pin) the host array and map it into the device address space
          A_buf = Mem.register(Mem.Host, pointer(A), sizeof(A), Mem.HOSTREGISTER_DEVICEMAP)
          A_gpuptr = convert(CuPtr{Float64}, A_buf)
          # wrap the mapped memory as a CuArray, without copying
          return unsafe_wrap(CuArray, A_gpuptr, size(A))
      end
register (generic function with 1 method)

julia> nx=2; ny=2; nz=2;

julia> A = zeros(ny,nz);

julia> B = rand(nx,ny,nz);

julia> A_d = register(A);

julia> B = CuArray(B);

julia> B
2×2×2 CuArray{Float64,3}:
[:, :, 1] =
0.262772  0.854401
0.715771  0.750954

[:, :, 2] =
0.91808   0.379343
0.861054  0.6118  

julia> A_d
2×2 CuArray{Float64,2}:
0.0  0.0
0.0  0.0

julia> Mem.copy!(A_d.buf, B.buf, 3*sizeof(Float64), async=true, stream=CuStream(CUDAdrv.STREAM_NON_BLOCKING));

julia> A_d
2×2 CuArray{Float64,2}:
0.262772  0.854401
0.715771  0.0     

A bit clunky, but you can always do:

julia> B
2×2×2 CuArray{Float64,3}:
[:, :, 1] =
 0.982322  0.865064
 0.439428  0.134357

[:, :, 2] =
 0.668101  0.291983
 0.845549  0.892549

...

julia> buf′ = similar(B.buf, pointer(B.buf) + 4*sizeof(Float64), sizeof(B.buf) - 4*sizeof(Float64))
CUDAdrv.Mem.DeviceBuffer(CuPtr{Nothing}(0x0000000b03bc0020), 32, CuContext(Ptr{Nothing} @0x00000000027467c0, false, true))

julia> Mem.copy!(A_d.buf, buf′, 3*sizeof(Float64), async=true, stream=CuStream(CUDAdrv.STREAM_NON_BLOCKING));

julia> A_d
2×2 CuArray{Float64,2}:
 0.668101  0.291983
 0.845549  0.0

Concerning the 3D memcpy, keep us posted wrt. the performance. Admittedly, I’ve never really had the need to use any of these API calls, which is why they probably aren’t wrapped. But if there’s really a performance difference, we should, at the least, expose them in CUDAdrv.jl for manual use.

Thanks @maleadt!
Sure, I will let you know in case I get significantly better performance with the 3-D memcopy than with the 1-D memcopy. Yet, I doubt it, as the 3-D memcopy doesn’t actually perform well for very strided copies…

@maleadt, I was probably too quick to say that we can get the same performance with multiple calls to 1-D memcopy as with a 3-D memcopy. I changed the CUDA C code from the topic description to do the same device-to-host copy of an XZ plane of a 3-D array with 512 calls to 1-D memcopy instead (when I issued all of them on the same stream, the result was about the same):

int main(){
  const size_t nx = 512;
  const size_t ny = 512;
  const size_t nz = 512;
  float *d_src, *h_dst;
  cudaStream_t streams[nz];
  cudaMalloc(&d_src, nx*ny*nz*sizeof(float));
  cudaHostAlloc(&h_dst, nx*nz*sizeof(float), cudaHostAllocDefault);
  cudaMemset(d_src, 0, nx*ny*nz*sizeof(float));
  memset(h_dst, 1, nx*nz*sizeof(float));
  for (int iz=0; iz<nz; iz++){
    cudaStreamCreateWithFlags(&streams[iz], cudaStreamNonBlocking);
  }
  for (int i=0; i<100; i++){
    for (int iz=0; iz<nz; iz++){
      cudaMemcpyAsync(&h_dst[iz*nx], &d_src[iz*nx*ny], nx*sizeof(float), cudaMemcpyDeviceToHost, streams[iz]);
    }
  }

  cudaDeviceSynchronize();
}
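
The timing code is omitted above; a minimal sketch of how the 100 iterations could be timed with CUDA events around either copy loop (t0, t1 and ms are illustrative names; this is not necessarily the exact measurement setup I used):

  // Sketch: time the copy loop with CUDA events (not part of the code above).
  cudaEvent_t t0, t1;
  float ms;
  cudaEventCreate(&t0);
  cudaEventCreate(&t1);
  cudaEventRecord(t0);
  for (int i=0; i<100; i++){
    for (int iz=0; iz<nz; iz++){
      cudaMemcpyAsync(&h_dst[iz*nx], &d_src[iz*nx*ny], nx*sizeof(float), cudaMemcpyDeviceToHost, streams[iz]);
    }
  }
  cudaDeviceSynchronize();            // wait for all streams before stopping the timer
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);
  cudaEventElapsedTime(&ms, t0, t1);  // elapsed milliseconds for the 100 iterations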

It is about 8 times slower: it reaches only about 1.5 GB/s[1] (67 ms for 100 iterations, each moving nx*nz*sizeof(float) ≈ 1.05 MB), while the version with 3-D copies reaches 13 GB/s (8 ms for 100 iterations).

With the Julia code that I did first, I got about the same throughput as with this CUDA C code using 1-D copies (1.5 GB/s).

Could you therefore add the 3-D memcopy to CUDAnative/CUDAdrv?
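
For reference, the driver-API counterpart that CUDAdrv.jl would presumably need to wrap looks roughly as follows: an untested sketch of the same XZ-plane copy via cuMemcpy3DAsync and the CUDA_MEMCPY3D struct, with the values taken from the runtime example in the topic description (p is just an illustrative name):

  CUDA_MEMCPY3D p = {0};
  p.srcMemoryType = CU_MEMORYTYPE_DEVICE;
  p.srcDevice     = (CUdeviceptr) d_src;
  p.srcPitch      = nx*sizeof(float);
  p.srcHeight     = ny;
  p.srcY          = 1;                   // start at the y=1 plane
  p.dstMemoryType = CU_MEMORYTYPE_HOST;
  p.dstHost       = h_dst;
  p.dstPitch      = nx*sizeof(float);
  p.dstHeight     = 1;
  p.WidthInBytes  = nx*sizeof(float);
  p.Height        = 1;
  p.Depth         = nz;
  cuMemcpy3DAsync(&p, 0);                // 0 = default stream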

Thanks!

[1] Giga here means 1e9, as in the NVIDIA bandwidth test.

Your 1D memcpy example segfaults here.

Please open an issue on CUDAdrv.jl to track the implementation.

I am sorry, I did that in a hurry on Friday evening: I mixed up the strides of the source and destination; it should be the opposite. I.e., instead of

cudaMemcpyAsync(&h_dst[iz*nx*ny], &d_src[iz*nx], nx*sizeof(float), cudaMemcpyDeviceToHost, streams[iz]);

it should be:

cudaMemcpyAsync(&h_dst[iz*nx], &d_src[iz*nx*ny], nx*sizeof(float), cudaMemcpyDeviceToHost, streams[iz]);

I changed it in the code above.

For some unknown reason it did not segfault on my GPU (I ran it again just now and it still did not segfault)…

I will open an issue on CUDAdrv, thanks!

I created a CUDAdrv issue for it.

(I posted this to the wrong thread by mistake, so I am copying it here together with @maleadt’s reply.)
@maleadt, is there a quick way to do a 3-D memcopy in Julia now? As you know, I would like to use it to further improve the halo update of our real-world application before our talk at JuliaCon next week…
Thanks!

@maleadt’s reply:
Have a look at https://github.com/JuliaGPU/CUDAdrv.jl/issues/149#issuecomment-511338109