CUDA CPU allocations with range


I was wondering: why does CuArray of a range allocate the result on the CPU and move it to the GPU afterwards? If I do the same directly with a broadcast call, it works flawlessly.
Any suggestions on how to prevent that?

julia> using CUDA

julia> xc = CUDA.rand(10_000); 

julia> CUDA.@time xc .= 1:10_000;  # already compiled
  0.000097 seconds (7 CPU allocations: 480 bytes) 

julia> CUDA.@time CuArray(1:10_000);  # already compiled, why CPU allocations?
  0.000051 seconds (8 CPU allocations: 78.344 KiB) (1 GPU allocation: 78.125 KiB, 22.71% memmgmt time) 

julia> CUDA.@time Array(1:10_000); 
  0.000010 seconds (2 CPU allocations: 78.172 KiB)



That CuArray constructor only accepts Array inputs, and converts whatever you pass it into an Array first. That’s more robust than trying to perform the conversion on the GPU (the object might be mutable, for example). And constructing a CuArray isn’t supposed to be the expensive operation worth accelerating on the device; operations with the resulting object are.
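In other words, the constructor path is roughly equivalent to the following sketch (not the actual CUDA.jl implementation; assumes a CUDA-capable GPU):

```julia
using CUDA

# Rough sketch of what CuArray(1:10_000) does internally:
# materialize the range on the CPU first, then copy the buffer to the device.
host = Array(1:10_000)                          # CPU allocation (collect the range)
dev  = CuArray{eltype(host)}(undef, size(host)) # GPU allocation
copyto!(dev, host)                              # host-to-device transfer
```

The CPU allocations reported by `CUDA.@time` come from that intermediate `Array`.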


So the suggested way (e.g. for a range) is to write .=?

Basically yes. That’s what such a constructor would do (and used to do at some point in the past). A custom kernel for ranges would probably outperform broadcast, but again this is unlikely to be a performance-critical operation.
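For illustration, such a custom kernel could look something like this hedged sketch (the function names `fill_range_kernel!` and `curange` are hypothetical, not CUDA.jl API; assumes a CUDA-capable GPU):

```julia
using CUDA

# Each thread writes one element of the range directly on the device,
# avoiding the CPU-side materialization entirely.
function fill_range_kernel!(dst, start, stp)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        @inbounds dst[i] = start + (i - 1) * stp
    end
    return nothing
end

function curange(r::AbstractRange{T}) where {T}
    dst = CuArray{T}(undef, length(r))
    threads = 256
    blocks = cld(length(r), threads)
    @cuda threads=threads blocks=blocks fill_range_kernel!(dst, first(r), step(r))
    return dst
end
```

As noted above, though, broadcast (`dst .= r`) is likely fast enough in practice, since constructing the array is rarely the hot loop.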

Is there something like a convenient function?

function foo(AA::AbstractArray{T, N}) where {T, N}
    AAc = CuArray{T, N}(undef, size(AA))
    AAc .= AA
    return AAc
end

I don’t think 2 lines of code warrant a convenience function? If you care you can always create an additional outer constructor, a la CUDA.jl/array.jl at e00ad245d96763ebac4cea2ce4a9c7d7b722bd58 · JuliaGPU/CUDA.jl · GitHub. If this really has a significant impact you can create a PR on CUDA.jl :slightly_smiling_face:
