CUDA CPU allocations with range

Hi!

I was wondering why CuArray of a range allocates the result on the CPU and moves it to the GPU afterwards. If I do the same directly with a broadcast call, it works flawlessly.
Any suggestions on how to prevent that?

julia> using CUDA

julia> xc = CUDA.rand(10_000); 

julia> CUDA.@time xc .= 1:10_000;  # already compiled
  0.000097 seconds (7 CPU allocations: 480 bytes) 

julia> CUDA.@time CuArray(1:10_000);  # already compiled, why CPU allocations?
  0.000051 seconds (8 CPU allocations: 78.344 KiB) (1 GPU allocation: 78.125 KiB, 22.71% memmgmt time) 

julia> CUDA.@time Array(1:10_000); 
  0.000010 seconds (2 CPU allocations: 78.172 KiB)

Best,

Felix

That CuArray constructor only accepts Array inputs, so it converts whatever you pass it into an Array first. That’s more robust than trying to perform the conversion on the GPU (the input object might be mutable, for example). And constructing a CuArray isn’t supposed to be an expensive operation worth accelerating on the device; the operations on the resulting object are.
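Roughly speaking, the two timings above correspond to these two paths (a minimal sketch of the behavior, not CUDA.jl’s actual implementation):

using CUDA

# Path taken by CuArray(1:10_000): the range is materialized on the
# CPU first, then copied to the device in one go.
cpu = collect(1:10_000)     # CPU allocation (~78 KiB)
xg  = CuArray(cpu)          # single host-to-device copy

# Path taken by xc .= 1:10_000: the elements are computed directly
# on the device by a broadcast kernel, with no CPU-side copy.
xg2 = CuArray{Int}(undef, 10_000)
xg2 .= 1:10_000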


So the suggested way (for, e.g., a range) is to write .=?

Basically yes. That’s what such a constructor would do (and used to do at some point in the past). A custom kernel for ranges would probably outperform broadcast, but again this is unlikely to be a performance-critical operation.
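For illustration, such a kernel might look like this (a sketch only; the function names and launch configuration here are mine, not CUDA.jl API):

using CUDA

# Each thread writes one element of the range into the destination.
function fill_from_range_kernel!(dst, r)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        @inbounds dst[i] = r[i]
    end
    return nothing
end

function fill_from_range!(dst::CuVector, r::AbstractRange)
    nthreads = 256
    nblocks = cld(length(dst), nthreads)
    @cuda threads=nthreads blocks=nblocks fill_from_range_kernel!(dst, r)
    return dst
end

# usage: fill_from_range!(CuVector{Int}(undef, 10_000), 1:10_000)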

Is there a convenient function for this? Something like:

function foo(AA::AbstractArray{T, N}) where {T, N}
    AAc = CuArray{T, N}(undef, size(AA)...)
    AAc .= AA
    return AAc
end
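For example, assuming the corrected version above:

julia> foo(1:10_000) isa CuVector{Int}
true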

I don’t think 2 lines of code warrant a convenience function? If you care, you can always create an additional outer constructor, à la https://github.com/JuliaGPU/CUDA.jl/blob/e00ad245d96763ebac4cea2ce4a9c7d7b722bd58/src/array.jl#L286-L290. If this really has a significant impact, you can create a PR on CUDA.jl 🙂
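Such a constructor might look roughly like this (a sketch in the spirit of the linked code, not part of CUDA.jl itself):

using CUDA

# Illustrative outer constructor: allocate on the device and fill via
# broadcast, skipping the intermediate CPU Array.
function CUDA.CuArray{T}(r::AbstractRange) where {T}
    A = CuArray{T}(undef, length(r))
    A .= r
    return A
end

CUDA.CuArray(r::AbstractRange) = CuArray{eltype(r)}(r)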
