How do I get allocations down but keep code speed?

Ahmed_Salih · January 28, 2023, 2:08pm

using CUDA, StaticArrays, Flux

N = 10^6
T = Float32

WgL = CuArray(rand(SVector{3,T},N))

I_ = CuArray(rand(1:6195,N))
J_ = CuArray(rand(1:6195,N))

WGL = CuArray(zeros(SVector{3,T},N))

CUDA.@allocated WGL .= WgL[I_] - WgL[J_]
36000000

CUDA.@allocated WGL .= WgL[I_] .- WgL[J_]
24000000

@CUDA.allocated WGL .= view(WgL, I_) .- view(WgL, J_)
0

Where the view approach fixes the issue with allocations, but as I keep running my simulation the view approach for some reason becomes much slower. Anyone knows how to do this computation properly?

Kind regards

Ahmed_Salih · January 28, 2023, 2:59pm

I guess what I am trying to ask;

How do I broadcast properly on GPU and save operations in-place without intermediate arrays, as this syntax would have done on CPU?

Kind regards

stevengj · January 28, 2023, 4:41pm

You can always write your own loop, no? Introduction · CUDA.jl

Ahmed_Salih · January 28, 2023, 4:47pm

Yes, but this seems to be an extremely basic operation?

Would I really need to make a kernel for this?

I am sure I could figure out how to do the kernel in a day or two, but as far as I know, Julia has been broadcasted as a programming language with “first class” support for native GPU operations.

I must be doing something wrong because I would be very surprised if there is no way to do this without allocating.

Kind regards

stevengj · January 28, 2023, 7:42pm

Being able to write your own high-performance kernels, rather than being limited to a small menu of “vectorized built-ins”, is first-class support.

There may be a way to do what you want using built in CUDA.jl functions, of course; I’m not an expert in that package. But in the long run writing custom loops will give maximum flexibility and performance.

ToucheSir · January 28, 2023, 8:03pm

Before going any further, it’s necessary to know what behaviour you want out of this operation. Currently, the way you’re generating indices means that there will be duplicates. I presume you don’t want multiple values from the source array overwriting each other in the destination array, so what is supposed to happen when there are duplicates? The answer to this will dictate what functions you can use and whether you’ll need to write a custom kernel.

Ahmed_Salih · January 28, 2023, 8:16pm

For now, I just want to perform the calculation shown in the top of the post.

You are right, that there is a multiple of the same calculation due to the nature of the provided example, but in my code I am debugging I_ and J_ will hold unique combinations only. So what is provided is just an example.

Ideally, I just want to understand why I see allocations using CUDA.@allocated, when it is a in-place operation, because using @CUDA.time, I don’t see allocs. So I am just confused

Kind regards

DNF · January 28, 2023, 8:19pm

What do you mean? You just showed that there are zero allocations?

Ahmed_Salih · January 28, 2023, 8:24pm

Yes, true. I left out why I was confused

So when I used the approach you quote with a view, then for some reason all allocations were instead stored on CPU side, which meant that the code became much slower. I found this through use of @CUDA.time.

That was not part of the original question of course, but my question right now is;

Why is it slower when I use a view with “0” allocations? Is it due to the CPU taking over for some reason?

Perhaps I just start looking into learning how to make my own kernels and see if I understand better after

Kind regards

ToucheSir · January 28, 2023, 9:11pm

I’m not sure if the CUDA.jl docs mention this, but a good practice is to always call CUDA.allowscalar(false) before working with GPU code so that these accesses become errors instead of silently slowing down your code (assuming you didn’t see the scalar indexing warning).

Back on topic, the underlying reason for the slowness is that CUDA.jl doesn’t support many types of non-contiguous views. Because your I_ and J_ are not the same in the MWE and you do some operation (-) before writing to WGL, there’s not much you can do other than write your own kernel here. If any of those requirements don’t hold in your actual task, then it’s possible there’s a solution which wouldn’t require any custom kernels.

Topic		Replies	Views
Inplace array modification performances GPU	4	619	July 20, 2022
Dot-product of CuArray views is slow GPU performance , memory-allocation , views	10	1539	May 11, 2021
Why does GPU allocate so much for me? GPU	1	257	January 28, 2023
CUDA arrays not working well with broadcast!(), and other in-place operations inside a loop GPU gpu , broadcast , loops	4	738	June 1, 2022
Correct utilisation of CUDA kernel for simulations GPU	16	580	February 13, 2025

How do I get allocations down but keep code speed?

Related topics