How do I get allocations down but keep code speed?

using CUDA, StaticArrays, Flux

N = 10^6
T = Float32

WgL = CuArray(rand(SVector{3,T},N))

I_ = CuArray(rand(1:6195,N))
J_ = CuArray(rand(1:6195,N))

WGL = CuArray(zeros(SVector{3,T},N))

CUDA.@allocated WGL .= WgL[I_] - WgL[J_]

CUDA.@allocated WGL .= WgL[I_] .- WgL[J_]

@CUDA.allocated WGL .= view(WgL, I_) .- view(WgL, J_)

Where the view approach fixes the issue with allocations, but as I keep running my simulation the view approach for some reason becomes much slower. Anyone knows how to do this computation properly?

Kind regards

I guess what I am trying to ask;

How do I broadcast properly on GPU and save operations in-place without intermediate arrays, as this syntax would have done on CPU?

Kind regards

You can always write your own loop, no? Introduction · CUDA.jl

Yes, but this seems to be an extremely basic operation?

Would I really need to make a kernel for this?

I am sure I could figure out how to do the kernel in a day or two, but as far as I know, Julia has been broadcasted as a programming language with “first class” support for native GPU operations.

I must be doing something wrong because I would be very surprised if there is no way to do this without allocating.

Kind regards

Being able to write your own high-performance kernels, rather than being limited to a small menu of “vectorized built-ins”, is first-class support.

There may be a way to do what you want using built in CUDA.jl functions, of course; I’m not an expert in that package. But in the long run writing custom loops will give maximum flexibility and performance.


Before going any further, it’s necessary to know what behaviour you want out of this operation. Currently, the way you’re generating indices means that there will be duplicates. I presume you don’t want multiple values from the source array overwriting each other in the destination array, so what is supposed to happen when there are duplicates? The answer to this will dictate what functions you can use and whether you’ll need to write a custom kernel.

For now, I just want to perform the calculation shown in the top of the post.

You are right, that there is a multiple of the same calculation due to the nature of the provided example, but in my code I am debugging I_ and J_ will hold unique combinations only. So what is provided is just an example.

Ideally, I just want to understand why I see allocations using CUDA.@allocated, when it is a in-place operation, because using @CUDA.time, I don’t see allocs. So I am just confused

Kind regards

What do you mean? You just showed that there are zero allocations?

Yes, true. I left out why I was confused :sweat_smile:

So when I used the approach you quote with a view, then for some reason all allocations were instead stored on CPU side, which meant that the code became much slower. I found this through use of @CUDA.time.

That was not part of the original question of course, but my question right now is;

Why is it slower when I use a view with “0” allocations? Is it due to the CPU taking over for some reason?

Perhaps I just start looking into learning how to make my own kernels and see if I understand better after :slight_smile:

Kind regards

I’m not sure if the CUDA.jl docs mention this, but a good practice is to always call CUDA.allowscalar(false) before working with GPU code so that these accesses become errors instead of silently slowing down your code (assuming you didn’t see the scalar indexing warning).

Back on topic, the underlying reason for the slowness is that CUDA.jl doesn’t support many types of non-contiguous views. Because your I_ and J_ are not the same in the MWE and you do some operation (-) before writing to WGL, there’s not much you can do other than write your own kernel here. If any of those requirements don’t hold in your actual task, then it’s possible there’s a solution which wouldn’t require any custom kernels.