Hey Julianners,
Can you help me with how to do fast array addition?
I believe this is the fastest way:
using CUDA
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 1.00

# One thread per element: c[I] += a[I] + b[I]
add_cab(c, a, b) = @inbounds begin
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > size(c, 1) && return
    ca = CUDA.Const(a)   # mark the inputs as read-only
    cb = CUDA.Const(b)
    c[I] += ca[I] + cb[I]
    nothing
end

nth = 512
N = 1_000_000
CG = CUDA.ones(Float32, N)
AG = CUDA.ones(Float32, N)
BG = CUDA.ones(Float32, N)
ITER_GFLOPS = N * 2 / 1_000_000_000   # 2 flops per element

CUDA.@sync @cuda threads=nth blocks=cld(N, nth) add_cab(CG, AG, BG)
b = @belapsed CUDA.@sync @cuda threads=nth blocks=cld(N, nth) add_cab(CG, AG, BG)
println(ITER_GFLOPS / b) # 3090 Ti: 387.48 GFLOPS. This very much sounds like a bandwidth problem (roughly 400 GFLOPS out of the theoretical 40,000 GFLOPS).
Instead of 40,000 GFLOPS it is 387 GFLOPS.
Or am I doing something wrong?
(Note: I know that with FMA I can reach 2x, and that by using registers and running the calculation in a for loop 200 times I could reach better speed, but I am not interested in those scenarios.)
I don't think the bottleneck is compute, but rather memory bandwidth. The 3090 Ti has about 1000 GB/s, IIRC, so divided by 4 bytes per Float32 and 3 reads/writes per operation, that gives about 80 GFLOPS if nothing is in cache, but higher if (parts of) the arrays are already in cache, which depends on the surrounding code and the size of the arrays.
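For a rough sanity check, here is the same back-of-the-envelope estimate in a few lines (the ~1000 GB/s peak bandwidth is the assumption here):

bandwidth_GBps = 1000      # approximate 3090 Ti global memory bandwidth
bytes_per_value = 4        # Float32
accesses_per_element = 3   # read a, read b, read/write c
println(bandwidth_GBps / bytes_per_value / accesses_per_element)  # ≈ 83, i.e. tens of GFLOPS, nowhere near 40,000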
I tried some other ideas with ChatGPT, and the two things it suggested were using Tensor Cores and raising the compiler optimization level to -O3, but I don't know if the latter matters on the GPU.
Also, it notes that Tensor Cores sometimes cannot be used; in this simple example we could use them, but they are pretty tedious to use.
Also, as for Tensor Cores, this is the example code I found on the internet, which pretty much doesn't work:
using CUDA
# Generate input matrices
a = rand(Float16, (16, 16))
a_dev = CuArray(a)
b = rand(Float16, (16, 16))
b_dev = CuArray(b)
c = rand(Float32, (16, 16))
c_dev = CuArray(c)
# Allocate space for result
d_dev = similar(c_dev)
# Matrix multiply-accumulate kernel (D = A * B + C)
function kernel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k16_stride_f16(pointer(b_dev), 16)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k16_stride_f32(pointer(c_dev), 16)
    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)
    WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f32(pointer(d_dev), d_frag, 16)
    return
end
@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
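For reference, CUDA.jl also has a higher-level WMMA API (WMMA.Config, WMMA.load_a/load_b/load_c, WMMA.mma, WMMA.store_d) that looks less tedious than the raw LLVM intrinsics. A minimal sketch along the lines of the documented example (untested on my side, assumes a WMMA-capable GPU and a reasonably recent CUDA.jl):

using CUDA
a_dev = CuArray(rand(Float16, 16, 16))
b_dev = CuArray(rand(Float16, 16, 16))
c_dev = CuArray(rand(Float32, 16, 16))
d_dev = similar(c_dev)

# D = A * B + C on the tensor cores; one warp handles the whole 16x16x16 tile
function wmma_kernel(a, b, c, d)
    conf = WMMA.Config{16, 16, 16, Float32}
    a_frag = WMMA.load_a(pointer(a), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c), 16, WMMA.ColMajor, conf)
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
    WMMA.store_d(pointer(d), d_frag, 16, WMMA.ColMajor, conf)
    return
end

@cuda threads=32 wmma_kernel(a_dev, b_dev, c_dev, d_dev)

Though I guess Tensor Cores do matrix multiply-accumulate, so I'm not sure they help with a plain element-wise addition anyway.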
@Sixzero I love the answer, but does it give better speed?
Yes, it should be a memory-bandwidth bottleneck. My problem is that it is extremely slow compared to the marketed 40,000 GFLOPS theoretical value, which feels totally misleading in the end, as a simple addition basically cannot be faster than 1% of the theoretical speed.
Or am I missing something? That is basically my question…
Just to make sure of the memory boundedness of the calculation, can you replace one or both of the arrays with computed counters (i.e. no memory transfer needed) and see the performance increase?
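Something like this hypothetical variant (reusing nth, N and CG from your snippet): the "inputs" are synthesized from the thread index, so the kernel only touches one array instead of three:

# Computed counters instead of loads from a and b; only c is read/written.
add_counters(c) = @inbounds begin
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > size(c, 1) && return
    a = Float32(I)       # stand-in for a[I]
    b = Float32(2I)      # stand-in for b[I]
    c[I] += a + b
    nothing
end

CUDA.@sync @cuda threads=nth blocks=cld(N, nth) add_counters(CG)

If the GFLOPS number goes up substantially with the same arithmetic, that confirms you are paying for memory traffic, not compute.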
The problem with using temporary values is that, even then, you still have to write into the appropriate memory. So I don't understand how I could improve the memory usage to minimize the memory bandwidth. I tried using strides… and many other things… but I couldn't reach higher GFLOPS… It is disappointing.
So I had some hope for better cache usage or something, but I just cannot achieve it. Can't this simple problem be made faster?
Or can shared memory help us in any way? I guess it is useful for problems like reductions.
If we could create the fastest possible addition here, we could improve our code everywhere based on this one good example.
I do mean() and many more custom transformations, and I also need to use the values of the summations and special structs, so it is barely possible to fuse it (although there is a very bad and ugly way to do it using only 1 block and some array flattening).
So that isn't viable for me, and it is really slow that way.
I am interested in whether we can make this addition faster, so I could get a 10-100x speedup if we can use the caching effectively or something.
There’s no magical way to “better use memory”; your kernel is purely reading from and writing to global memory, so is going to be horribly memory bound. The theoretical GFlops are irrelevant here. You’re supposed to increase arithmetic intensity, as @Oscar_Smith mentions, e.g. by fusing kernels together.
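As a toy illustration (not your real use case), the idea is to do more math per element for the same three global-memory accesses, or to let broadcasting fuse several element-wise operations into a single kernel:

# More flops per element, same memory traffic as the plain addition.
fused(c, a, b) = @inbounds begin
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > size(c, 1) && return
    x = a[I] + b[I]
    c[I] += x + sin(x) * b[I]   # the extra arithmetic is nearly free; the loads/stores are not
    nothing
end

# Or let dot-broadcasting do the fusion for you:
CUDA.@sync @. CG += (AG + BG) + sin(AG + BG) * BG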
Yeah, and accessing memory in a coalesced way seems like the only other thing that could still matter.
I don't know if shared memory or things like that can be used in any way here; I would be curious if someone could link some good example usage of them!
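The closest thing I have found so far is the usual block-level reduction pattern, where each block stages its chunk in shared memory and combines it with a tree reduction. A rough, untuned sketch (CuStaticSharedArray is the current name; older CUDA.jl versions used @cuStaticSharedMem):

# Each block reduces its 256-element chunk of x into one partial sum in `out`.
function block_sum!(out, x)
    tid = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + tid
    shmem = CuStaticSharedArray(Float32, 256)   # one slot per thread
    shmem[tid] = i <= length(x) ? x[i] : 0.0f0
    sync_threads()
    s = blockDim().x ÷ 2
    while s >= 1                                # tree reduction within the block
        if tid <= s
            shmem[tid] += shmem[tid + s]
        end
        sync_threads()
        s ÷= 2
    end
    tid == 1 && (out[blockIdx().x] = shmem[1])
    return
end

x = CUDA.rand(Float32, 4096)
out = CUDA.zeros(Float32, cld(length(x), 256))
CUDA.@sync @cuda threads=256 blocks=cld(length(x), 256) block_sum!(out, x)
sum(out) ≈ sum(x)   # combine the per-block partial sums on the host

But as far as I understand, shared memory helps when threads need to share or reuse data (reductions, tiling), not for a pure streaming addition where every value is touched exactly once.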