Hey Julianners,
Can you help me with how to do fast array addition?
I believe this is the fastest way:
using CUDA
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 1.00

# One thread per element: c[I] += a[I] + b[I]
add_cab(c, a, b) = @inbounds begin
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > size(c, 1) && return
    ca = CUDA.Const(a)   # mark the inputs as read-only
    cb = CUDA.Const(b)
    c[I] += ca[I] + cb[I]
    nothing
end

nth = 512
N = 1_000_000
CG = CUDA.ones(Float32, N)
AG = CUDA.ones(Float32, N)
BG = CUDA.ones(Float32, N)
ITER_GFLOPS = N * 2 / 1_000_000_000   # 2 flops per element

CUDA.@sync @cuda threads=nth blocks=cld(N, nth) add_cab(CG, AG, BG)
b = @belapsed CUDA.@sync @cuda threads=nth blocks=cld(N, nth) add_cab(CG, AG, BG)
println(ITER_GFLOPS / b) # 3090 Ti: 387.48 GFLOPS. This very much sounds like a bandwidth problem (roughly 400 GFLOPS out of the theoretical 40,000 GFLOPS).
Instead of 40,000 GFLOPS it is 387 GFLOPS.
Or am I doing something wrong?
(Note: I know that with FMA I can reach 2x, and that by using registers and running the calculation in a for loop 200 times I could reach better speed, but I am not interested in those scenarios.)
I don't think the bottleneck is compute, but rather memory bandwidth. The 3090 Ti has about 1000 GB/s, IIRC, so divided by 4 bytes per Float32 and 3 reads/writes per operation, that gives about 80 GFLOPS if nothing is in cache, but higher if (parts of) the arrays are already in cache, which depends on the surrounding code and the size of the arrays.
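For a rough sanity check, here is the same back-of-the-envelope estimate in a few lines (the ~1000 GB/s peak bandwidth is the assumption here):

bandwidth_GBps = 1000      # approximate 3090 Ti global memory bandwidth
bytes_per_value = 4        # Float32
accesses_per_element = 3   # read a, read b, read/write c
println(bandwidth_GBps / bytes_per_value / accesses_per_element)  # ≈ 83, i.e. tens of GFLOPS, nowhere near 40,000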
I tried some other ideas with ChatGPT, and the two things it suggested were using Tensor Cores and raising the compiler optimization level to -O3, but I don't know if the latter matters on the GPU.
Also, it notes that Tensor Cores sometimes cannot be used; in this simple example we could use them, but they are pretty tedious to use.
Also, as for Tensor Cores, this is the example code I found on the internet, which pretty much doesn't work:
using CUDA
# Generate input matrices
a = rand(Float16, (16, 16))
a_dev = CuArray(a)
b = rand(Float16, (16, 16))
b_dev = CuArray(b)
c = rand(Float32, (16, 16))
c_dev = CuArray(c)
# Allocate space for result
d_dev = similar(c_dev)
# Matrix multiply-accumulate kernel (D = A * B + C)
function kernel(a_dev, b_dev, c_dev, d_dev)
    a_frag = WMMA.llvm_wmma_load_a_col_m16n16k16_stride_f16(pointer(a_dev), 16)
    b_frag = WMMA.llvm_wmma_load_b_col_m16n16k16_stride_f16(pointer(b_dev), 16)
    c_frag = WMMA.llvm_wmma_load_c_col_m16n16k16_stride_f32(pointer(c_dev), 16)
    d_frag = WMMA.llvm_wmma_mma_col_col_m16n16k16_f32_f32(a_frag, b_frag, c_frag)
    WMMA.llvm_wmma_store_d_col_m16n16k16_stride_f32(pointer(d_dev), d_frag, 16)
    return
end
@cuda threads=32 kernel(a_dev, b_dev, c_dev, d_dev)
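For reference, CUDA.jl also has a higher-level WMMA API (WMMA.Config, WMMA.load_a/load_b/load_c, WMMA.mma, WMMA.store_d) that looks less tedious than the raw LLVM intrinsics. A minimal sketch along the lines of the documented example (untested on my side, assumes a WMMA-capable GPU and a reasonably recent CUDA.jl):

using CUDA
a_dev = CuArray(rand(Float16, 16, 16))
b_dev = CuArray(rand(Float16, 16, 16))
c_dev = CuArray(rand(Float32, 16, 16))
d_dev = similar(c_dev)

# D = A * B + C on the tensor cores; one warp handles the whole 16x16x16 tile
function wmma_kernel(a, b, c, d)
    conf = WMMA.Config{16, 16, 16, Float32}
    a_frag = WMMA.load_a(pointer(a), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c), 16, WMMA.ColMajor, conf)
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
    WMMA.store_d(pointer(d), d_frag, 16, WMMA.ColMajor, conf)
    return
end

@cuda threads=32 wmma_kernel(a_dev, b_dev, c_dev, d_dev)

Though I guess Tensor Cores do matrix multiply-accumulate, so I'm not sure they help with a plain element-wise addition anyway.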
@Sixzero I love the answer, but does it give better speed?
Yes, it should be a memory-bandwidth bottleneck. My problem is that it is extremely slow compared to the marketed 40,000 GFLOPS theoretical value, which feels totally misleading in the end, as a simple addition basically cannot be faster than 1% of the theoretical speed.
Or am I missing something? That is basically my question…
Just to make sure of the memory boundedness of the calculation, can you replace one or both of the arrays with computed counters (i.e. no memory transfer needed) and see the performance increase?
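Something like this hypothetical variant (reusing nth, N and CG from your snippet): the "inputs" are synthesized from the thread index, so the kernel only touches one array instead of three:

# Computed counters instead of loads from a and b; only c is read/written.
add_counters(c) = @inbounds begin
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > size(c, 1) && return
    a = Float32(I)       # stand-in for a[I]
    b = Float32(2I)      # stand-in for b[I]
    c[I] += a + b
    nothing
end

CUDA.@sync @cuda threads=nth blocks=cld(N, nth) add_counters(CG)

If the GFLOPS number goes up substantially with the same arithmetic, that confirms you are paying for memory traffic, not compute.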
The problem with using temporary values is that, even then, you still have to write into the appropriate memory. So I don't understand how I could improve the memory usage to minimize the memory bandwidth. I tried using strides… and many other things… but I couldn't reach higher GFLOPS… It is disappointing.
So I had some hope for better cache usage or something, but I just cannot achieve it. Can't this simple problem be made faster?
Or can shared memory help us in any way? I guess it is useful for problems like reductions.
If we could create the fastest possible addition here, we could improve our code everywhere based on this one good example.
I do mean() and many more custom transformations, and I also need to use the values of the summations and special structs, so it is barely possible to fuse it (although there is a very bad and ugly way to do it using only 1 block and some array flattening).
So that isn't viable for me, and it is really slow that way.
I am interested in whether we can make this addition faster, so I could get a 10-100x speedup if we can use the caching effectively or something.
There’s no magical way to “better use memory”; your kernel is purely reading from and writing to global memory, so is going to be horribly memory bound. The theoretical GFlops are irrelevant here. You’re supposed to increase arithmetic intensity, as @Oscar_Smith mentions, e.g. by fusing kernels together.
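As a toy illustration (not your real use case), the idea is to do more math per element for the same three global-memory accesses, or to let broadcasting fuse several element-wise operations into a single kernel:

# More flops per element, same memory traffic as the plain addition.
fused(c, a, b) = @inbounds begin
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > size(c, 1) && return
    x = a[I] + b[I]
    c[I] += x + sin(x) * b[I]   # the extra arithmetic is nearly free; the loads/stores are not
    nothing
end

# Or let dot-broadcasting do the fusion for you:
CUDA.@sync @. CG += (AG + BG) + sin(AG + BG) * BG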
Yeah, and accessing memory in a coalesced way seems like the only other thing that could still matter.
I don't know if shared memory or things like that can be used in any way here; I would be curious if someone could link some good example usage of them!
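The closest thing I have found so far is the usual block-level reduction pattern, where each block stages its chunk in shared memory and combines it with a tree reduction. A rough, untuned sketch (CuStaticSharedArray is the current name; older CUDA.jl versions used @cuStaticSharedMem):

# Each block reduces its 256-element chunk of x into one partial sum in `out`.
function block_sum!(out, x)
    tid = threadIdx().x
    i = (blockIdx().x - 1) * blockDim().x + tid
    shmem = CuStaticSharedArray(Float32, 256)   # one slot per thread
    shmem[tid] = i <= length(x) ? x[i] : 0.0f0
    sync_threads()
    s = blockDim().x ÷ 2
    while s >= 1                                # tree reduction within the block
        if tid <= s
            shmem[tid] += shmem[tid + s]
        end
        sync_threads()
        s ÷= 2
    end
    tid == 1 && (out[blockIdx().x] = shmem[1])
    return
end

x = CUDA.rand(Float32, 4096)
out = CUDA.zeros(Float32, cld(length(x), 256))
CUDA.@sync @cuda threads=256 blocks=cld(length(x), 256) block_sum!(out, x)
sum(out) ≈ sum(x)   # combine the per-block partial sums on the host

But as far as I understand, shared memory helps when threads need to share or reuse data (reductions, tiling), not for a pure streaming addition where every value is touched exactly once.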