Hello everyone, I have recently been trying to GPUify my code.
I first tried replacing my CPU arrays with CuArrays, as recommended; however, that resulted in slower program execution.
So I'm trying to dig a bit deeper to see where the GPU is not performing as well.
One bottleneck I have experienced is just a simple multiplication.
```julia
# Testing matmul
using CUDA

N = 1_000_000
dim = 2

A = randn(Float32, dim, dim)
B = randn(Float32, dim, N)
C = zeros(Float32, dim, N)

AC = CUDA.randn(Float32, dim, dim)
BC = CUDA.randn(Float32, dim, N)
CC = CUDA.zeros(Float32, dim, N)

function cpu_matmul(C, A, B)
    C = A * B
end
@time cpu_matmul(C, A, B)

function gpu_matmul(CC, AC, BC)
    CC = AC * BC
end
CUDA.@time gpu_matmul(CC, AC, BC)
```
which gives me the following output:
```
0.005610 seconds (1.19 k allocations: 7.691 MiB)
0.017618 seconds (1.34 k CPU allocations: 68.410 KiB) (1 GPU allocation: 7.629 MiB, 0.03% gc time)
```
Based on this, I feel like I am doing something incorrectly. Does anyone have suggestions for improving the GPU matmul performance?
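For what it's worth, a more controlled benchmark sketch is below. This is an assumption about what a fairer comparison might look like, not a fix: it uses `BenchmarkTools.@btime` (which runs multiple samples and discards compilation time), `LinearAlgebra.mul!` to write the product into the preallocated output instead of allocating a new array, and `CUDA.@sync` so the timing waits for the asynchronous GPU kernel to actually finish rather than just the launch.

```julia
using CUDA, LinearAlgebra, BenchmarkTools

N = 1_000_000
dim = 2

AC = CUDA.randn(Float32, dim, dim)
BC = CUDA.randn(Float32, dim, N)
CC = CUDA.zeros(Float32, dim, N)

# mul! stores AC * BC into CC in place, avoiding a fresh GPU allocation
# on every call. CUDA.@sync blocks until the kernel completes, so the
# measured time reflects the computation, not just queueing the launch.
@btime CUDA.@sync mul!($CC, $AC, $BC)
```

Note that with `dim = 2` the matmul is extremely skinny (2x2 times 2x1_000_000), which is a memory-bound shape where a GPU may have little room to outperform a well-vectorized CPU BLAS call.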