Hello everyone, I have recently been trying to GPUify my code.

I first tried replacing my CPU Arrays with CuArrays, as recommended. However, that made the program run slower.

So I’m trying to dig a bit deeper and see where the GPU is not performing as well as the CPU.

One bottleneck I have found is a simple matrix multiplication.

```
# Testing matmul
using CUDA

N = 1_000_000
dim = 2

# CPU arrays
A = randn(Float32, dim, dim)
B = randn(Float32, dim, N)
C = zeros(Float32, dim, N)

# GPU arrays
AC = CUDA.randn(Float32, dim, dim)
BC = CUDA.randn(Float32, dim, N)
CC = CUDA.zeros(Float32, dim, N)

function cpu_matmul(C, A, B)
    C = A * B
end
@time cpu_matmul(C, A, B)

function gpu_matmul(CC, AC, BC)
    CC = AC * BC
end
CUDA.@time gpu_matmul(CC, AC, BC)
```
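
A side note on the two functions above, as a minimal CPU-only sketch. As far as I understand, `C = A * B` inside a function rebinds the local name `C` to a freshly allocated result and leaves the caller's array untouched, whereas `mul!` from the `LinearAlgebra` standard library writes into the preallocated output. The names `rebind_matmul` and `inplace_matmul!` are hypothetical and exist only for this illustration.

```julia
using LinearAlgebra  # standard library; provides mul!

# Hypothetical helper: same pattern as cpu_matmul above.
# `C = A * B` allocates a new array and rebinds the local name C.
function rebind_matmul(C, A, B)
    C = A * B
    return C
end

# Hypothetical helper: writes the product into the preallocated C.
function inplace_matmul!(C, A, B)
    mul!(C, A, B)
    return C
end

A = randn(Float32, 2, 2)
B = randn(Float32, 2, 1_000)
C = zeros(Float32, 2, 1_000)

rebind_matmul(C, A, B)
rebind_left_zeros = all(iszero, C)   # caller's C was not modified

inplace_matmul!(C, A, B)
inplace_matches = C ≈ A * B          # caller's C now holds the product
```

If that is right, the preallocated `C` and `CC` in my test are never actually filled, and each call allocates a new output array.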

This gives me the following output:

```
0.005610 seconds (1.19 k allocations: 7.691 MiB)
0.017618 seconds (1.34 k CPU allocations: 68.410 KiB) (1 GPU allocation: 7.629 MiB, 0.03% gc time)
```
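
The allocation counts in this output suggest the timings may include first-call compilation, since each function is timed on its very first call. Below is a minimal sketch of how the measurement could be repeated. It assumes a warm-up call before timing, `mul!` for an in-place product, and `CUDA.@sync` so that the GPU work has finished before the clock stops. The helper name `time_after_warmup` is hypothetical, and the GPU part only runs if `CUDA.functional()` reports a usable device.

```julia
using CUDA, LinearAlgebra

# Hypothetical helper: call f once to trigger compilation,
# then return the elapsed time (in seconds) of a second call.
function time_after_warmup(f)
    f()                  # warm-up run; may include JIT compilation
    return @elapsed f()  # timed run
end

N, dim = 1_000_000, 2
A = randn(Float32, dim, dim)
B = randn(Float32, dim, N)
C = zeros(Float32, dim, N)

t_cpu = time_after_warmup(() -> mul!(C, A, B))
println("CPU: ", t_cpu, " s")

if CUDA.functional()
    AC = CuArray(A)
    BC = CuArray(B)
    CC = CUDA.zeros(Float32, dim, N)
    # CUDA.@sync blocks until the GPU has finished the multiplication.
    t_gpu = time_after_warmup(() -> CUDA.@sync mul!(CC, AC, BC))
    println("GPU: ", t_gpu, " s")
end
```

A benchmarking package that runs many samples would likely give steadier numbers than a single timed call, but this keeps the sketch dependency-free beyond CUDA.jl.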

Based on this, I feel like I am doing something incorrectly. Does anyone have suggestions on increasing the GPU matmul performance?