Hello,

I would like to accelerate multiplication of matrices that are larger than can fit in VRAM, but can fit in normal RAM. I would like to eventually extend this to general tensor contractions as well, but for now I am sticking with matrix multiplications as a model problem.

Currently, I’ve implemented a tiled matrix multiplication using the high level CuArrays and some simple loops. It is nothing fancy, and relies on the tile size being an even divisor of the matrix size. I plan to wrap this function such that padding of an input matrix is performed automatically, but for now I would just like to focus on the simpler cases.

I am looking for suggestions on how to improve this code, or if I should start over with some kernel based approach. Also, is it possible to use multiple threads to initiate data transfers and tile multiplications so that I can more effectively saturate my GPU?

```
using CuArrays, CUDAdrv, LinearAlgebra
function GSGEMM!(C,A,B,blocksize=1024)
C .= zero(Float32)
m = size(A,1)
k = size(A,2)
n = size(B,2)
device = CuDevice(0)
mem = CUDAdrv.totalmem(device)
minMem = 3*blocksize*blocksize*4
temp = CuArrays.zeros(blocksize,blocksize)
@inbounds @fastmath for K=1:blocksize:k
for i=1:blocksize:m
if CUDAdrv.available_memory() < minMem
GC.gc()
end
@views _A = CuArray(A[i:i+blocksize-1,K:K+blocksize-1])
for j=1:blocksize:n
@views _B = CuArray(B[K:K+blocksize-1,j:j+blocksize-1])
mul!(temp,_A,_B)
C[i:i+blocksize-1,j:j+blocksize-1] .+= Array(temp)
end
end
end
end
```