I would like to accelerate multiplication of matrices that are too large to fit in VRAM but do fit in host RAM. Eventually I would like to extend this to general tensor contractions, but for now I am sticking with matrix multiplication as a model problem.
Currently, I’ve implemented a tiled matrix multiplication using the high-level CuArrays interface and some simple loops. It is nothing fancy, and it relies on the tile size evenly dividing every matrix dimension. I plan to wrap this function so that input matrices are padded automatically, but for now I would like to focus on the simpler case.
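For reference, the padding wrapper I have in mind might look roughly like the sketch below. The helper names `padded_length` and `pad_to_blocksize` are hypothetical (they are not part of CuArrays), and the example runs on the CPU only. It relies on the fact that zero-padding does not change the top-left block of the product.

```julia
# Hypothetical helper: round `n` up to the next multiple of `blocksize`.
padded_length(n, blocksize) = cld(n, blocksize) * blocksize

# Hypothetical helper: copy `A` into the top-left corner of a zero matrix
# whose dimensions are both multiples of `blocksize`.
function pad_to_blocksize(A::AbstractMatrix{T}, blocksize::Integer) where {T}
    m, n = size(A)
    P = zeros(T, padded_length(m, blocksize), padded_length(n, blocksize))
    P[1:m, 1:n] .= A
    return P
end

A = rand(Float32, 5, 7)
B = rand(Float32, 7, 3)

Ap = pad_to_blocksize(A, 4)   # 8×8
Bp = pad_to_blocksize(B, 4)   # 8×4

# The extra rows and columns are zero, so they contribute nothing to the
# product; the result of interest is the top-left 5×3 block.
Cp = Ap * Bp
C = Cp[1:5, 1:3]
```

The padded copies cost extra host memory, so for very large inputs it may be preferable to pad only the edge tiles as they are uploaded.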
I am looking for suggestions on how to improve this code, or on whether I should start over with a kernel-based approach. Also, is it possible to use multiple threads to initiate the data transfers and tile multiplications so that I can saturate my GPU more effectively?
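```julia
using CuArrays, CUDAdrv, LinearAlgebra

function GSGEMM!(C, A, B, blocksize=1024)
    C .= zero(Float32)
    m = size(A, 1)
    k = size(A, 2)
    n = size(B, 2)
    device = CuDevice(0)
    mem = CUDAdrv.totalmem(device)
    # Bytes needed for three Float32 tiles (A tile, B tile, result tile)
    minMem = 3 * blocksize * blocksize * 4
    # Device buffer that receives each tile product
    temp = CuArrays.zeros(blocksize, blocksize)
    @inbounds @fastmath for K = 1:blocksize:k
        for i = 1:blocksize:m
            # Run the garbage collector when device memory is running low
            if CUDAdrv.available_memory() < minMem
                GC.gc()
            end
            # Upload one tile of A and reuse it across the inner loop
            @views _A = CuArray(A[i:i+blocksize-1, K:K+blocksize-1])
            for j = 1:blocksize:n
                @views _B = CuArray(B[K:K+blocksize-1, j:j+blocksize-1])
                mul!(temp, _A, _B)
                # Copy the tile product back and accumulate it on the host
                C[i:i+blocksize-1, j:j+blocksize-1] .+= Array(temp)
            end
        end
    end
end
```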
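To sanity-check the tiling logic without a GPU, a CPU-only analogue of the same loop structure can be compared against a plain `A * B`. This is only a sketch: `tiled_matmul_cpu!` is a hypothetical name, it uses views and the five-argument `mul!` (Julia 1.3 or later) in place of the device transfers, and it makes the same assumption that `blocksize` divides every dimension.

```julia
using LinearAlgebra

# Hypothetical CPU reference: same K/i/j tile ordering as the GPU version,
# but accumulating directly into views of C instead of copying tiles around.
function tiled_matmul_cpu!(C, A, B, blocksize)
    m, k = size(A)
    n = size(B, 2)
    @assert size(B, 1) == k && size(C) == (m, n)
    @assert m % blocksize == 0 && k % blocksize == 0 && n % blocksize == 0
    fill!(C, zero(eltype(C)))
    for K in 1:blocksize:k, i in 1:blocksize:m
        Ablk = view(A, i:i+blocksize-1, K:K+blocksize-1)
        for j in 1:blocksize:n
            Bblk = view(B, K:K+blocksize-1, j:j+blocksize-1)
            Cblk = view(C, i:i+blocksize-1, j:j+blocksize-1)
            mul!(Cblk, Ablk, Bblk, true, true)  # Cblk += Ablk * Bblk
        end
    end
    return C
end

A = rand(Float32, 8, 12)
B = rand(Float32, 12, 4)
C = zeros(Float32, 8, 4)
tiled_matmul_cpu!(C, A, B, 4)
```

Comparing `C` against `A * B` with `≈` (rather than `==`) allows for the different floating-point summation order that tiling introduces.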