 # Tiled matrix multiplication

Hello,

I would like to accelerate multiplication of matrices that are larger than can fit in VRAM, but can fit in normal RAM. I would like to eventually extend this to general tensor contractions as well, but for now I am sticking with matrix multiplications as a model problem.

Currently, I’ve implemented a tiled matrix multiplication using the high level CuArrays and some simple loops. It is nothing fancy, and relies on the tile size being an even divisor of the matrix size. I plan to wrap this function such that padding of an input matrix is performed automatically, but for now I would just like to focus on the simpler cases.

I am looking for suggestions on how to improve this code, or if I should start over with some kernel based approach. Also, is it possible to use multiple threads to initiate data transfers and tile multiplications so that I can more effectively saturate my GPU?

``````using CuArrays, CUDAdrv, LinearAlgebra
function GSGEMM!(C,A,B,blocksize=1024)
C .= zero(Float32)
m = size(A,1)
k = size(A,2)
n = size(B,2)
device = CuDevice(0)
minMem = 3*blocksize*blocksize*4
temp = CuArrays.zeros(blocksize,blocksize)
@inbounds @fastmath for K=1:blocksize:k
for i=1:blocksize:m
GC.gc()
end
@views _A = CuArray(A[i:i+blocksize-1,K:K+blocksize-1])
for j=1:blocksize:n
@views _B = CuArray(B[K:K+blocksize-1,j:j+blocksize-1])
mul!(temp,_A,_B)
C[i:i+blocksize-1,j:j+blocksize-1] .+= Array(temp)
end
end
end
end
``````

@kslimes has some experience with this.

For anyone coming across this thread, I’ve improved the code somewhat with the following modifications. 9s -> 6s on my machine GPU, vs 26s CPU `mul!`

``````function GSGEMM!(C,A,B,blocksize=1024)
C .= zero(Float32)
m = size(A,1)
k = size(A,2)
n = size(B,2)
temp = CuArray{eltype(A)}(undef,blocksize,blocksize)
_A = CuArray{eltype(A)}(undef,blocksize,blocksize)
_B = CuArray{eltype(A)}(undef,blocksize,blocksize)
aptr = pointer(_A)
bptr = pointer(_B)
tptr = pointer(temp)
htmp = Array{eltype(A)}(undef,blocksize,blocksize)
hptr = pointer(htmp)
x = nothing
for K=1:blocksize:k
for i=1:blocksize:m
unsafe_copyto!(aptr,pointer(A[i:i+blocksize-1,K:K+blocksize-1]),blocksize*blocksize)
for j=1:blocksize:n
unsafe_copyto!(bptr,pointer(B[K:K+blocksize-1,j:j+blocksize-1]),blocksize*blocksize)
x != nothing ? wait(x) : nothing
mul!(temp,_A,_B)
x = @async begin
unsafe_copyto!(hptr,tptr,blocksize*blocksize)
@fastmath C[i:i+blocksize-1,j:j+blocksize-1] .+= htmp
end
end
end
end
end
``````
4 Likes

Impressive. What kind of GPU do you have?

Thanks! Just a GTX 1060 6GB. Saving up for a nicer card someday. The 6s time is for multiplying two 16384 matrices with a blocksize of 8192. For a blocksize of 4096 its 8.5s .