Tiled matrix multiplication

Hello,

I would like to accelerate multiplication of matrices that are larger than can fit in VRAM, but can fit in normal RAM. I would like to eventually extend this to general tensor contractions as well, but for now I am sticking with matrix multiplications as a model problem.

Currently, I’ve implemented a tiled matrix multiplication using the high level CuArrays and some simple loops. It is nothing fancy, and relies on the tile size being an even divisor of the matrix size. I plan to wrap this function such that padding of an input matrix is performed automatically, but for now I would just like to focus on the simpler cases.

I am looking for suggestions on how to improve this code, or if I should start over with some kernel based approach. Also, is it possible to use multiple threads to initiate data transfers and tile multiplications so that I can more effectively saturate my GPU?

using CuArrays, CUDAdrv, LinearAlgebra
function GSGEMM!(C,A,B,blocksize=1024)
    C .= zero(Float32)
    m = size(A,1)
    k = size(A,2)
    n = size(B,2)
    device = CuDevice(0)
    mem = CUDAdrv.totalmem(device)
    minMem = 3*blocksize*blocksize*4
    temp = CuArrays.zeros(blocksize,blocksize)
    @inbounds @fastmath for K=1:blocksize:k
        for i=1:blocksize:m
            if CUDAdrv.available_memory() < minMem
                GC.gc()
            end
            @views _A = CuArray(A[i:i+blocksize-1,K:K+blocksize-1])
            for j=1:blocksize:n
                @views _B = CuArray(B[K:K+blocksize-1,j:j+blocksize-1])
                mul!(temp,_A,_B)
                C[i:i+blocksize-1,j:j+blocksize-1] .+= Array(temp)
            end
        end
    end
end

@kslimes has some experience with this.

For anyone coming across this thread, I’ve improved the code somewhat with the following modifications. 9s → 6s on my machine GPU, vs 26s CPU mul!

function GSGEMM!(C,A,B,blocksize=1024)
    C .= zero(Float32)
    m = size(A,1)
    k = size(A,2)
    n = size(B,2)
    temp = CuArray{eltype(A)}(undef,blocksize,blocksize)
    _A = CuArray{eltype(A)}(undef,blocksize,blocksize)
    _B = CuArray{eltype(A)}(undef,blocksize,blocksize)
    aptr = pointer(_A)
    bptr = pointer(_B)
    tptr = pointer(temp)
    htmp = Array{eltype(A)}(undef,blocksize,blocksize)
    hptr = pointer(htmp)
    x = nothing
    for K=1:blocksize:k
        for i=1:blocksize:m
            unsafe_copyto!(aptr,pointer(A[i:i+blocksize-1,K:K+blocksize-1]),blocksize*blocksize)
            for j=1:blocksize:n
                unsafe_copyto!(bptr,pointer(B[K:K+blocksize-1,j:j+blocksize-1]),blocksize*blocksize)
                x != nothing ? wait(x) : nothing
                mul!(temp,_A,_B)
                x = @async begin
                    unsafe_copyto!(hptr,tptr,blocksize*blocksize)
                    @fastmath C[i:i+blocksize-1,j:j+blocksize-1] .+= htmp
                end
            end
        end
    end
end
4 Likes

Impressive. What kind of GPU do you have?

Thanks! Just a GTX 1060 6GB. Saving up for a nicer card someday. The 6s time is for multiplying two 16384 matrices with a blocksize of 8192. For a blocksize of 4096 its 8.5s .