Strategy for using multiple threads in an in-place matrix operation

When fitting a linear mixed-effects model with the MixedModels package, most of the time is spent updating a Cholesky factorization of a blocked array. Within that update, a large part of the time is spent on in-place operations on blocks of columns in a dense matrix. For example, in this operation

function rmulΛ!(A::M, B::ReMat{T,S}) where {M<:AbstractMatrix{T}} where {T,S}
    m, n = size(A)
    q, r = divrem(n, S)   # q blocks of S columns each
    iszero(r) || throw(DimensionMismatch("size(A, 2) is not a multiple of block size"))
    Bd = B.λ.data         # data of the S × S lower triangular λ
    @inbounds for k = 1:q
        coloffset = (k - 1) * S
        for i = 1:m
            for j = 1:S
                # dot product of row i of the block with column j of Bd,
                # touching only the lower triangle (l ≥ j)
                Aij = A[i, j + coloffset] * Bd[j, j]
                for l = (j + 1):S
                    Aij += A[i, l + coloffset] * Bd[l, j]
                end
                A[i, j + coloffset] = Aij
            end
        end
    end
    A
end

the matrix Bd is an S × S lower triangular matrix where S is small (usually single digit), but the number of columns in A may be in the thousands. The operation multiplies each successive block of S columns of A in place by Bd on the right; the number of blocks is q.
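
For checking correctness, each block update is equivalent to an out-of-place right-multiplication by the lower triangle of Bd. A minimal reference version might look like this (rmul_blocks_reference! is a hypothetical name; it allocates temporaries, so it is useful only for testing against the in-place version):

using LinearAlgebra

# Reference (allocating) version of the same update: each S-column block
# of A is right-multiplied by the lower triangular part of Bd.
function rmul_blocks_reference!(A::AbstractMatrix, Bd::AbstractMatrix, S::Integer)
    q = size(A, 2) ÷ S
    L = LowerTriangular(Bd)
    for k = 1:q
        cols = (k - 1) * S .+ (1:S)
        A[:, cols] = A[:, cols] * L   # allocates a temporary block
    end
    A
end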

Because this is done in place, it seems natural to use multiple threads. However, the simple approach of inserting Threads.@threads after @inbounds actually slows things down and results in considerably more allocation of storage.
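
For concreteness, the naive modification is sketched below, with the ReMat dispatch replaced by passing Bd and S directly so that the sketch is self-contained:

# Naive threaded version (slower and allocates more, as noted above).
function rmulΛ_naive!(A::AbstractMatrix{T}, Bd::AbstractMatrix{T}, S::Integer) where {T}
    m, n = size(A)
    q = n ÷ S
    @inbounds Threads.@threads for k = 1:q
        coloffset = (k - 1) * S
        for i = 1:m, j = 1:S
            Aij = A[i, j + coloffset] * Bd[j, j]
            for l = (j + 1):S
                Aij += A[i, l + coloffset] * Bd[l, j]
            end
            A[i, j + coloffset] = Aij
        end
    end
    A
end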

So what are good strategies for parallelizing this operation? The number of blocks, q, can be computed at the time the object B is created. (It doesn't show up here, but it is known ahead of time.) One approach is to include in B a vector that partitions 1:q into nthreads() ranges of nearly equal size and to spawn a task for each of these sub-ranges. Alternatively, I could spawn a task for each block and let the threading mechanism handle the assignment of tasks to threads. For either approach I'm not quite sure how to wait until a group of tasks has finished. Pointers to existing code in packages would be welcome.
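
As a sketch of the first approach, a helper along these lines could build the partition (threadranges is a hypothetical name; the resulting vector could be stored in B at construction time):

# Partition 1:q into nt ranges of nearly equal size.
function threadranges(q::Integer, nt::Integer = Threads.nthreads())
    len, extra = divrem(q, nt)
    ranges = Vector{UnitRange{Int}}(undef, nt)
    lo = 1
    for t = 1:nt
        hi = lo + len - 1 + (t <= extra)   # spread the remainder over the first ranges
        ranges[t] = lo:hi
        lo = hi + 1
    end
    ranges
end

For example, threadranges(10, 4) returns [1:3, 4:6, 7:8, 9:10].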

I appreciate that this explanation may be too vague; I am happy to answer questions about it.

I see that the approach in the CSV package is to wrap the spawning of a collection of tasks in an @sync block.
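
A minimal version of that pattern, built on the hypothetical threadranges helper above; @sync blocks until every task spawned lexically inside it has finished:

# Spawn one task per sub-range of blocks and wait for all of them.
function process_blocks!(work!, q::Integer)
    @sync for rng in threadranges(q)
        Threads.@spawn for k in rng
            work!(k)   # update block k in place
        end
    end
    nothing
end

With this in hand, the per-block kernel can be passed as work!, e.g. process_blocks!(k -> update_block!(A, Bd, k), q) for some hypothetical update_block! kernel.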

The @threads macro builds a fairly complicated closure, which sometimes stupefies the compiler. By introducing a function barrier, i.e. calling

function _sub(A, Bd, coloffset, m, S)
    # Update one S-column block of A in place: block .= block * LowerTriangular(Bd)
    for i = 1:m
        @inbounds for j = 1:S
            Aij = A[i, j + coloffset] * Bd[j, j]
            for l = (j + 1):S
                Aij += A[i, l + coloffset] * Bd[l, j]
            end
            A[i, j + coloffset] = Aij
        end
    end
end

inside the threaded loop, I see a substantial speedup, e.g. 5x on 8 cores, and only moderate allocation. It might be better still with a different memory layout.
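
For reference, a sketch of how the threaded wrapper with the function barrier might look (rmulΛ_threaded! is a hypothetical name; the ReMat dispatch from the original signature is elided):

function rmulΛ_threaded!(A::AbstractMatrix{T}, Bd::AbstractMatrix{T}, S::Integer) where {T}
    m, n = size(A)
    q, r = divrem(n, S)
    iszero(r) || throw(DimensionMismatch("size(A, 2) is not a multiple of block size"))
    Threads.@threads for k = 1:q
        _sub(A, Bd, (k - 1) * S, m, S)   # the small function call is the barrier
    end
    A
end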

Thanks. That indeed helps.