Strategy for using multiple threads in an in-place matrix operation

When fitting a linear mixed-effects model with the MixedModels package, most of the time is spent updating a Cholesky factorization of a blocked array. Within that update, a large part of the time is spent on in-place operations on blocks of columns in a dense matrix. For example, in this operation

function rmulΛ!(A::M, B::ReMat{T,S}) where {M<:AbstractMatrix{T}} where {T,S}
    m, n = size(A)
    q, r = divrem(n, S)   # q blocks of S columns each
    iszero(r) || throw(DimensionMismatch("size(A, 2) is not a multiple of block size"))
    Bd = B.λ.data         # data of the S × S lower triangular λ
    @inbounds for k = 1:q
        coloffset = (k - 1) * S
        for i = 1:m
            for j = 1:S
                # dot product of row i of the block with column j of Bd,
                # touching only the lower triangle (l ≥ j)
                Aij = A[i, j + coloffset] * Bd[j, j]
                for l = (j + 1):S
                    Aij += A[i, l + coloffset] * Bd[l, j]
                end
                A[i, j + coloffset] = Aij
            end
        end
    end
    A
end

the matrix Bd is an S × S lower triangular matrix where S is small (usually single digit), but the number of columns in A may be in the thousands. The operation multiplies each successive block of S columns of A in place by Bd on the right; the number of blocks is q.
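
For checking correctness, each block update is equivalent to an out-of-place right-multiplication by the lower triangle of Bd. A minimal reference version might look like this (rmul_blocks_reference! is a hypothetical name; it allocates temporaries, so it is useful only for testing against the in-place version):

using LinearAlgebra

# Reference (allocating) version of the same update: each S-column block
# of A is right-multiplied by the lower triangular part of Bd.
function rmul_blocks_reference!(A::AbstractMatrix, Bd::AbstractMatrix, S::Integer)
    q = size(A, 2) ÷ S
    L = LowerTriangular(Bd)
    for k = 1:q
        cols = (k - 1) * S .+ (1:S)
        A[:, cols] = A[:, cols] * L   # allocates a temporary block
    end
    A
end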

Because this is done in place, it seems natural to use multiple threads. However, the simple approach of inserting Threads.@threads after @inbounds actually slows things down and results in considerably more allocation of storage.
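
For concreteness, the naive modification is sketched below, with the ReMat dispatch replaced by passing Bd and S directly so that the sketch is self-contained:

# Naive threaded version (slower and allocates more, as noted above).
function rmulΛ_naive!(A::AbstractMatrix{T}, Bd::AbstractMatrix{T}, S::Integer) where {T}
    m, n = size(A)
    q = n ÷ S
    @inbounds Threads.@threads for k = 1:q
        coloffset = (k - 1) * S
        for i = 1:m, j = 1:S
            Aij = A[i, j + coloffset] * Bd[j, j]
            for l = (j + 1):S
                Aij += A[i, l + coloffset] * Bd[l, j]
            end
            A[i, j + coloffset] = Aij
        end
    end
    A
end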

So what are good strategies for parallelizing this operation? The number of blocks, q, can be computed at the time the object B is created. (It doesn't show up here, but it is known ahead of time.) One approach is to include in B a vector that partitions 1:q into nthreads() ranges of nearly equal size and to spawn a task for each of these sub-ranges. Alternatively, I could spawn a task for each block and let the threading mechanism handle the assignment of tasks to threads. For either approach I'm not quite sure how to wait until a group of tasks has finished. Pointers to existing code in packages would be welcome.
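
As a sketch of the first approach, a helper along these lines could build the partition (threadranges is a hypothetical name; the resulting vector could be stored in B at construction time):

# Partition 1:q into nt ranges of nearly equal size.
function threadranges(q::Integer, nt::Integer = Threads.nthreads())
    len, extra = divrem(q, nt)
    ranges = Vector{UnitRange{Int}}(undef, nt)
    lo = 1
    for t = 1:nt
        hi = lo + len - 1 + (t <= extra)   # spread the remainder over the first ranges
        ranges[t] = lo:hi
        lo = hi + 1
    end
    ranges
end

For example, threadranges(10, 4) returns [1:3, 4:6, 7:8, 9:10].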

I appreciate that this explanation may be too vague; I am happy to answer questions about it.

I see that the approach in the CSV package is to wrap the spawning of a collection of tasks in an @sync block.
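
A minimal version of that pattern, built on the hypothetical threadranges helper above; @sync blocks until every task spawned lexically inside it has finished:

# Spawn one task per sub-range of blocks and wait for all of them.
function process_blocks!(work!, q::Integer)
    @sync for rng in threadranges(q)
        Threads.@spawn for k in rng
            work!(k)   # update block k in place
        end
    end
    nothing
end

With this in hand, the per-block kernel can be passed as work!, e.g. process_blocks!(k -> update_block!(A, Bd, k), q) for some hypothetical update_block! kernel.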

The @threads macro builds a fairly complicated closure, which sometimes stupefies the compiler. By introducing a function barrier, i.e. calling

function _sub(A, Bd, coloffset, m, S)
    # Update one S-column block of A in place: block .= block * LowerTriangular(Bd)
    for i = 1:m
        @inbounds for j = 1:S
            Aij = A[i, j + coloffset] * Bd[j, j]
            for l = (j + 1):S
                Aij += A[i, l + coloffset] * Bd[l, j]
            end
            A[i, j + coloffset] = Aij
        end
    end
end

inside the threaded loop, I see a substantial speedup, e.g. 5x on 8 cores, and only moderate allocation. It might be better still with a different memory layout.
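
For reference, a sketch of how the threaded wrapper with the function barrier might look (rmulΛ_threaded! is a hypothetical name; the ReMat dispatch from the original signature is elided):

function rmulΛ_threaded!(A::AbstractMatrix{T}, Bd::AbstractMatrix{T}, S::Integer) where {T}
    m, n = size(A)
    q, r = divrem(n, S)
    iszero(r) || throw(DimensionMismatch("size(A, 2) is not a multiple of block size"))
    Threads.@threads for k = 1:q
        _sub(A, Bd, (k - 1) * S, m, S)   # the small function call is the barrier
    end
    A
end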

Thanks. That indeed helps.