When fitting a linear mixed effects model using the MixedModels
package most of the time is spent updating a Cholesky factorization of a blocked array. Within that updating operation a large part of the time is spend on in-place operations on blocks of columns in a dense matrix. For example, in this operation
function rmulΛ!(A::M, B::ReMat{T,S}) where {M<:AbstractMatrix{T}} where {T,S}
m, n = size(A)
q, r = divrem(n, S)
iszero(r) || throw(DimensionMismatch("size(A, 2) is not a multiple of block size"))
Bd = B.λ.data
m, n = size(A)
@inbounds for k = 1:q
coloffset = (k - 1) * S
for i = 1:m
for j = 1:S
Aij = A[i, j + coloffset] * Bd[j, j]
for k = j + 1:S
Aij += A[i, k + coloffset] * Bd[k, j]
end
A[i, j + coloffset] = Aij
end
end
end
A
end
the matrix Bd
is a lower triangular matrix whose size S
is small (single digit, usually) but the number of columns in A
may be in the thousands. The operation is to multiply successive blocks of S
columns in A
in-place by Bd
on the right. The number of blocks is q
.
Because this is being done in-place it seems natural to use multiple threads. However, the simple approach of sticking Threads.@threads
after @inbounds
actually slows things down and results in considerably more allocation of storage.
So what are good strategies for parallelizing this operation? The number of blocks, q
, can be computed at the time that the object B
is created. (It doesn’t show up here but that is known ahead of time.) So one approach is to include in B
a vector that partitions 1:q
into nthreads()
ranges of nearly equal sizes and spawn tasks for each of these sub-ranges. Alternatively, I could spawn a task for each of the blocks and let the threading mechanism handle the allocation of tasks to threads. For either of these approaches I’m not quite sure how to do the wait
until a group of tasks is finished. Pointers to existing code in packages would be welcome.
I appreciate that this explanation may be too vague. I am happy to respond to questions regarding my explanation.