# Strategy for using multiple threads in an in-place matrix operation

When fitting a linear mixed-effects model with the `MixedModels` package, most of the time is spent updating a Cholesky factorization of a blocked array. Within that update, a large part of the time is spent on in-place operations on blocks of columns of a dense matrix. For example, in this operation

```julia
function rmulΛ!(A::M, B::ReMat{T,S}) where {M<:AbstractMatrix{T}} where {T,S}
    m, n = size(A)
    q, r = divrem(n, S)
    iszero(r) || throw(DimensionMismatch("size(A, 2) is not a multiple of block size"))
    Bd = B.λ.data
    @inbounds for k = 1:q
        coloffset = (k - 1) * S
        for i = 1:m
            for j = 1:S
                Aij = A[i, j + coloffset] * Bd[j, j]
                for kk = j + 1:S   # renamed to avoid shadowing the block index k
                    Aij += A[i, kk + coloffset] * Bd[kk, j]
                end
                A[i, j + coloffset] = Aij
            end
        end
    end
    A
end
```

the matrix `Bd` is a lower triangular matrix whose size `S` is small (single digit, usually) but the number of columns in `A` may be in the thousands. The operation is to multiply successive blocks of `S` columns in `A` in-place by `Bd` on the right. The number of blocks is `q`.
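In other words, the loop is a blockwise right-multiplication: each block of `S` columns of `A` is overwritten by that block times `Bd`, which is the same as multiplying all of `A` by a block-diagonal matrix with `q` copies of `Bd`. A quick sanity check with toy sizes (the names here are throwaway, not from `MixedModels`):

```julia
using LinearAlgebra

# Toy sizes: S = 2 column block width, q = 3 blocks, m = 4 rows.
S, q, m = 2, 3, 4
A = reshape(collect(1.0:m * S * q), m, S * q)
Bd = LowerTriangular([2.0 0.0; 1.0 3.0])

# Blockwise A[:, blk] * Bd for each of the q blocks ...
blockwise = reduce(hcat, (A[:, (k - 1) * S + 1:k * S] * Bd for k = 1:q))

# ... equals A times the block-diagonal kron(I(q), Bd).
blockwise ≈ A * kron(Matrix{Float64}(I, q, q), Bd)
```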

Because this is being done in-place it seems natural to use multiple threads. However, the simple approach of sticking `Threads.@threads` after `@inbounds` actually slows things down and results in considerably more allocation of storage.

So what are good strategies for parallelizing this operation? The number of blocks, `q`, can be computed at the time that the object `B` is created. (It doesn’t show up here but that is known ahead of time.) So one approach is to include in `B` a vector that partitions `1:q` into `nthreads()` ranges of nearly equal sizes and spawn tasks for each of these sub-ranges. Alternatively, I could spawn a task for each of the blocks and let the threading mechanism handle the allocation of tasks to threads. For either of these approaches I’m not quite sure how to do the `wait` until a group of tasks is finished. Pointers to existing code in packages would be welcome.
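To make the chunked-range idea concrete, here is a minimal sketch of the first approach (the helper name `blockforeach` is mine, not existing package code): partition `1:q` into `nthreads()` roughly equal chunks, spawn one task per chunk, and wrap the spawns in an `@sync` block, which waits until every task spawned inside it has finished.

```julia
using Base.Threads: @spawn, nthreads

# Sketch: apply f to each block index 1:q, one spawned task per chunk
# of roughly equal size; @sync blocks until all spawned tasks complete.
function blockforeach(f, q)
    chunklen = cld(q, nthreads())
    @sync for chunk in Iterators.partition(1:q, chunklen)
        @spawn foreach(f, chunk)
    end
    nothing
end
```

Here `f` would be the per-block kernel, e.g. `blockforeach(k -> _sub(A, Bd, (k - 1) * S, m, S), q)`; the alternative of one task per block is the same pattern with `@spawn f(k)` inside a plain loop over `1:q`.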

I appreciate that this explanation may be too vague. I am happy to respond to questions regarding my explanation.

I see that the approach in the `CSV` package is to wrap the spawning of a collection of tasks in an `@sync` block.

The `@threads` macro builds a fairly complicated expression, which sometimes stupefies the compiler. By introducing a function barrier, i.e. calling

```julia
function _sub(A, Bd, coloffset, m, S)
    for i = 1:m
        @inbounds for j = 1:S
            Aij = A[i, j + coloffset] * Bd[j, j]
            for k = j + 1:S
                Aij += A[i, k + coloffset] * Bd[k, j]
            end
            A[i, j + coloffset] = Aij
        end
    end
end
```

inside the threaded loop, I see substantial speedup, e.g. 5x on 8 cores, and moderate allocation. Might be better with a different memory layout.
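For reference, a minimal sketch of the threaded driver around that kernel; the name `rmulΛ_threaded!` is hypothetical (the original method dispatches on `ReMat`), and `_sub` is repeated so the snippet is self-contained:

```julia
using Base.Threads

# Per-block kernel behind the function barrier (as above).
function _sub(A, Bd, coloffset, m, S)
    for i = 1:m
        @inbounds for j = 1:S
            Aij = A[i, j + coloffset] * Bd[j, j]
            for k = j + 1:S
                Aij += A[i, k + coloffset] * Bd[k, j]
            end
            A[i, j + coloffset] = Aij
        end
    end
end

# Hypothetical driver: each iteration owns a disjoint block of S
# columns, so the in-place updates do not race.
function rmulΛ_threaded!(A, Bd, S)
    m, n = size(A)
    q, r = divrem(n, S)
    iszero(r) || throw(DimensionMismatch("size(A, 2) is not a multiple of block size"))
    Threads.@threads for k = 1:q
        _sub(A, Bd, (k - 1) * S, m, S)
    end
    A
end
```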


Thanks. That indeed helps.