I am trying to perform many matrix exponentials in parallel. However, the multi-threaded code running on 40 cores (2x Skylake 6148) takes about twice as long as the single-threaded code. Here is a minimal working example:

```julia
using BenchmarkTools
using LinearAlgebra
using Random

function matExp!(D, A)
    for ii = 1:size(D, 3)
        D[:, :, ii] = exp(A[:, :, ii])
    end
    return nothing
end

function matExp_par!(D, A)
    Threads.@threads for ii = 1:size(D, 3)
        D[:, :, ii] = exp(A[:, :, ii])
    end
    return nothing
end

N = 2^15
A = randn(4, 4, N)
D = similar(A)
@benchmark matExp!(D, A)      # run once with BLAS threads = 40
@benchmark matExp!(D, A)      # and again with BLAS threads = 1
@benchmark matExp_par!(D, A)
```

The benchmark results are:

BLAS threads = 40, normal for loop:

```
minimum time: 370.978 ms (0.00% GC)
median time:  413.522 ms (10.35% GC)
mean time:    408.831 ms (7.42% GC)
maximum time: 443.567 ms (11.15% GC)
samples:      13
evals/sample: 1
```

BLAS threads = 1, normal for loop:

```
minimum time: 483.418 ms (0.00% GC)
median time:  484.606 ms (0.00% GC)
mean time:    484.666 ms (0.00% GC)
maximum time: 485.685 ms (0.00% GC)
samples:      11
evals/sample: 1
```

Did you set `JULIA_NUM_THREADS`? (Check `Threads.nthreads()`.)


Yes, it's set to 40, and `top` shows CPU usage close to 4000% throughout the multi-threaded execution.

Note that your inner loop allocates memory, which is very bad for performance. Even if you use `@views` for the slicing, the `exp` function allocates a new array for the result and then copies it into `D`.
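To illustrate, here is a sketch of the loop rewritten with `@views` (the name `matExp_views!` is just for this example). The slice copy disappears, but `exp` itself still allocates its result on every iteration:

```julia
using LinearAlgebra

# Same loop as the original matExp!, but with non-copying slices.
# `exp` still allocates a fresh matrix each iteration, so @views only
# removes the allocation from the A[:,:,ii] slice itself.
function matExp_views!(D, A)
    @views for ii = 1:size(D, 3)
        D[:, :, ii] .= exp(A[:, :, ii])
    end
    return nothing
end
```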

The allocations in the matrix exponentiation are going to be a bottleneck when working with exponentials of such small (4x4) matrices. There is an undocumented `LinearAlgebra.exp!` function that works partially in-place, but it still allocates many intermediate temporary arrays.
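A minimal sketch of how that undocumented `exp!` can be called; since it is undocumented, its behavior may change between Julia versions, so treat this as illustrative:

```julia
using LinearAlgebra

A = randn(4, 4)
B = copy(A)            # exp! overwrites its argument, so keep A intact
LinearAlgebra.exp!(B)  # B now holds exp(A); temporaries are still allocated internally
```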

If you are really looking at 4x4 matrices (i.e. these aren't just a synthetic benchmark), it should be possible to implement something vastly better by porting the `exp!` implementation (based on a Padé approximation) to use StaticArrays. StaticArrays are almost always a good idea when working with lots of tiny matrices and vectors.

I would also encourage you to step back and try to thread at a coarser level. Rather than multithreading a single loop that computes a zillion matrix exponentials and does nothing else, consider threading how those matrix exponentials are used, i.e. multi-thread a loop whose body does more work: computes the matrix, takes its exponential, and then does some other work with the result. Multi-thread one big loop, not lots of little loops. (A Python/Matlab "vectorized" style is often your enemy here.)
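For concreteness, here is a hedged sketch of what coarser threading might look like; `build_matrix` and `consume` are hypothetical stand-ins for whatever produces and uses the matrices in the real application:

```julia
using LinearAlgebra

# Hypothetical stand-ins for the surrounding computation:
build_matrix(i) = (i / 100) * Matrix{Float64}(I, 4, 4)  # construct the i-th matrix
consume(E) = sum(E)                                     # use the exponential somehow

function coarse!(out, N)
    Threads.@threads for i = 1:N
        M = build_matrix(i)   # more work per iteration...
        E = exp(M)
        out[i] = consume(E)   # ...so the threading overhead is amortized
    end
    return out
end

out = Vector{Float64}(undef, 64)
coarse!(out, 64)
```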


Actually, it looks like someone has already done this. So: use StaticArrays.

The following gives me nearly linear speedup on my laptop (4 cores):

```julia
using StaticArrays, BenchmarkTools

matexp!(D, A) = D .= exp.(A)

function matexp_par!(D, A)
    Threads.@threads for i in eachindex(D, A)
        D[i] = exp(A[i])
    end
    return D
end

N = 2^15
A = rand(SMatrix{4,4,Float64,16}, N);
D = similar(A);

@btime matexp!($D, $A);           # reports 0 allocations!
@btime matexp_par!($D, $A);
```

(But it might be too small to benefit from 40 cores. See also my above suggestion of parallelizing a "coarser" computation than just this loop by itself.)


Try splitting your data into larger batches. Does this improve the performance for any `JULIA_NUM_THREADS` in {2,4,8,16,32}?

```julia
function matExp_par!(D, A, numthreads)
    # in this example size(D,3) must be a multiple of numthreads
    batchsize = div(size(D, 3), numthreads)
    Threads.@threads for i = 1:numthreads
        for j = 1:batchsize
            ii = (i - 1) * batchsize + j
            D[:, :, ii] = exp(A[:, :, ii])
        end
    end
    return nothing
end
```

Edit: Just saw that @stevengj was a little faster to suggest this.


Thanks a ton to both of you! I tried splitting it into two for loops and parallelizing the outer one, as suggested, which, unfortunately, does not help. I also tried splitting it up with `@spawn` into two (not 40) sub-problems, which is also slow.

However, the usage of static arrays made my jaw drop. This is about 1000x faster - and now 40 cores give me about a 10x speedup compared to single-core, which sounds reasonable to me.

Once more, thanks a lot!


Note that this is already what `Threads.@threads` does.
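For readers curious what that static batching looks like, here is a rough hand-rolled sketch (not the actual implementation) of the partitioning `Threads.@threads` performs: the range is split into contiguous chunks, one task per thread:

```julia
using Base.Threads

# Hand-rolled sketch of static chunked scheduling: split 1:N into
# nthreads() contiguous chunks and spawn one task per chunk.
function chunked_foreach(f, N)
    nt = nthreads()
    chunk = cld(N, nt)          # ceiling division: chunk size per task
    @sync for t = 1:nt
        lo = (t - 1) * chunk + 1
        hi = min(t * chunk, N)
        lo > hi && continue     # more threads than work: skip empty chunks
        Threads.@spawn for i = lo:hi
            f(i)
        end
    end
end
```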
