Multi-threaded matrix exponential slower than the single threaded version

I am trying to perform many matrix exponentials in parallel. However, the multi-threaded code running on 40 cores (2x Skylake 6148) takes about twice as long as the single-threaded code. Here is a minimal working example:

using BenchmarkTools
using LinearAlgebra
using Random

function matExp!(D,A)
for ii = 1:size(D,3)
D[:,:,ii] = exp(A[:,:,ii]);
return nothing

function matExp_par!(D,A)
Threads.@threads for ii = 1:size(D,3)
D[:,:,ii] = exp(A[:,:,ii]);
return nothing

N = 2^15;
A = randn(4,4,N)
D = similar(A)
@benchmark matExp!(D,A)
@benchmark matExp!(D,A)
@benchmark matExp_par!(D,A)

The benchmark results are

BLAS threads = 40, normal for loop:

memory estimate: 106.04 MiB
allocs estimate: 567373

minimum time: 370.978 ms (0.00% GC)
median time: 413.522 ms (10.35% GC)
mean time: 408.831 ms (7.42% GC)
maximum time: 443.567 ms (11.15% GC)

samples: 13
evals/sample: 1

BLAS threads = 1, normal for loop:

memory estimate: 106.04 MiB
allocs estimate: 567373

minimum time: 194.300 ms (0.00% GC)
median time: 217.474 ms (10.13% GC)
mean time: 217.505 ms (7.50% GC)
maximum time: 238.008 ms (9.28% GC)

samples: 23
evals/sample: 1

BLAS threads = 1, multi-threaded for loop:

memory estimate: 106.07 MiB
allocs estimate: 567574

minimum time: 483.418 ms (0.00% GC)
median time: 484.606 ms (0.00% GC)
mean time: 484.666 ms (0.00% GC)
maximum time: 485.685 ms (0.00% GC)

samples: 11
evals/sample: 1

Thanks for your help!

Did you set JULIA_NUM_THREADS? (Check Threads.nthreads().)

1 Like

Yes, it’s set to 40 and top shows CPU usage close to 4000% throughout the multi-threaded execution.

Realize that this does memory allocation in your inner loop, which is very bad for performance. Even if you use @views for the slicing, the exp function allocates a new array for the result and then copies it into D.

The allocations in the matrix exponentiation are going to be a bottleneck for working with exponentials of such small (4x4) matrices. There is an undocumented LinearAlgebra.exp! function that works partially in-place, but it still does many allocation of intermediate temporary arrays.

If you are really looking at 4x4 matrices (i.e. these aren’t just a synthetic benchmark), it should be possible to implement something vastly better, by porting the exp! implementation (based on a Padé approximation) to use StaticArrays. StaticArrays are almost always a good idea when working with lots of tiny matrices and vectors.

I would also encourage you to step back and try to thread at a coarser level — rather than trying to multithread a single loop that computes a zillion matrix exponentials and does nothing else, you should instead consider threading how those matrix exponentials are used. i.e. multi-thread a loop where the loop body does more work — e.g. computes the matrix, does a matrix exponential, and then does some other work with the matrix exponential. Multi-thread one big loop, not lots of little loops. (A Python/Matlab “vectorized” style is often your enemy here.)


Actually, it looks like someone has already done this. So: use StaticArrays.

The following gives me nearly linear speedup on my laptop (4 cores):

using StaticArrays, BenchmarkTools

matexp!(D,A) = D .= exp.(A);
function matexp_par!(D,A)
    Threads.@threads for i = 1:length(A)
        D[i] = exp(A[i])
    return D

N = 2^15
A = rand(SMatrix{4,4,Float64,16}, N);
D = similar(A);

@btime matexp!($D,$A);           # reports 0 allocations!
@btime matexp_par!($D,$A);

(But it might be too small to benefit from 40 cores. See also my above suggestion of parallelizing a “coarser” computation than just this loop by itself.)


From my experience with multi-threadiing I would say that your problem is so small that the multi-threading overhead kicks in.

Try splitting your data in larger batches. Does this improve the performance for any JULIA_NUM_THREADS in {2,4,8,16,32}?

function matExp_par!(D,A,numthreads)
    # in this example size(D,3) must be multiple of numthreads
    batchsize = div(size(D,3),numthreads)
    Threads.@threads for i=1:numthreads
        for j=1:batchsize
            D[:,:,ii] = exp(A[:,:,ii])
    return nothing

Edit: Just saw that @stevengj was a little faster to suggest this.

1 Like

Thanks a ton both of you! I tried out splitting it in two for loops and parallelizing the outer one - as suggested, which, unfortunately, does not help. I also tried splitting it up with spawn into two (not 40) sub-problems, which is also slow.

However, the usage of static arrays made my jaw drop. This is about 1000x faster - and now 40 cores give me about a 10x speedup compared to single-core, which sounds reasonable to me.

Once more, thanks a lot!


Note that this is already what Threads.@threads does.

1 Like