I am trying to perform many matrix exponentials in parallel. However, the multi-threaded code running on 40 cores (2x Skylake 6148) takes about twice as long as the single-threaded code. Here is a minimal working example:
using BenchmarkTools
using LinearAlgebra
using Random
function matExp!(D, A)
    for ii = 1:size(D, 3)
        D[:, :, ii] = exp(A[:, :, ii])
    end
    return nothing
end
function matExp_par!(D, A)
    Threads.@threads for ii = 1:size(D, 3)
        D[:, :, ii] = exp(A[:, :, ii])
    end
    return nothing
end
N = 2^15;
A = randn(4, 4, N);
D = similar(A);

BLAS.set_num_threads(40)
@benchmark matExp!(D, A)

BLAS.set_num_threads(1)
@benchmark matExp!(D, A)
@benchmark matExp_par!(D, A)
Note that this does memory allocation in your inner loop, which is very bad for performance. Even if you use @views for the slicing, the exp function allocates a new array for its result and then copies it into D.
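You can see this directly with @allocated (a quick check using the A defined above; the exact byte count will vary by Julia version):

Aslice = @view A[:, :, 1]
@allocated exp(Aslice)   # nonzero even for a view: exp allocates its result plus temporaries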
The allocations in the matrix exponentiation are going to be a bottleneck for working with exponentials of such small (4x4) matrices. There is an undocumented LinearAlgebra.exp! function that works partially in-place, but it still does many allocations of intermediate temporary arrays.
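If you want to experiment with exp! anyway, the call looks like this (a sketch only; exp! is unexported and undocumented, so its behavior may change between Julia versions):

buf = copy(A[:, :, 1])        # exp! uses its argument as workspace, so pass a copy
E = LinearAlgebra.exp!(buf)   # "partially in-place": still allocates intermediate arrays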
I would also encourage you to step back and thread at a coarser level. Rather than multithreading a single loop that computes a zillion matrix exponentials and does nothing else, consider threading the computation in which those matrix exponentials are used, i.e. multi-thread a loop whose body does more work: it computes the matrix, takes the matrix exponential, and then does some other work with the result, as in the sketch below. Multi-thread one big loop, not lots of little loops. (A Python/Matlab "vectorized" style is often your enemy here.)
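For instance, something along these lines (a sketch only; build_input and consume are hypothetical stand-ins for whatever surrounds the exponential in your real code):

function pipeline!(out, params)
    Threads.@threads for i = 1:length(params)
        A = build_input(params[i])   # hypothetical: assemble the 4x4 matrix
        E = exp(A)                   # the matrix exponential
        out[i] = consume(E)          # hypothetical: downstream work using E
    end
    return out
end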
Actually, it looks like someone has already done the work of optimizing exp for small matrices. So: use StaticArrays.
The following gives me nearly linear speedup on my laptop (4 cores):
using StaticArrays, BenchmarkTools

matexp!(D, A) = D .= exp.(A)

function matexp_par!(D, A)
    Threads.@threads for i = 1:length(A)
        D[i] = exp(A[i])
    end
    return D
end
N = 2^15;
A = rand(SMatrix{4,4,Float64,16}, N);
D = similar(A);
@btime matexp!($D, $A);      # reports 0 allocations!
@btime matexp_par!($D, $A);
(But it might be too small to benefit from 40 cores. See also my above suggestion of parallelizing a “coarser” computation than just this loop by itself.)
From my experience with multi-threading, I would say that your problem is so small that the multi-threading overhead kicks in.
Try splitting your data into larger batches. Does this improve the performance for any JULIA_NUM_THREADS in {2,4,8,16,32}?
function matExp_par!(D, A, numthreads)
    # in this example, size(D, 3) must be a multiple of numthreads
    batchsize = div(size(D, 3), numthreads)
    Threads.@threads for i = 1:numthreads
        for j = 1:batchsize
            ii = (i - 1) * batchsize + j
            D[:, :, ii] = exp(A[:, :, ii])
        end
    end
    return nothing
end
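For example, with JULIA_NUM_THREADS set beforehand, you would call it as (assuming size(D,3) is a multiple of the thread count):

matExp_par!(D, A, Threads.nthreads())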
Edit: Just saw that @stevengj was a little faster to suggest this.
Thanks a ton, both of you! I tried splitting it into two for loops and parallelizing the outer one, as suggested, which unfortunately does not help. I also tried splitting it with Threads.@spawn into two (not 40) sub-problems, which is also slow.
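For reference, the two-task version I tried looked roughly like this (with a small serial helper written just for the test):

function matExp_range!(D, A, lo, hi)
    for ii = lo:hi
        D[:, :, ii] = exp(A[:, :, ii])
    end
    return nothing
end

task = Threads.@spawn matExp_range!(D, A, 1, N ÷ 2)   # first half on a worker task
matExp_range!(D, A, N ÷ 2 + 1, N)                     # second half on this thread
wait(task)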
However, using StaticArrays made my jaw drop. This is about 1000x faster, and now 40 cores give me about a 10x speedup over single-core, which sounds reasonable to me.