Hi folks,
I observed something strange recently: passing a distributed matrix to a function seems to hurt performance.
Let’s say the task is to apply a distributed matrix to a few vectors, like below:
using Distributed
using DistributedArrays
using LinearAlgebra
# Distributed mat-vec multiplication.
# Assume L is a sparse N×N matrix and that some worker processes have been added with addprocs;
# DL is then the distributed version of L.
DL = distribute(L, procs=workers(), dist=[length(workers()), 1])
V = rand(N, 5)
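(For anyone who wants to run the snippet above as-is: L is simply assumed to exist. A stand-in can be built as in the sketch below, before the distribute call; the worker count, size, and density here are arbitrary placeholders, not what I used in the experiment.)
addprocs(4)                          # placeholder worker count; the experiment varies this
@everywhere using DistributedArrays, SparseArrays
N = 10^4                             # placeholder size (the actual runs use N = 10^7)
L = sprand(N, N, 1e-4)               # made-up sparse matrix standing in for the real L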
# Let’s call the task below “matvec”.
# Y is a distributed matrix that stores the result.
walltime_matvec = @elapsed begin
    Y = dzeros((N, 5), workers(), [length(workers()), 1])
    wkers = workers()
    @sync for i in 1:length(workers())
        @spawnat wkers[i] localpart(Y)[:, :] = localpart(DL) * V
    end
    Y = Matrix{Float64}(Y)
end
# Now let’s wrap the “matvec” task in a function, passing DL and V as arguments, as follows.
# Call this task “matvec_func”.
function matvec_func(A, V)
    Y = dzeros((N, 5), workers(), [length(workers()), 1])
    wkers = workers()
    @sync for i in 1:length(workers())
        @spawnat wkers[i] localpart(Y)[:, :] = localpart(A) * V
    end
    Y = Matrix{Float64}(Y)
    return Y
end
walltime_matvec_func = @elapsed matvec_func(DL, V)
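(As a sanity check, not part of either timing: both versions return the same result, and on a small test size it also agrees with the serial product.)
Y2 = matvec_func(DL, V)
@assert isapprox(Y, Y2)     # the function version computes the same result as the inline task
@assert isapprox(Y, L * V)  # and both agree with the serial product (feasible only for small N)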
I found that walltime_matvec_func is usually much larger than walltime_matvec (see the figure below). This looks strange to me. Could anyone explain it, or suggest what I might be doing wrong?
The figure above plots the wall time of each task against the number of processes used. The matrix size is 10^7 × 10^7. The experiment ran on a workstation with 2 NUMA nodes; each node contains 2 sockets and each socket has 24 CPUs. Please look only at the “matvec” and “matvec_func” curves; the other curves are for other tasks.