Hi folks,

I recently observed something strange: passing a distributed matrix to a function seems to hurt performance. Why would that be?

Let’s say the task is to apply a distributed matrix to some vectors, as below:

```
using Distributed
using DistributedArrays
using LinearAlgebra
```
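For completeness, here is a minimal worker-setup sketch using only the `Distributed` standard library (the worker count of 2 is an arbitrary choice for illustration; on a real run you would match it to your machine):

```julia
using Distributed

# Add two local worker processes (the count 2 is just for illustration).
addprocs(2)

# Packages used inside @spawnat blocks must be loaded on every worker.
@everywhere using LinearAlgebra

@assert nworkers() == 2
```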

The task is a distributed mat-vec multiplication. Assume `L` is a sparse matrix of size N × N and some number of processors have been added using `addprocs`; then `DL` is a distributed version of `L`:

```
DL = distribute(L, procs=workers(), dist=[length(workers()), 1])
V = rand(N, 5)
```
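For intuition, `dist = [length(workers()), 1]` splits the rows of `L` into contiguous blocks, one per worker. A minimal sketch of that kind of partitioning in plain Julia (an assumption for illustration; the exact split is up to DistributedArrays, but it is near-equal contiguous row blocks):

```julia
# Sketch: split N rows into p roughly equal contiguous blocks, mirroring
# the dist = [p, 1] row partitioning of a distributed matrix.
function row_blocks(N, p)
    bounds = round.(Int, range(0, N; length = p + 1))
    return [bounds[i]+1:bounds[i+1] for i in 1:p]
end

@assert row_blocks(10, 2) == [1:5, 6:10]
```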

Let’s call the task below “matvec”. `Y` is a distributed matrix that stores the result:

```
walltime_matvec = @elapsed begin
    Y = dzeros((N, 5), workers(), [length(workers()), 1])
    wkers = workers()
    @sync for i in 1:length(wkers)
        # each worker multiplies its local block of rows by the full V
        @spawnat wkers[i] localpart(Y)[:, :] = localpart(DL) * V
    end
    Y = Matrix{Float64}(Y)
end
```
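As a side note on what `@spawnat` actually ships to a worker: the expression becomes a closure, and every variable it captures (here `V` and the distributed array handle) is serialized to the target process. A minimal self-contained sketch, run on process 1 so no extra workers are assumed:

```julia
using Distributed

V = rand(3, 2)

# The closure below captures V; when the target pid is remote, the
# captured V is serialized and sent along with the closure.
fut = @spawnat 1 sum(V)

@assert fetch(fut) == sum(V)
```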

Now let’s wrap the “matvec” task up as a function, passing `DL` and `V` as parameters, and call this task “matvec_func”:

```
function matvec_func(A, V)
    # take the sizes from the arguments instead of the global N
    Y = dzeros((size(A, 1), size(V, 2)), workers(), [length(workers()), 1])
    wkers = workers()
    @sync for i in 1:length(wkers)
        @spawnat wkers[i] localpart(Y)[:, :] = localpart(A) * V
    end
    return Matrix{Float64}(Y)
end
walltime_matvec_func = @elapsed matvec_func(DL, V)
```

I found that `walltime_matvec_func` is usually much larger than `walltime_matvec` (see the figure below). This looks very strange. Could anyone explain it, or suggest a fix?

The figure summarizes the wall time of each task against the number of processors used. The matrix size is 10^7 × 10^7. The experiment ran on a workstation with 2 NUMA nodes; each node contains 2 sockets, and each socket contains 24 CPUs. Please only read the “matvec” and “matvec_func” curves; the other curves are for other tasks.