Hello, I’m trying to distribute the computation inside a for loop among my worker processes. Here is what I tried:
using Distributed
addprocs(5)

@everywhere begin
    using LinearAlgebra
    using SharedArrays
    using BenchmarkTools
    using Infiltrator
end
function testpar(r, K, H, Hold, WtX, WtW, LH)
    for k = 1:K
        beta = 1/k
        # distribute the row updates over the workers (H and Hold are SharedArrays when called from main)
        @sync @distributed for i = 1:r
            Hi = H[i,:]'
            Hextra = beta*(Hi - Hold[i,:]')
            Hold[i,:] = Hi
            H[i,:] = max.(Hi + Hextra + (WtX[i,:]' - WtW[i,:]'*H)/LH[i], 1e-16)
        end
    end
end
function testnopar(r, K, H, Hold, WtX, WtW, LH)
    for k = 1:K
        beta = 1/k
        # same update as testpar, but serial and on regular Arrays
        for i = 1:r
            Hi = H[i,:]'
            Hextra = beta*(Hi - Hold[i,:]')
            Hold[i,:] = Hi
            H[i,:] = max.(Hi + Hextra + (WtX[i,:]' - WtW[i,:]'*H)/LH[i], 1e-16)
        end
    end
end
function main()
    m = 162
    n = 307*307
    r = 6
    K = 20
    W = rand(m, r)
    Hinit = rand(r, n)
    H = copy(Hinit)
    Hold = copy(H)
    X = rand(m, n)
    WtX = W'*X
    WtW = W'*W
    LH = diag(WtW)
    # serial version on regular Arrays
    @btime testnopar($r, $K, $H, $Hold, $WtX, $WtW, $LH)
    # distributed version on SharedArrays
    H = SharedArray(copy(Hinit))
    Hold = SharedArray(copy(H))
    @btime testpar($r, $K, $H, $Hold, $WtX, $WtW, $LH)
end
main()
and here is the output:
250.167 ms (2760 allocations: 949.35 MiB)
259.098 ms (18263 allocations: 1.05 MiB)
In my original code the line H[i,:] = max.( … ) is the bottleneck because n is large, so I thought it would be worth distributing that loop. But in this simple test I see no gain in computation time from the distributed version, and I don’t understand why the non-distributed version allocates so much memory (almost 1 GiB). I’m probably doing several things wrong (I’m coming from MATLAB), but I don’t know what.
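In case it is useful, here is a minimal sketch of how I imagined rewriting one row update in place, using views, LinearAlgebra.mul! and a preallocated length-n buffer, hoping to avoid the row copies. The helper name update_row! and the buffer tmp are just my own invention, and I have not checked that it gives the same result or that it combines well with @distributed:

# Sketch only: in-place update of row i of H, assuming tmp is a preallocated
# Vector{Float64} of length n; mul! comes from LinearAlgebra (loaded above).
function update_row!(H, Hold, WtX, WtW, LH, tmp, i, beta)
    Hi    = view(H, i, :)                 # row i of H, no copy
    Hiold = view(Hold, i, :)              # row i of Hold, no copy
    wtxi  = view(WtX, i, :)
    wtwi  = view(WtW, i, :)
    mul!(tmp, transpose(H), wtwi)         # tmp = H' * WtW[i,:], computed before row i changes
    tmp .= Hi .+ beta .* (Hi .- Hiold) .+ (wtxi .- tmp) ./ LH[i]
    Hiold .= Hi                           # save the old row before overwriting it
    Hi .= max.(tmp, 1e-16)                # overwrite row i of H in place
    return nothing
end

Is something like that the right direction, or does the time and memory overhead come from somewhere else entirely?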