Distributed loop and possible memory leak

Hello, I’m trying to distribute the work of a for loop across my worker processes. Here is what I tried:

using Distributed
addprocs(5)

# Load the packages on the master process and on every worker.
@everywhere begin
    using LinearAlgebra
    using SharedArrays
    using BenchmarkTools
end

# Distributed version: each row update of H runs on a worker.
# H and Hold are SharedArrays so every worker can read and write them.
function testpar(r, K, H, Hold, WtX, WtW, LH)
    for k = 1:K
        beta = 1/k  # extrapolation weight
        @sync @distributed for i = 1:r
            Hi = H[i,:]'                     # current row (this slice copies)
            Hextra = beta*(Hi - Hold[i,:]')  # extrapolation term
            Hold[i,:] = Hi
            # Projected update, clipped away from zero. Note that this reads
            # all of H while other workers may be writing other rows, so the
            # result can differ from the sequential version.
            H[i,:] = max.(Hi + Hextra + (WtX[i,:]' - WtW[i,:]'*H)/LH[i], 1e-16)
        end
    end
end

# Sequential reference implementation of the same update.
function testnopar(r, K, H, Hold, WtX, WtW, LH)
    for k = 1:K
        beta = 1/k
        for i = 1:r
            Hi = H[i,:]'
            Hextra = beta*(Hi - Hold[i,:]')
            Hold[i,:] = Hi
            H[i,:] = max.(Hi + Hextra + (WtX[i,:]' - WtW[i,:]'*H)/LH[i], 1e-16)
        end
    end
end

function main()
    m = 162
    n = 307*307
    r = 6
    K = 20
    W = rand(m, r)
    Hinit = rand(r, n)
    H = copy(Hinit)
    Hold = copy(H)
    X = rand(m, n)
    WtX = W'*X      # r × n
    WtW = W'*W      # r × r
    LH = diag(WtW)  # per-row step-size denominators
    @btime testnopar($r, $K, $H, $Hold, $WtX, $WtW, $LH)

    # Same benchmark with H and Hold as SharedArrays for the workers.
    H = SharedArray(copy(Hinit))
    Hold = SharedArray(copy(H))
    @btime testpar($r, $K, $H, $Hold, $WtX, $WtW, $LH)
end

main()

and here is the output:

  250.167 ms (2760 allocations: 949.35 MiB)
  259.098 ms (18263 allocations: 1.05 MiB)

In my original code the line H[i,:] = max.( … ) is the bottleneck because n is large, so I thought it would be worth distributing it. But in this simple test I see no gain in computation time from the distributed version, and I don’t understand why the sequential version allocates so much memory (almost 1 GiB). I’m probably doing lots of things wrong (I’m coming from MATLAB), but I can’t tell what.
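
My current guess for the allocations is the slicing: every H[i,:] makes a copy, and WtW[i,:]'*H materializes a fresh length-n row on every pass, so over K*r inner iterations that is hundreds of length-n temporaries. Below is a sketch of a lower-allocation sequential variant I’m considering; the function name testnopar_noalloc and the buffers g and newrow are mine, and I haven’t verified it against the original:

# Same update as testnopar, but with views and preallocated buffers so the
# inner loop allocates (almost) nothing. Needs LinearAlgebra for mul!.
function testnopar_noalloc(r, K, H, Hold, WtX, WtW, LH)
    n = size(H, 2)
    g = zeros(n)       # buffer for (WtW[i,:]'*H)', i.e. H'*WtW[i,:]
    newrow = zeros(n)  # buffer for the updated row
    for k = 1:K
        beta = 1/k
        for i = 1:r
            Hi    = view(H, i, :)     # views instead of copying slices
            Hiold = view(Hold, i, :)
            WtXi  = view(WtX, i, :)
            mul!(g, H', view(WtW, i, :))  # gradient term, before row i changes
            c = 1/LH[i]
            @. newrow = max(Hi + beta*(Hi - Hiold) + (WtXi - g)*c, 1e-16)
            Hiold .= Hi   # remember the current row as the new "old" row...
            Hi .= newrow  # ...then overwrite it with the update
        end
    end
end

If that guess is right, the same trick should also work inside the @distributed loop, since views into SharedArrays behave the same way; the remaining question would then be whether r = 6 rows gives each worker enough work to amortize the distribution overhead.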