Issues with the Distributed package


I have some code that has to deal with large arrays:

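    l = length(A)
    m = 56^2
    flag = trues(l)
    flatten_connectivity = Array{Float32, 2}(undef, l, m)
    # Row i holds the sorted entries of the 56x56 submatrix of B selected by A[i]
    for i = 1:l
        flatten_connectivity[i, :] = sort(vec(B[A[i], A[i]]))
    end
    # Drop every entry whose sorted row matches an earlier kept row within tolerance
    for i = 1:l
        flag[i] || continue
        for j = i+1:l
            flag[j] || continue
            if all(abs(flatten_connectivity[i, k] - flatten_connectivity[j, k]) <= 0.0001 for k = 1:m)
                flag[j] = false
            end
        end
    end
    A = A[flag]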

This is the fastest version I was able to get for this task (without threading).

“A” is a Vector{Vector{Int8}} that can get quite large (100_000_000 x 56), and B is a Matrix{Float32} (120 x 120). In this extreme case I have to find a balance between speed and memory usage. The maximum memory usage is close to 1 TB.
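As a rough back-of-the-envelope check, assuming the extreme sizes quoted above and counting only the raw element payload (no allocator or per-vector overhead), the footprint can be estimated like this. The variable names `matrix_bytes` and `index_bytes` are just illustrative.

```julia
# Sizes quoted above (extreme case)
l = 100_000_000
m = 56^2

# Dense l x m Float32 matrix (flatten_connectivity)
matrix_bytes = l * m * sizeof(Float32)

# Payload of A only: l vectors of 56 Int8 each, ignoring per-vector overhead
index_bytes = l * 56 * sizeof(Int8)

println("flatten_connectivity: ", matrix_bytes / 1e12, " TB")
println("A (payload only):     ", index_bytes / 1e9, " GB")
```

Under these assumptions the dense matrix dominates by far, so it is the object that has to be split across nodes; A itself is comparatively small.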

I have access to an HPC cluster where each node has up to 512 GB of memory, so I have to make sure to use at least 2 nodes.

I am not sure how to spread the work with the Distributed package. My first approach was:

    using Distributed, SharedArrays

    flatten_connectivity = SharedArray{Float32}(l, m)
    @distributed (+) for i = 1:l
        flatten_connectivity[i, :] = sort(vec(connectivity[li1s_li2s[i], li1s_li2s[i]]))
    end

This works, but as soon as I leave the loop, all the memory ends up on the main process again, which causes an OOM error. I tried the same approach on the second loop, but I get “break or continue outside of for loop”…
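One way to sidestep the `continue` error, at least in the reducer form of `@distributed`, seems to be to express the guard as an `if` so that the loop body is a plain value-producing expression. The sketch below is only an illustration of that rewrite on a toy counting task; `count_active` is a hypothetical helper, not part of the code above.

```julia
using Distributed

# Hypothetical helper: count the entries that are still flagged, using a
# reducer-style @distributed loop without `continue`.
function count_active(flag)
    total = @distributed (+) for i = 1:length(flag)
        # Instead of `flag[i] || continue`, return a value from both branches
        if flag[i]
            1
        else
            0
        end
    end
    return total
end

flag = trues(10)
flag[[2, 5]] .= false
println(count_active(flag))
```

Note that this only addresses the syntax error. Writes to an ordinary array such as `flag` made inside a `@distributed` loop happen on each worker's own copy and are not visible to the other processes, so the second loop would still need a different structure (or a shared or explicitly communicated `flag`) to give the same result as the serial version.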

Should I just split the array and use @spawnat so that each process deals with a smaller part of the array?
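To make the question concrete, here is a minimal sketch of what such a split could look like. Since a SharedArray is only shared between processes on the same host, the idea is that each worker builds and deduplicates its own chunk and sends back only the indices it keeps. The names `dedup_chunk` and `dedup_distributed` are hypothetical, the sketch assumes A is non-empty, and it falls back to the main process when no workers have been added.

```julia
using Distributed

# Hypothetical helper: deduplicate one chunk locally and return the global
# indices (offset + local index) of the entries that were kept.
@everywhere function dedup_chunk(chunk, B, offset; tol = 0.0001)
    rows = [sort(vec(B[idx, idx])) for idx in chunk]
    keep = trues(length(chunk))
    for i in eachindex(rows)
        keep[i] || continue
        for j in i+1:length(rows)
            keep[j] || continue
            if all(abs(rows[i][k] - rows[j][k]) <= tol for k in eachindex(rows[i]))
                keep[j] = false
            end
        end
    end
    return offset .+ findall(keep)
end

# Hypothetical driver: one contiguous chunk of A per worker.
function dedup_distributed(A, B)
    ws = workers()                      # [1] when no workers were added
    n = length(A)
    chunk_len = max(1, cld(n, length(ws)))
    ranges = [lo:min(lo + chunk_len - 1, n) for lo in 1:chunk_len:n]
    futures = map(enumerate(ranges)) do (c, r)
        chunk = A[r]                    # slice locally so only the chunk is sent
        offset = first(r) - 1
        @spawnat ws[c] dedup_chunk(chunk, B, offset)
    end
    return reduce(vcat, fetch.(futures))
end

# Tiny usage example with made-up data
B = Float32[0 1 2; 1 0 2; 2 2 0]
A = [Int8[1, 2], Int8[2, 1], Int8[1, 3]]
kept = dedup_distributed(A, B)
A = A[kept]
```

This only removes duplicates inside each chunk. Entries that are duplicates of each other but land in different chunks both survive, so a second pass comparing the surviving rows across chunks would still be needed to reproduce the serial result.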