Running a distributed calculation on JuliaHub


I’m trying to run a distributed nested loop on JuliaHub with 3-10 workers. The code evaluates a marginal likelihood function on a 3-d grid with the help of a distributed array:

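using Distributed, DistributedArrays

# 4-d result array, split across the workers along the last (psi) dimension
logML_mat = Array{Float64}(undef, np, ngl, ngt, ngs)
logML_mat = distribute(logML_mat; dist=(1,1,1,nworkers()))
@sync @distributed for w in 1:nworkers()
    logML_mat_loc = localpart(logML_mat)   # chunk stored on this worker
    li = localindices(logML_mat)           # global indices covered by that chunk
    for (ls,s) in enumerate(li[4])
        for p in 1:np
            for i in 1:ngl
                for j in 1:ngt
                    logML_mat_loc[p,i,j,ls] = logML( [lambda_grid[i], theta_grid[j], psi_grid[s]] )
                end
            end
        end
    end
end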

The code runs successfully, but after the outer (distributed) loop is finished, the following error occurs:

LoadError: LoadError: LoadError: TaskFailedException nested task error: On worker 2: TaskFailedException nested task error: peer 3 is not connected to 2. Topology : master_worker

What might be the cause?



This failure has to do with the communication topology of the cluster, which only allows direct communication between the head node and worker nodes. @tanmaykm is looking at enabling all-to-all communication and will post an update here after he looks into it.
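Until all-to-all communication is available, one possible workaround is to avoid worker-to-worker traffic entirely and collect the results on the head node with `pmap`. The following is only a minimal sketch, not an official JuliaHub recommendation: the `addprocs` call reproduces a `master_worker` topology on a local machine (on JuliaHub the workers are provisioned by the platform), `logML` is replaced by a placeholder, the grids are made-up values, and the `p` dimension from the original code is left out for brevity.

```julia
using Distributed

# Assumption: local stand-in for the cluster, restricted to head-node <-> worker links
addprocs(2; topology=:master_worker)

# Placeholder for the real marginal likelihood; must be defined on every worker
@everywhere logML(x) = -sum(abs2, x)

# Hypothetical grids, for illustration only
lambda_grid = range(0.1, 1.0; length=3)
theta_grid  = range(0.1, 1.0; length=4)
psi_grid    = range(0.1, 1.0; length=5)

# Evaluate one psi-slice per task; each slice is sent back to the head node,
# so workers never need to talk to each other
function grid_logML(lambda_grid, theta_grid, psi_grid)
    slices = pmap(psi_grid) do psi
        [logML([l, t, psi]) for l in lambda_grid, t in theta_grid]
    end
    return cat(slices...; dims=3)   # stack the slices along the psi dimension
end

logML_mat = grid_logML(lambda_grid, theta_grid, psi_grid)
```

The trade-off is that the full result array lives on the head node instead of being spread across the workers, which is fine as long as the grid fits in its memory.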
