How to parellalizing many matrix solves

I am trying to solve many matrix equations Ax=b of fixed size. A simplified version of the code is as follows:

function ser_solve(A,b,res,nu)
    for i in(1:nu)
        res[i,:]=A[i,:,:]\b[i,:]
    end
    return res;
end

nu = 10000; si = 100;
A = randn(nu,si,si); b = randn(nu,si);
res = zeros(nu,si);

ser_solve(A,b,res,nu);

The above code is small for now, but that’s just to demonstrate the problem. The sizes of the problems will grow. How would I go about using distributed computing for this code? I have attempted to use @threads in the for loop, but the speedup is nearly non-existent. My attempts at using @distributed and shared arrays haven’t seemed to work. How would I do this the correct way? Is there possibly a way to use pmap to do this?

I should also mention that the matrices in the code I am running are actually populated by some auxiliary functions. So what would be the best way to both parallelize that population of the matrices and then solve them?

Two things to note here. res[i,:]=A[i,:,:]\b[i,:] should be written as res[i,:]=@views A[i,:,:]\b[i,:] to prevent copies. Also BLAS automatically uses threading for solves, so you won’t get a ton of speedup. That said, setting BLAS threads to 1 and threading the loop should lead to speedup.

I’m not sure that would be beneficial in this case as the memory is not contiguous. If the memory layout was changed to respect the column major storage of Julia, the views would make sense for sure.

2 Likes

Confirming that switching the index order and adding @views gives about 2x speedup.

1 Like

Hey thanks for you response! I have actually tried setting the BLAS thread to one and with @threads it does seem to give some speedup. The issue is that eventually the size and amount of the matrix solves will grow. Meaning that both si and nu will increase. So it seems to me that I will have to distribute the computation over somehow, possibly on a cluster. I just don’t know how exactly to do that.

These are fair points I will incorporate into the code. Thanks for that. But out of curiosity, how would you go about distributing this problem over a cluster? This is what I am looking to do.