Distributed computing of x = A \ b

N = 1000
x = sprand(N, N, 0.5) \ full(sprand(N, 1, 0.5))

The code above works for N = 1000. For N = 10000 my laptop runs out of memory. What are my options if I want to go to, for example, N = 10000000, assuming the distributed machines have enough capacity?
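To see why memory becomes the limit, here is a rough storage estimate. It is only a sketch: the helper names `csc_bytes`, `dense_bytes` and `gib` are made up for illustration, and it assumes Float64 values, Int64 indices, and about p * N^2 stored entries. Memory needed by the factorization itself comes on top of this.

```julia
# Back-of-the-envelope storage estimate (assumes Float64 values, Int64 indices).
# CSC storage: one value and one row index per nonzero, plus N + 1 column pointers.
csc_bytes(N, p) = round(Int, p * N^2) * (8 + 8) + (N + 1) * 8

# Dense storage: N^2 Float64 values.
dense_bytes(N) = N^2 * 8

gib(bytes) = bytes / 2^30

for N in (1_000, 10_000, 10_000_000)
    println("N = ", N,
            ": sparse (p = 0.5) ≈ ", round(gib(csc_bytes(N, 0.5)), digits = 3), " GiB",
            ", dense ≈ ", round(gib(dense_bytes(N)), digits = 3), " GiB")
end
```

At p = 0.5 the sparse format needs about as much memory as a dense matrix, so the sparse storage buys nothing at this density.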

Elemental.jl, Trilinos.jl, and PETSc.jl

Is there any kind of extensive tutorial on PETSc that covers creating a matrix and all the other steps? The documentation for PETSc.jl points to the documentation of the underlying library, but that is not very comprehensive.
By the way, do you know if it is possible to use PETSc with mixed parallelism? Distributed across the nodes, but shared memory and multithreaded inside each node?

Also, p = 0.5 is way too high for typical uses of sparse matrices: see this discussion.
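As a small illustration of a more typical density, one could keep the number of nonzeros per row roughly constant and let the density shrink as N grows. The value `k = 5` below is just an assumption for the example, not something from this thread.

```julia
using SparseArrays

N = 10_000
k = 5                      # assumed average number of nonzeros per row
A = sprand(N, N, k / N)    # density k / N shrinks as N grows

println("nonzeros: ", nnz(A), " (about ", k * N, " expected)")
println("density:  ", nnz(A) / N^2)
```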

Agreed. In our case a real sparse matrix is typically symmetric, positive definite, and has a small bandwidth.
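For a feel of how cheap such a structure is, here is a minimal sketch with the simplest banded case, a tridiagonal matrix from the standard library. The entries 4 and -1 are an assumption chosen so the matrix is symmetric, positive definite and well conditioned; a real finite element matrix would have a wider band.

```julia
using LinearAlgebra

N = 1_000_000
# Symmetric, positive definite (diagonally dominant), bandwidth 1.
A = SymTridiagonal(fill(4.0, N), fill(-1.0, N - 1))
b = ones(N)

x = A \ b    # banded solve: O(N) time and memory, no fill-in
println("relative residual: ", norm(A * x - b) / norm(b))
```

Only the diagonal and one off-diagonal are stored, so a million unknowns fit in a few megabytes.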

If it’s banded, adding support for distributed matrices in BandedMatrices.jl is in the pipeline, though a few years off.

EDIT: you might not need DistributedArrays if you work with BandedMatrix, though make sure you compile Julia with MKL as OpenBLAS currently crashes for banded matrices of size more than 10 million.

What we are looking for is a fairly standard parallel workflow for large physical problems discretized with, for example, the finite element method. A typical cluster has 1-20 machines, each with 12-48 cores and roughly 24-1024 GB of memory. The cluster can be considered homogeneous, so every machine in a computational job is treated as identical. A job might, for example, use 10 machines with 128 GB of memory and 24 cores each, for a total of 240 cores and 1280 GB of memory. A setup like this can easily solve physical problems with around 10-15 million dofs. These are the resources we should be using to solve big problems.

I can only describe the solution cycle verbally, but as I understand it, to use the full performance of every machine, each calculation node needs to

  1. assemble its part of the larger coefficient matrix A and right-hand side b, let’s call those A_i and b_i, threaded locally on machine i
  2. solve the global system Ax = b
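The two steps above can be sketched on a single machine as follows. This is only an illustration under stated assumptions: the model problem (-u'' + u = 1 on a 1D mesh with linear elements), the function names `assemble_part` and `assemble_and_solve`, and the use of threads as a stand-in for separate machines are all made up for the example. In a real distributed run the final `A \ b` would be handed to a distributed solver instead.

```julia
using SparseArrays, LinearAlgebra

# Step 1: assemble the contributions of the elements in `elems` (one "subdomain").
# Returns COO triplets for the local matrix and (row, value) pairs for the local load.
function assemble_part(elems, h)
    Ke = (1 / h) * [1.0 -1.0; -1.0 1.0] + (h / 6) * [2.0 1.0; 1.0 2.0]  # stiffness + mass
    fe = (h / 2) * [1.0, 1.0]                                           # load for f = 1
    Is = Int[]; Js = Int[]; Vs = Float64[]
    rows = Int[]; vals = Float64[]
    for e in elems
        dofs = (e, e + 1)              # element e connects nodes e and e + 1
        for a in 1:2
            push!(rows, dofs[a]); push!(vals, fe[a])
            for c in 1:2
                push!(Is, dofs[a]); push!(Js, dofs[c]); push!(Vs, Ke[a, c])
            end
        end
    end
    return (Is = Is, Js = Js, Vs = Vs, rows = rows, vals = vals)
end

function assemble_and_solve(nel, nparts)
    h = 1 / nel
    ndof = nel + 1
    chunk = cld(nel, nparts)
    parts = [((i - 1) * chunk + 1):min(i * chunk, nel) for i in 1:nparts]

    # Each part is assembled independently; threads stand in for machines here.
    local_data = Vector{Any}(undef, nparts)
    Threads.@threads for i in 1:nparts
        local_data[i] = assemble_part(parts[i], h)
    end

    # Merge the local pieces; `sparse` sums entries that share the same (row, column).
    Is = reduce(vcat, [d.Is for d in local_data])
    Js = reduce(vcat, [d.Js for d in local_data])
    Vs = reduce(vcat, [d.Vs for d in local_data])
    A = sparse(Is, Js, Vs, ndof, ndof)
    b = zeros(ndof)
    for d in local_data
        for (r, v) in zip(d.rows, d.vals)
            b[r] += v
        end
    end

    # Step 2: solve the global system (serial stand-in for a distributed solver).
    x = A \ b
    return A, b, x
end

A, b, x = assemble_and_solve(1_000, 4)
println("max deviation from u = 1: ", maximum(abs.(x .- 1)))
```

For this model problem the constant function u = 1 is the exact solution, which gives a simple sanity check on the assembled system.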

To be efficient, this cycle needs to be “mixed parallel”, as proposed by @Gregstrq: domain decomposition is done only into N pieces, where N is the number of available computers. Inside each node, threading is used to make all operations efficient, so that (for example) all 24 cores are used for constructing the final matrices.

One package is missing from the list above: MUMPS.jl. We have had good experience using MUMPS for parallel solves, so it could also work.

What we are looking for, and what we are missing, is a good workflow for solving big problems. A big plus would of course be to have everything ready for cloud computing, since one can always buy more computational power from cloud services when one's own machines are not big enough. Given a credit card, A and b, what is the solution?
