Distributed computing of x = A \ b

ahojukka5 · July 29, 2018, 1:37am

N = 1000
x = sprand(N, N, 0.5) \ full(sprand(N, 1, 0.5))

This code gives us a solution. For N = 10000 my laptop runs out of memory. What are my options if I want to go to, for example, N = 10000000, assuming the distributed machines have enough capacity?

ChrisRackauckas · July 29, 2018, 2:03am

Elemental.jl, Trilinos.jl, and PETSc.jl

Gregstrq · July 29, 2018, 4:52am

Is there any kind of extensive tutorial on PETSc which covers creating a matrix and all other steps? The documentation on PETSc.jl points to the documentation for the underlying library, but it is a bit incomprehensive.
By the way, do you know if it is possible to use PETSc with mixed parallelism? Distributed across the nodes, but shared memory and multithreaded inside each node?

simonbyrne · July 29, 2018, 5:41am

Also, p = 0.5 is way too high for typical uses of sparse matrices: see this discussion.
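To put rough numbers on this, here is a back-of-the-envelope storage estimate. It is only a sketch: the helper names `dense_bytes` and `csc_bytes` are made up for illustration, the CSC estimate assumes one `Float64` value and one `Int` row index per stored entry, and it ignores the fill-in that a direct factorization adds on top.

```julia
# Rough storage estimate in bytes; ignores factorization fill-in and solver workspace.
dense_bytes(N) = N^2 * sizeof(Float64)

# CSC layout: one value and one row index per stored entry, plus N + 1 column pointers.
csc_bytes(N, p) = round(Int, p * N^2) * (sizeof(Float64) + sizeof(Int)) + (N + 1) * sizeof(Int)

for (N, p) in ((10_000, 0.5), (10_000, 0.001))
    println("N = $N, p = $p: dense ", dense_bytes(N) / 1e9, " GB, CSC ", csc_bytes(N, p) / 1e9, " GB")
end
```

At p = 0.5 the sparse format stores an index next to every value, so it takes at least as much memory as the dense matrix. The format only pays off at much lower densities.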

ahojukka5 · July 29, 2018, 4:39pm

Agree. A real sparse matrix structure in our case is typically positive definite, symmetric and has a small bandwidth.
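As a minimal single-machine sketch of a matrix with these properties, the snippet below builds a 1D second-difference (Poisson) matrix and solves with a sparse Cholesky factorization. The test matrix is an illustrative stand-in, not the matrix from the original problem, and the code assumes Julia 1.x with the `SparseArrays` and `LinearAlgebra` standard libraries.

```julia
using LinearAlgebra, SparseArrays

# Stand-in test matrix: symmetric, positive definite, bandwidth 1.
N = 1000
A = spdiagm(-1 => fill(-1.0, N - 1), 0 => fill(2.0, N), 1 => fill(-1.0, N - 1))
b = ones(N)

# Sparse Cholesky exploits symmetry and positive definiteness.
F = cholesky(Symmetric(A))
x = F \ b

relres = norm(A * x - b) / norm(b)
println("nnz = ", nnz(A), ", relative residual = ", relres)
```

With a small bandwidth the factor stays sparse, which is what makes direct solvers practical for this class of problems.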

dlfivefifty · July 29, 2018, 7:12pm

If it’s banded, adding support for distributed matrices in BandedMatrices.jl is in the pipeline, though a few years off.

EDIT: you might not need DistributedArrays if you work with BandedMatrix, though make sure you compile Julia with MKL as OpenBLAS currently crashes for banded matrices of size more than 10 million.
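To illustrate why banded storage can make a single machine sufficient, here is a sketch that uses `SymTridiagonal` from the `LinearAlgebra` standard library as a stand-in for a banded matrix type. It does not use the BandedMatrices.jl API; the helper `banded_bytes` and the test matrix are assumptions made for illustration.

```julia
using LinearAlgebra

# Banded storage keeps only the diagonals inside the band: (l + u + 1) * N values.
banded_bytes(N, l, u) = (l + u + 1) * N * sizeof(Float64)

N = 1_000_000
# Hypothetical test matrix: symmetric, diagonally dominant, bandwidth 1.
S = SymTridiagonal(fill(4.0, N), fill(-1.0, N - 1))
b = ones(N)

x = ldlt(S) \ b                      # factorization and solve both cost O(N)
relres = norm(S * x - b) / norm(b)

println("relative residual: ", relres)
println("storage for N = 10^7, l = u = 10: ", banded_bytes(10_000_000, 10, 10) / 1e9, " GB")
```

Storage and work grow linearly in N for a fixed bandwidth, so ten million unknowns with a narrow band fit comfortably in the memory of one node.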

ahojukka5 · July 29, 2018, 8:40pm

What we are looking for is a fairly standard parallel workflow for solving large physical problems that are discretized using, e.g., the finite element method. A typical cluster environment has 1-20 machines, each with 12-48 cores and about 24-1024 GB of memory. The cluster can be considered homogeneous, so every machine in a computational job is the same. For example, a physical problem might be solved on 10 machines, each with 128 GB of memory and 24 cores, which gives 240 cores and 1280 GB of memory in total. This kind of setup can easily solve physical problems with something like 10-15 million dofs. These are the resources we should utilize to get solutions for big problems.

I can only describe the solution cycle verbally, but my understanding is that on each calculation node, if we want to use all of the machine performance efficiently, we need to

assemble the part of the larger coefficient matrix A and right hand side f, let’s call those A_i and f_i, threaded locally on machine i
solve the global system Ax = b

To be efficient, this cycle needs to be “mixed parallel”, as proposed by @Gregstrq, so that domain decomposition is done only for N pieces, where N is the number of available computers. Inside each node, threading is used to make all operations efficient, so that (in this example) all 24 cores are utilized for constructing the final matrices.
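The threaded local assembly step can be sketched as follows for a single node. This is an assumption-laden illustration, not the workflow of any particular package: the function `assemble`, the 1D two-node element, and the round-robin split of elements are all hypothetical, and the distribution across machines is left out entirely. Start Julia with several threads (for example `julia -t 4`) to see the loop run in parallel.

```julia
using SparseArrays

# Assemble stiffness matrix A and load vector f for a hypothetical 1D mesh
# in which element e couples dofs e and e + 1.
function assemble(nel; nchunks = Threads.nthreads())
    ndof = nel + 1
    ke = [1.0 -1.0; -1.0 1.0]    # element matrix (unit length, unit coefficient)
    fe = [0.5, 0.5]              # element load vector

    # One buffer per chunk, so no two tasks ever write to the same array.
    Is = [Int[] for _ in 1:nchunks]
    Js = [Int[] for _ in 1:nchunks]
    Vs = [Float64[] for _ in 1:nchunks]
    fs = [zeros(ndof) for _ in 1:nchunks]

    Threads.@threads for c in 1:nchunks
        for e in c:nchunks:nel   # round-robin split of the elements
            dofs = (e, e + 1)
            for i in 1:2, j in 1:2
                push!(Is[c], dofs[i])
                push!(Js[c], dofs[j])
                push!(Vs[c], ke[i, j])
            end
            fs[c][dofs[1]] += fe[1]
            fs[c][dofs[2]] += fe[2]
        end
    end

    # sparse() sums duplicate (i, j) entries, which merges the per-chunk contributions.
    A = sparse(reduce(vcat, Is), reduce(vcat, Js), reduce(vcat, Vs), ndof, ndof)
    f = reduce(+, fs)
    return A, f
end

A, f = assemble(1000)
```

No boundary conditions are applied here, so this A is singular. The sketch only covers assembly, not the solve. Collecting triplets per chunk and letting `sparse` sum the duplicates avoids locking on a shared matrix.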

One package is missing from the list above: MUMPS.jl. We have had good experience using MUMPS to make things parallel, so it could also work.

What we are looking for, and what we are missing, is a good workflow for solving big problems. A big extra would of course be to make everything ready for cloud computing, since one can always buy more computational power from cloud services when our own machines are not big enough for big problems. Given a credit card, A and b, what is the solution?
