What we are looking for is some kind of solution to a fairly standard task: the parallel solution of large physical problems discretized using e.g. the finite element method. A typical cluster setup is 1-20 machines, each with 12-48 cores and roughly 24-1024 GB of memory. The cluster can be considered homogeneous, so every machine in a computational job is the same. For example, a physical problem might be solved on 10 machines, each with 24 cores and 128 GB of memory, for a total of 240 cores and 1280 GB. A setup like this can easily solve problems with around 10-15 million DOFs, and these are the resources we should be utilizing to get solutions for big problems.

I can only describe the solution cycle verbally, but my understanding is that, to use all of each calculation node's performance efficiently, we need to do the following (a minimal sketch in Julia follows the list):

- assemble machine i's part of the larger coefficient matrix A and right-hand side b, let's call those A_i and b_i, threaded locally on machine i
- solve the global system Ax = b
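
As a concrete illustration, here is a minimal sketch of the first step, assuming MPI.jl is used with one rank per machine and Julia threads for the element loop. The sizes and the toy 1D "elements" are placeholders, not a real FEM kernel:

```julia
# Sketch of step 1: each MPI rank assembles its own A_i and b_i, with
# threads doing the element loop. The toy 1D "elements" stand in for a
# real FEM assembly kernel.
using MPI, SparseArrays
using Base.Threads: @threads, nthreads, threadid

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)  # would select this machine's subdomain

n_local = 10_000           # placeholder size of this rank's subdomain
elements = 1:n_local-1     # toy 1D elements connecting nodes e and e+1

# One COO buffer per thread, so threads never write to shared storage.
Is = [Int[] for _ in 1:nthreads()]
Js = [Int[] for _ in 1:nthreads()]
Vs = [Float64[] for _ in 1:nthreads()]

@threads :static for e in elements
    t = threadid()
    # Toy 2x2 element stiffness for the node pair (e, e+1).
    for (i, j, v) in ((e, e, 1.0), (e, e+1, -1.0), (e+1, e, -1.0), (e+1, e+1, 1.0))
        push!(Is[t], i); push!(Js[t], j); push!(Vs[t], v)
    end
end

A_i = sparse(vcat(Is...), vcat(Js...), vcat(Vs...), n_local, n_local)
b_i = zeros(n_local)       # RHS assembly would follow the same pattern

MPI.Finalize()
```

Note that `sparse(I, J, V, m, n)` sums duplicate entries, which is exactly what FEM assembly needs when neighboring elements touch the same node.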

To be efficient, this cycle needs to be “mixed parallel”, as proposed by @Gregstrq: domain decomposition is done into only N pieces, where N is the number of available computers, and inside each node threading is used to make all operations efficient, so that (in the example) all 24 cores are utilized for constructing the final matrices.
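
In practice this mapping is fixed at launch time: one MPI rank per machine and one thread per core, e.g. something like `mpiexec -n 10 julia -t 24 driver.jl` for the 10-machine, 24-core example (the script name here is made up, and the exact flags for placing one rank per node depend on the MPI implementation and the cluster scheduler).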

One package is missing from the list above: MUMPS.jl. We have had good experience using MUMPS to make things parallel, so it could work here too.
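
For the solve step, the usage pattern from the MUMPS.jl README is roughly the following (worth verifying against the installed version; the random matrix is just a stand-in for the assembled system, and whether the distributed A_i could be fed to MUMPS directly in its distributed-input mode is something to check):

```julia
# Sketch of step 2 with MUMPS.jl, following the pattern in its README.
# MUMPS spreads the factorization across the MPI ranks; the random
# matrix here is only a stand-in for the assembled global system.
using MUMPS, MPI, SparseArrays, LinearAlgebra

MPI.Init()
A = sprand(1_000, 1_000, 1e-2) + 10I  # stand-in for the assembled A
b = rand(1_000)                       # stand-in for the assembled b
x = solve(A, b)                       # analyze + factorize + solve via MUMPS
MPI.Finalize()
```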

What we are looking for, and what we are missing, is a good workflow to solve big problems. A big extra for this setting would of course be to have everything ready for cloud computing, since one can always buy more computational power from cloud services when one's own machines are not big enough for big problems. Given a credit card, A, and b, what is the solution?