Finite Element Computations On A Cluster Using Petsc.jl

I want to write a parallel, distributed isogeometric analysis solver for linear and
non-linear elastostatics problems using Julia. For this purpose I need access to
distributed linear and non-linear solvers. My first thought was to use the
Petsc.jl package.

My question is whether anybody has used Petsc.jl to do computations on a cluster
using distributed arrays, etc. If so, what are the difficulties in running Julia in
combination with PETSc on a large cluster?

Regarding Petsc.jl, the linear solver capability is there, but the nonlinear solvers are not. I believe it also requires Julia v0.4. My research has kept me quite busy recently, so I haven’t had time to work on Petsc.jl in a while, and I suspect I won’t have time in the near future.

Regarding writing a parallel FE solver and running it on a cluster, I have done this with my research code (I work with summation-by-parts operators on unstructured grids, which are strictly speaking finite difference operators, but in implementation are very much like finite elements). A few things to watch out for:

  1. Building Julia (for the compute nodes)
    By default, the Julia build system looks at the hardware of the computer it is being built on and compiles itself for that architecture. On clusters, you typically build software on the front-end node but run it on the compute nodes, which may have different architectures. So you will likely have to override Julia’s architecture detection and tell it what architecture the compute nodes are (a quick way to check what the compute nodes actually report is sketched after this list).

  2. Internet access (or lack thereof)
    I don’t know how your cluster is configured, but the ones I have access to don’t allow the cluster to initiate outgoing internet connections. That means when building Julia I had to run make -C deps getall on another system so that Julia downloaded everything it needed, then copy the files over to the cluster and build Julia there.
    This also means Pkg.add doesn’t work on the cluster. To work around this, I had to write my own little build system to download all dependencies, let each dependency do whatever internet access it needs, and then tarball all the files and copy them over to the cluster (a rough sketch of this bundling idea, using today’s package manager, follows after this list).

  3. Using system libraries (whenever possible)
    Julia packages that have binary dependencies tend to download and build them inside the package’s deps/ directory. On clusters, I would use system libraries whenever possible, because clusters sometimes have strange configurations and things don’t compile with default settings. You’ll have to check whether the Julia packages you depend on will use system libraries if they are available (a sketch of pointing MPI.jl at a system MPI follows after this list).

  4. System wide package installation
    If there are multiple people using your code on the cluster, you might want to install all your Julia dependencies in a central location to be used by everyone. Unfortunately, I don’t think there is a good way to do that with the current package manager. See discussion here: System wide Pkg installation - #8 by JaredCrean2

  5. MPI vs Julia’s native parallelism
    PETSc uses MPI, so you’ll definitely need MPI, but for your solver you have the choice of using MPI directly or using Julia’s native parallelism, which uses the Remote Procedure Call (RPC) paradigm. In my opinion (people may disagree with this point), RPC isn’t the right paradigm for expressing the algorithms used in parallel finite element codes. Message passing is much more natural. I would use MPI directly (see MPI.jl) rather than Julia’s RPC (a minimal MPI.jl sketch follows after this list).
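
For point 1: to figure out what to tell the build system, you can compare what Julia detects on the front-end node against what it detects inside a job on a compute node. Below is a minimal sketch for a recent Julia version; the actual override then goes into Make.user (the Julia build documentation describes variables such as MARCH and JULIA_CPU_TARGET for this).

```julia
# Run this on the front-end node and again inside a one-core batch job on a
# compute node, and compare the output. If the two differ, override the
# architecture when building Julia for the compute nodes.

# LLVM name of the CPU microarchitecture Julia detected (e.g. "haswell").
println("CPU target detected by Julia: ", Sys.CPU_NAME)

# Human-readable CPU model string as reported by the operating system.
println("CPU model: ", Sys.cpu_info()[1].model)
```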
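
For point 2: this post predates the current package manager, but the same "download elsewhere, copy over" idea can be sketched with today's Pkg and depot mechanism. Everything below is a rough, hypothetical sketch, not JaredCrean2's build system; the package names and paths are placeholders.

```julia
# On a machine WITH internet access, start Julia as
#   JULIA_DEPOT_PATH=/tmp/cluster_depot julia bundle_deps.jl
# so everything Pkg downloads (packages, registries, artifacts) lands in one
# self-contained directory that can be copied to the cluster.
import Pkg

Pkg.activate("/tmp/cluster_depot/project")  # keep the project/manifest inside the depot
Pkg.add(["MPI", "PETSc"])                   # whatever your solver needs (placeholders)

# Tarball the whole depot for transfer. Precompile caches are machine-specific
# and will simply be regenerated on the cluster.
run(`tar -czf cluster_depot.tar.gz -C /tmp cluster_depot`)

# On the cluster: unpack, point JULIA_DEPOT_PATH at the unpacked depot, activate
# the same project, and call Pkg.offline(true) so nothing tries to reach the internet.
```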
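
For point 3: as one concrete present-day example of preferring system libraries, MPI.jl can be told to use the cluster's own MPI installation instead of a downloaded binary via MPIPreferences. This is a sketch of the modern mechanism, not something that existed when this thread was written.

```julia
# Run once in the project you will use on the cluster. MPIPreferences records
# a preference telling MPI.jl to load the system libmpi / mpiexec instead of
# the bundled MPI_jll artifact.
import Pkg
Pkg.add("MPIPreferences")

using MPIPreferences
MPIPreferences.use_system_binary()   # locates the system MPI and writes LocalPreferences.toml
```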
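
For point 5, here is a minimal sketch of what the message-passing style looks like with MPI.jl: a nearest-neighbour exchange of ghost values around a ring of ranks, which is the typical communication pattern in a parallel FE code. It uses the keyword-argument Send/Irecv! API of recent MPI.jl releases (older versions used positional arguments); the ring "mesh" and the buffer contents are just placeholders.

```julia
using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

# Pretend each rank owns a partition of the mesh and must exchange ghost
# values with its neighbours; here the "mesh" is just a 1D ring of ranks.
left  = mod(rank - 1, nprocs)
right = mod(rank + 1, nprocs)

send_right     = fill(Float64(rank), 4)  # values our right neighbour needs
recv_from_left = similar(send_right)     # buffer for our left neighbour's values

# Post the receive first, then the blocking send, to avoid deadlock on the ring.
req = MPI.Irecv!(recv_from_left, comm; source=left, tag=0)
MPI.Send(send_right, comm; dest=right, tag=0)
MPI.Wait(req)

println("rank $rank got ghost values from rank $left: ", recv_from_left)

MPI.Finalize()
```

Run it with something like mpiexec -n 4 julia ring.jl (or MPI.jl's mpiexecjl wrapper).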

I have done simulations with up to 640 cores, and everything works. I have done a scaling study up to 288 cores and got 88% strong scaling efficiency, which I think is pretty good for that case (solving linear advection with an explicit Runge-Kutta method). I guess the summary of this would be that getting everything built on the cluster is a pain, but after that everything works pretty well.


Thanks JaredCrean2 (https://discourse.julialang.org/users/jaredcrean2) for your extensive reply. Great work regarding PETSc.jl. I think it could benefit many people in our community.