Implementing parallel computing in finite element problems

Hi all,

I am very new to parallel programming, and relatively new to Julia. I have a spectral element code (higher-order finite element) for numerical simulation of earthquakes that I wish to parallelize.

I have read the manual, and I understand the usage of individual macros such as @spawn, @parallel for, etc., but I do not know the workflow for applying them to a larger code.

My pseudocode looks like this:

  1. Define parameters.
  2. Setup (meshing, boundary conditions, etc.).
  3. Time loop:

    while time < t_max:
        # some simple vectorized computations
        # a conjugate gradient solve of the linear system
        # some other vectorized computations
        # a series of for loops for element-wise computations
        # save output variables
        # compute the next timestep
    end while

It is a 2D code, but it takes around 70-80 hours to run at a decent resolution (on a laptop). I have timed each section of the code: the conjugate gradient takes the most time, followed by the for loops. The conjugate gradient has some nested loops which I cannot avoid for now.

My question is:

  1. Is it enough to just have @parallel for loops throughout the code to improve performance, or is it possible to parallelize the time loop as well?

  2. Should I be looking at MPI instead of Julia’s inbuilt macros for this kind of problem?

I know I have not given a whole lot of information, so here is the main function of my actual code:
https://github.com/thehalfspace/JuliaSEM/blob/develop/src/main.jl

It is kind of messy, sorry for that.

Any tips, pointers, ideas would be really helpful!

Thanks,
Prithvi

Disclaimer: I have not written distributed FEM code before, but I may very soon.

A few points that you may find useful:

  1. A conjugate gradient solver, cg, is defined in IterativeSolvers.jl.
  2. DistributedArrays.jl defines distributed map and mapreduce functions, as well as some other distributed linear algebra functions like dot and mul!. These are all you need to make a distributed CG algorithm.
  3. IterativeSolvers.jl does not yet fully support arbitrary array types, so if you try to pass a DistributedArray to the cg function, you will get some errors. When you do, you can open an issue on the GitHub repo to discuss how to fix it, and a few people there might help you get things up and running. In theory, IterativeSolvers.jl and DistributedArrays.jl can work seamlessly together, but we are just a few type parameters away from that, I think. So get experimental and open issues.
  4. If your “matrix” is not an explicit matrix but in matrix-free form, you can define your own operator and pass it to cg (see the sketch after this list). This operator can still make use of DistributedArrays to store element information, for example.
  5. The map function for DistributedArrays might help you parallelize the other parts of your code, or the matrix-free multiplication operator that you pass to cg.
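
To make point 4 concrete, here is a minimal sketch of a matrix-free operator, assuming cg only needs size, eltype, and the action y = A*x through mul!; the StiffnessOperator name and the 1D stencil inside mul! are placeholders for your real element-wise stiffness product:

    using LinearAlgebra, IterativeSolvers

    # Hypothetical matrix-free operator; the fields are illustrative only.
    struct StiffnessOperator
        n::Int       # number of degrees of freedom
        k::Float64   # example material coefficient
    end

    Base.size(A::StiffnessOperator) = (A.n, A.n)
    Base.size(A::StiffnessOperator, i::Int) = A.n
    Base.eltype(::StiffnessOperator) = Float64

    # cg only ever needs the action y = A*x; a 1D Laplacian-like stencil
    # stands in here for the real element-by-element stiffness product.
    function LinearAlgebra.mul!(y::AbstractVector, A::StiffnessOperator, x::AbstractVector)
        n = A.n
        @inbounds for i in 1:n
            left  = i > 1 ? x[i-1] : 0.0
            right = i < n ? x[i+1] : 0.0
            y[i] = A.k * (2x[i] - left - right)
        end
        return y
    end

    Base.:*(A::StiffnessOperator, x::AbstractVector) = mul!(similar(x), A, x)

    A = StiffnessOperator(100, 1.0)
    b = rand(100)
    x = cg(A, b)   # solves A*x ≈ b without ever forming A explicitly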

Good luck!


If I understand your setting correctly, you are solving the equations of motion using an explicit integrator, but because damping is included, a system of linear algebraic equations needs to be solved at each time step?


I would first speed up the serial code before even looking at parallelism. There are too many allocations and such in there, and that wouldn't scale well. There's a lot you can do even before parallelizing.

Once you have efficient serial code, you can still improve the algorithms. You're running explicit Euler, so you can massively reduce the cost by increasing the stepsizes with some other explicit algorithm. Throw it into DifferentialEquations.jl and replace the linear solver with something like the CG in there, or something like that.
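
For instance, a minimal sketch of handing the time integration to DifferentialEquations.jl; rhs! is a hypothetical stand-in for your equations of motion, and Tsit5 is just one adaptive explicit method you could try:

    using DifferentialEquations

    # Hypothetical right-hand side; replace with your equations of motion.
    function rhs!(du, u, p, t)
        du .= -p .* u   # placeholder dynamics
    end

    u0 = rand(10)
    prob = ODEProblem(rhs!, u0, (0.0, 1.0), 0.5)               # 0.5 is a placeholder parameter
    sol = solve(prob, Tsit5(); abstol = 1e-8, reltol = 1e-8)   # adaptive timestepping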

Then swap the arrays to DistributedArrays.jl or MPIArrays.jl to parallelize the broadcasts. Then you should have it all, with a few upstream PRs to IterativeSolvers.jl as @mohamed82008 said.
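
As a rough sketch of what that array swap can look like with DistributedArrays.jl (map over DArrays runs on each worker's local chunk; the arithmetic here is a placeholder):

    using Distributed
    addprocs(4)   # spawn 4 worker processes
    @everywhere using DistributedArrays, LinearAlgebra

    a = drand(10_000)   # random vectors distributed across the workers
    b = drand(10_000)
    c = map((x, y) -> x + 2y, a, b)   # applied in parallel, chunk by chunk
    s = dot(a, b)                     # distributed reduction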

But the biggest changes are just changing the algorithm and optimizing the serial code.


Echoing @ChrisRackauckas: that is what Aviva said in her keynote at JuliaCon. Concentrate on the serial code before parallelism.


Thanks for the reply.

Yes, you are right, but instead of damping I have a differential-algebraic equation to solve. I am solving the equation of motion, where the velocity term is implicitly defined as a function of the friction state variables.

The timestepping in my setting is adaptive, and it is inversely proportional to the velocity: I need very small timesteps (a fraction of a second) during earthquake onset, and large timesteps (a fraction of a year) during the quasi-static period between two earthquakes. Therefore we have two different implementations. The quasi-static one uses Heun's method (a second-order Runge-Kutta method) for the time update. The dynamic part uses a modified multistage predictor-corrector strategy to update time.
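
(For reference, a generic Heun step, i.e. the explicit trapezoidal rule, looks like this; f here is just a placeholder right-hand side:)

    # Heun's method: a two-stage, second-order explicit Runge-Kutta step.
    function heun_step(f, u, t, dt)
        k1 = f(u, t)                        # slope at the start of the step
        k2 = f(u .+ dt .* k1, t + dt)       # slope at the predicted endpoint
        return u .+ (dt / 2) .* (k1 .+ k2)  # average of the two slopes
    end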

Ref: https://agupubs.onlinelibrary.wiley.com/doi/pdf/10.1029/2011JB008395

I am still learning the complete mathematics of the problem. I have borrowed a code from MATLAB, which I translated to Julia with some modifications, and that has already made it more than twice as fast.

Thanks for the suggestion. I am still in the process of learning the algorithms (Sorry, I’m not from a math background).


I tried to keep all the constant values in structs, and to do all the dynamic allocations in main before the while time loop, so I do not have any allocations inside the while loop itself. I imagined that even if the pre-allocations take some time, the while loop should not.

Is there a way to reduce the allocations in the present algorithm? I would definitely try to update to better algorithms as you suggest.

The one thing I cannot change is the number of timesteps. I am wondering how to go about increasing the speed in such a case.

I will definitely check out cg from IterativeSolvers.jl.
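
From the docs it looks like the in-place cg! would fit my preallocation pattern, something like this minimal sketch (the random SPD matrix is just a placeholder):

    using IterativeSolvers, LinearAlgebra

    n = 1000
    A = Symmetric(rand(n, n)) + n * I   # placeholder SPD matrix
    b = rand(n)
    x = zeros(n)                        # allocated once, outside the time loop

    for step in 1:10
        # ... update b from the current state ...
        cg!(x, A, b)                    # overwrites x with the solution in place
    end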

Thanks for the help!