# Implementing parallel fluids code in Julia (parallel FFTW, MPI.jl, ...)

#1

I’m considering implementing a parallel fluid simulation code in Julia: either reimplementing my Fourier-Chebyshev http://www.channelflow.org code, or starting with isotropic turbulence in a triply-periodic cube as a simpler test case.

But it’s not clear to me how to do it with Julia’s current parallel infrastructure. MPI.jl appears to be limited to julia-0.4 (https://github.com/JuliaParallel/MPI.jl), and this 2014 post says Julia does not yet support distributed-memory FFTW calls (https://groups.google.com/forum/#!topic/julia-users/ww-57NucHKA).

I would actually like to leave MPI behind and do parallelism at a higher level with distributed arrays and native Julia parallel constructs. But distributed-memory FFTW is a must, including FFTW’s distributed-memory transpose functions.

Anyone out there with relevant Julia expertise and experience in this kind of distributed-memory scientific computation? Tips and pointers would be much appreciated. @stevengj?

#2

Look at ClusterManagers.jl

From what I have read, I think you can use this to make Julia’s internal parallelism use the fast (InfiniBand) interconnects of an HPC cluster, getting you the right speed while also giving you an easier programming model. Otherwise Julia falls back to TCP?

That said, I forget where I read that… and haven’t given it a try. Hopefully, someone does know this and will clarify. I am interested in starting to do something like this as well.

#3

MPI.jl works fine with 0.5. It probably needs some updates before it works with 0.6, though. The comment about 0.4 only covers the MPIManager, and it was added when 0.3 was the latest released version of Julia.

You are not the first to show interest in distributed-memory FFTW, but I’m not aware of anybody who has wrapped the functionality yet. It might be possible to get it to work with a combination of MPI.jl and DistributedArrays.jl, but it will require some work.

#4

Are the Julia constructs compared somewhere against MPI? How does this work together with e.g. Infiniband? Or is MPI still used in the backend somehow?

#5

+1 on the original question. I’d love to see an experience report from a Julia port of a “standard” scientific computing application using e.g. MPI+OpenMP with FFTs and distributed linear algebra, run on supercomputers. For all the talk about Julia being “Designed for parallelism and distributed computation”, this is strangely lacking.

#6

The Cluster Managers with custom transports section in the Julia documentation has a little information on creating custom transport layers for Julia’s native distributed-memory parallelism. There’s an example in the Julia distribution of an implementation using the ZMQ library. I’ve searched the internet thoroughly for an InfiniBand implementation but haven’t been able to find one. I looked into doing it myself, and it looks like a lot of work. However, I know very little about networking, so maybe it wouldn’t be so bad for an expert.

On a personal note, I’ve found Julia’s native parallel programming constructs to work great for very coarse-grained, embarrassingly parallel calculations. I’d love to see some examples of pure Julia implementations of algorithms with more complex communication patterns. Coming from an MPI programming background, the Julia constructs seem quite awkward for this, but it could well be that I’m just missing something; I should really get a deeper understanding of Channels and RemoteChannels.
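For anyone else coming from MPI, the basic producer–consumer pattern with Julia’s Channels looks like this. This is a minimal single-process sketch (the function name `produce_squares` is mine, not from any package); `RemoteChannel` follows the same `put!`/`take!` pattern across worker processes.

```julia
# One task produces values into a buffered Channel; the consumer simply
# iterates the channel until the producer closes it.
function produce_squares(n)
    ch = Channel{Int}(n)          # buffered channel with capacity n
    @async begin
        for i in 1:n
            put!(ch, i^2)         # send a value
        end
        close(ch)                 # signal completion to consumers
    end
    return ch
end

results = collect(produce_squares(4))   # [1, 4, 9, 16]
```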

#7

@ChrisRackauckas and @andreasnoack, thanks for the clarifications and advice.

@antoine-levitt: Agreed. I have high hopes for Julia as a game-changer in scientific computing, and I think that a real distributed-memory scientific HPC code would be an important demonstration of Julia’s potential to replace the ageing and painful C/Fortran/MPI framework. I recently gave a “Why Julia?” talk to a group of fluid dynamicists, and this was the main response I got: show me benchmarks of a Julia HPC code against C and Fortran.

I think my work is cut out for me: try to get distributed-memory FFTW in Julia working and coordinated with DistributedArrays.jl and MPI.jl. Once that’s in place, write a parallel fluid simulation code. Piece of cake!

#9

@stevengj : What are the prospects for getting distributed-memory FFTW in Julia working with DistributedArrays? Is there a fundamental structural clash between C/MPI FFTW and DistributedArrays.jl that makes it difficult and unlikely, or is it clear how to do it and just not done yet?

Sorry to poke you individually, but you’re the expert on this. I’m willing to do the work if I can, but I will need some guidance.

#10

At first glance, I don’t think there should be any particular problem, since DistributedArrays.jl lets you specify the distribution. So, you would specify the same distribution as FFTW, adjusted for row-major vs. column-major (i.e. divided along the last dimension in Julia, corresponding to the first dimension in FFTW), create a plan, and execute it on the local part of the array in each process.
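To make the distribution-matching concrete: FFTW’s MPI interface by default divides the slowest (row-major first) dimension into contiguous blocks of size ⌈n/P⌉, which in Julia’s column-major layout corresponds to the *last* dimension. A sketch of that arithmetic (the helper `slab_range` is hypothetical, not part of FFTW or DistributedArrays.jl) shows the index range each rank would own, i.e. the distribution a matching DArray would need to specify:

```julia
# FFTW-style default slab decomposition: blocks of cld(n, nprocs) indices,
# assigned to ranks in order; trailing ranks may own fewer or zero indices.
function slab_range(n::Int, nprocs::Int, rank::Int)   # rank is 0-based, MPI style
    block = cld(n, nprocs)                # FFTW's default block size
    lo = rank * block + 1
    hi = min(n, (rank + 1) * block)
    lo > n ? (1:0) : (lo:hi)              # ranks past the data own nothing
end
```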

A slight annoyance is building the FFTW dependency, because Julia already ships with the serial FFTW (although this is likely to be split into a separate package in the future). You would need to make sure to build the FFTW MPI library using the same version of FFTW as is shipped with Julia, since the MPI FFTW calls the serial FFTW.

#11

I’m curious about your priorities here. Unless you have enormous domains in view, it would seem that the best target is a cluster with at least a few nodes having enough memory to hold an entire scalar field, so all transforms can be done locally, ideally with GPU or Phi acceleration. (For the record, I use FFTW almost every day, and am eternally grateful for its existence, but it’s not the only player in town.) You have to deal with data distribution for field products, time stepping, and so on anyway, so why start with transforms?

#12

The classic case of channel flow has a fluid in a 3d rectangular box with rigid walls at y = ±1. The boundary conditions are no-slip, i.e. zero velocity at the walls, and periodic in the wall-parallel directions x, z. The appropriate spatial discretization is then a Fourier × Chebyshev × Fourier spectral expansion in x, y, z. At each time step you have to Fourier transform in each of the three directions (a real cosine transform in y for the Chebyshev expansion) in order to compute the nonlinear term u · ∇u in the Navier-Stokes equation. To do this you transpose the data in distributed memory so that it is x-major, do Ny × Nz local transforms in x, then transpose so the data is y-major, do Nx × Nz local y transforms, and so on. Distributed-memory FFTW has all this worked out: you call an appropriate 3d transform function and it does all the distributed-memory transposes and local transforms for you.
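The transform-transpose-transform pattern can be sketched serially: a 3d DFT computed as 1d transforms along the leading dimension, with `permutedims` standing in for the distributed transpose between stages. This is only an illustration of the data movement (the naive O(n²) `dft_dim1` below stands in for the local FFTW calls a real code would make):

```julia
# Naive 1-d DFT of every pencil along dimension 1.
function dft_dim1(A::Array{ComplexF64,3})
    n = size(A, 1)
    F = [exp(-2im * pi * (j - 1) * (k - 1) / n) for j in 1:n, k in 1:n]
    B = similar(A)
    for k in axes(A, 3), j in axes(A, 2)
        B[:, j, k] = F * A[:, j, k]       # local 1-d transform of one pencil
    end
    return B
end

# 3-d DFT as: transform in x, "transpose" so y leads, transform in y, etc.
function dft3(A::Array{ComplexF64,3})
    B = dft_dim1(A)                       # transform in x
    B = permutedims(B, (2, 3, 1))         # serial analogue of the distributed transpose
    B = dft_dim1(B)                       # transform in y
    B = permutedims(B, (2, 3, 1))
    B = dft_dim1(B)                       # transform in z
    permutedims(B, (2, 3, 1))             # back to (x, y, z) ordering
end
```

In the distributed version each `permutedims` becomes an all-to-all block exchange, which is exactly the piece FFTW’s MPI transpose routines provide.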

P.S. How do you turn on LaTeX markup in Discourse?

#13

For even very modest-sized spatial domains and discretizations, say 128 × 129 × 128, distributed memory and parallel processing are a win in execution speed, even though that amounts to only a few hundred megabytes of data. But simulations of thousands × hundreds × thousands of grid points are commonplace, and for these distributed memory is a necessity.

#14

The transforms are the most complicated part of the algorithm from a data-management perspective, because of all the distributed-memory data transposes necessary for the multi-d FFTs. If I can get the distributed-memory FFTs and transposes under control, the rest of the algorithm is totally manageable, even though it’s more complex mathematically. For an Nx, Ny, Nz grid, the rest of the algorithm is just Nx * Nz independent and trivially parallelizable 1d boundary value problems in y.
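The “Nx × Nz independent 1d boundary value problems in y” can be sketched as follows, here with a second-order finite-difference Helmholtz solve (u″ − λu = f, u(±1) = 0) standing in for the Chebyshev tau or collocation solver a real spectral code would use; the function name and layout are mine, for illustration only. Each (kx, kz) mode gets its own tridiagonal system, so the loop is trivially parallelizable (e.g. with `Threads.@threads` or one mode range per worker):

```julia
using LinearAlgebra

# Solve u'' - lambda*u = f with u(-1) = u(1) = 0 independently for each
# (kx, kz) mode, on Ny interior points of a uniform grid in y.
function solve_helmholtz_modes(f::Array{Float64,3}, lambda::Matrix{Float64})
    Nx, Ny, Nz = size(f)
    y = range(-1.0, 1.0; length = Ny + 2)        # grid including boundary points
    h = step(y)
    u = similar(f)
    for kz in 1:Nz, kx in 1:Nx                   # independent 1-d problems
        d  = fill(-2.0 / h^2 - lambda[kx, kz], Ny)   # diagonal entries
        od = fill(1.0 / h^2, Ny - 1)                 # off-diagonal entries
        A  = Tridiagonal(od, d, copy(od))
        u[kx, :, kz] = A \ f[kx, :, kz]          # one tridiagonal solve per mode
    end
    return u
end
```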

#15

Hi @John_Gibson ,

I’m also considering implementing a parallel fluid simulation code in Julia, more specifically the isotropic case. Have you had any success getting distributed-memory FFTW working in Julia? I believe the only missing feature I need is the distributed transpose (ideally in-place); the rest could be done with the serial FFTW.
I’m planning on tackling the distributed transpose problem first. That is, if no one has implemented it yet…
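The exchange the distributed transpose has to do can be sketched serially: each “rank” p starts with a z-slab of the global array and ends with an x-slab, and the all-to-all step just moves one block per (p, q) pair. This toy version (the name `slab_transpose` is mine; it assumes P divides N) copies the blocks with a loop where real code would use MPI_Alltoall or FFTW’s transpose plans:

```julia
# slabs[q] holds A[:, :, zrange(q)] for an N×N×N global array split over
# P = length(slabs) ranks; the result out[p] holds A[xrange(p), :, :].
function slab_transpose(slabs::Vector{Array{Float64,3}}, N::Int)
    P = length(slabs)
    nb = div(N, P)                        # slab thickness (assumes P divides N)
    out = [Array{Float64}(undef, nb, N, N) for _ in 1:P]
    for p in 1:P, q in 1:P
        # block that rank q "sends" to rank p: x-range of p, z-range of q
        out[p][:, :, (q-1)*nb+1 : q*nb] = slabs[q][(p-1)*nb+1 : p*nb, :, :]
    end
    return out
end
```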

#16

I would suggest doing a serial version first, both as a proof of concept and a reference for benchmarking any future parallel versions.

#17

I have not started coding yet. I spoke to a number of people about this problem at JuliaCon last week. Andreas Noack and Viral Shah both indicated that DistributedArrays should be able to do the transposes via a permutation function.

If you do the periodic-cube isotropic problem, I’ll do the channel-geometry problem.

#18

You might be right about that. There’s plenty of work to do just to get the mathematical part of the algorithm working, besides the parallelism. There’s no question that Julia is ready for the serial algorithm, and it would be a good process for me to learn to use the language properly.

Maybe I’ll approach it from both ends, developing the math part serially while experimenting separately with parallel transforms and transposes.

#19

Please keep us updated if and when you do that. I’m particularly curious how feasible it is to build a working complex code entirely out of reusable parts (e.g. discretize your problem using ApproxFun, solve it using DiffEq/NLsolve…). Those packages look flexible and open enough that it might be possible.

#20

I’ll definitely keep you & this thread posted.

At this point I don’t think using ApproxFun and DiffEq is the right approach. The time-stepping and spatial discretization methods for these kinds of algorithms are just too tightly intertwined. (see pages 29-35 of http://channelflow.org/dokuwiki/lib/exe/fetch.php?media=docs:chflowguide.pdf)

I would be interested in trying to combine ApproxFun and DiffEq to solve a simpler test case of a spectral PDE algorithm, say the 1d Kuramoto-Sivashinsky equation on a periodic domain, or on a bounded domain with Dirichlet or Neumann BCs. If it works there, it’d be worth thinking about for Navier-Stokes. I have already implemented Kuramoto-Sivashinsky in a periodic domain in Julia (in order to benchmark it against the same algorithm in Matlab, Python, and C++; Julia wins!). I’d be happy to share that if anyone wants to take a shot at an ApproxFun+DiffEq implementation.