Excessive memory usage in parallel Julia job

I have ported the dynamical core of our C++ CFD code to Julia (GitHub - Chiil/MicroHH.jl: Julia implementation MicroHH) and have been implementing MPI.jl over the past few days. I ran my first few large test jobs (1024 cores on 8 EPYC nodes with 128 cores per node) and discovered that every running Julia task uses ~600 MB just for loading the code, before even making the memory allocations that contain the 3D fields, whereas our reference C++ code uses only 11 MB before allocation. On a cluster that has 2 GB per core available and that also needs some free memory for MPI buffers, this leaves me too little memory to run simulations with a sufficient number of grid points per task.

To verify that I had not allocated excessive fields, I ran the model on a 2 x 2 x 2 grid, but even then I see this usage. I understand that Julia cannot match the memory footprint of compiled C++ code, but what are good practices to significantly reduce the base memory use? If I could get it down to ~100 MB, I would have enough memory available to do a proper scaling simulation.

Peeking through your code, you appear to be using shared-memory multithreading, which should only require one Julia process per node rather than one per core. How are you allocating the required resources and launching the job on your cluster? If you’re using Slurm, you’ll want an invocation along these lines: https://www.carc.usc.edu/user-information/user-guides/hpc-basics/slurm-templates
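For a setup like yours (8 nodes with 128 cores each), a hybrid launch along those lines could look roughly like the sketch below. This is a config fragment, not a tested script; the partition name and the run_model.jl driver file are placeholders:

```shell
#!/bin/bash
#SBATCH --nodes=8              # 8 EPYC nodes
#SBATCH --ntasks-per-node=1    # one Julia process per node
#SBATCH --cpus-per-task=128    # all 128 cores available to that process
#SBATCH --partition=compute    # placeholder partition name

# Launch one Julia process per node, each using 128 threads,
# instead of 128 single-threaded processes per node.
srun julia -t $SLURM_CPUS_PER_TASK run_model.jl
```

That way the ~600 MB base footprint is paid once per node rather than once per core.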

My MPI code is in the mpi branch. The code supports both MPI and multithreading, and I am testing the optimal balance between the two. So far, one MPI task per core seems the fastest (as is nearly always the case in the CFD codes I have worked with), but Julia is pretty memory hungry.

Julia’s upcoming minor release (v1.8) should allow you to build a lightweight runtime without the BLAS buffers and compiler infrastructure that have historically been bundled with every Julia process, which should bring your per-process memory consumption down to the ~30 MB range, IIRC. That said, shared-memory multithreading should be faster, and performance tuning is best informed by profiling.

Separately, I noticed a few untyped struct fields in your code. If possible, annotate those fields with concrete types (or make the struct parametric) so the compiler can generate specialized code rather than boxing the fields and dispatching at runtime.
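To illustrate what I mean (a generic sketch, not taken from the MicroHH.jl source): a field without a type annotation is inferred as Any, so every access goes through a pointer and dynamic dispatch, while a parametric struct keeps accesses concrete:

```julia
# Untyped field: `data` is inferred as `Any`, so functions that
# touch it cannot be specialized and the value stays boxed.
struct FieldSlow
    data
end

# Parametric field: the element type is part of the struct type,
# so the compiler generates specialized, unboxed code.
struct FieldFast{T}
    data::Vector{T}
end

sum_data(f) = sum(f.data)

# sum_data(FieldFast{Float32}(rand(Float32, 1000))) is fully type-stable;
# sum_data(FieldSlow(rand(Float32, 1000))) is not. Compare the two with
# @code_warntype to see the difference.
```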

I looked a bit further into my simulation. I allocate ~260 MiB in total, while the total memory usage of the Julia process is 1.16 GB. If I could free some of that ~900 MiB of overhead, I could run my parallel jobs more efficiently.

julia> varinfo(imported=true, recursive=true, sortby=:size)
  name                        size summary                                     
  –––––––––––––––––––– ––––––––––– ––––––––––––––––––––––––––––––––––––––––––––
  Base                             Module                                      
  Core                             Module                                      
  Main                             Module                                      
  m                    257.079 MiB Model{Float32}                              
  MicroHH                1.439 MiB Module                                      
  InteractiveUtils     253.916 KiB Module                                      
  settings               7.112 KiB 1-element Vector{Dict{String, Dict{String}}}
  settings_d01           7.065 KiB Dict{String, Dict{String}} with 5 entries   
  settings_grid          4.609 KiB Dict{String, Any} with 7 entries            
  z                      4.039 KiB 1024-element Vector{Float32}                
  settings_boundary      594 bytes Dict{String, String} with 4 entries         
  settings_timeloop      535 bytes Dict{String, Float64} with 5 entries        
  settings_fields        497 bytes Dict{String, Real} with 2 entries           
  settings_multidomain   364 bytes Dict{String, Bool} with 1 entry             
  float_type             128 bytes DataType                                    
  n_domains                8 bytes Int64                                       
  zsize                    4 bytes Float32                                     
  ans                       1 byte Bool                                        
  in_progress               1 byte Bool                                        
  eval                     0 bytes eval (generic function with 1 method)       
  include                  0 bytes include (generic function with 2 methods)   
  make_grid                0 bytes make_grid (generic function with 1 method)  

What’s the memory footprint of a Julia process without running any of the code, just after the import/using statements?

1.16 GB is tiny btw…

Close to 385 MiB if I only do using MicroHH, which is the module that contains our code. It goes to 684 MiB if I initialize all data, and to ~1 GB if I run all solvers once.
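For anyone who wants to measure this themselves: Sys.maxrss() reports the peak resident set size of the current process in bytes, so you can bracket each stage. A small sketch; the commented-out initialization call is a placeholder, not the actual MicroHH.jl API:

```julia
# Print the peak resident set size at a given stage of the program.
report(stage) = println(stage, ": ", round(Sys.maxrss() / 2^20; digits=1), " MiB")

report("after startup")
using MicroHH
report("after using MicroHH")
# model = Model(settings)   # placeholder for the actual initialization
# report("after initialization")
```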