I have ported the dynamical core of our C++ CFD code to Julia (https://github.com/Chiil/MicroHH.jl) and have been implementing MPI.jl over the past few days. I ran my first few large test jobs (1024 cores on 8 EPYC nodes, 128 cores per node) and discovered that every running Julia task uses ~600 MB just for loading the code, before even allocating the 3D fields, whereas our reference C++ code uses only 11 MB before allocation. On a cluster with 2 GB of memory per core, which also needs to keep some memory free for MPI buffers, this leaves me with too little memory to run simulations with a sufficient number of grid points per task.
To verify that I am not allocating excessive fields, I ran the model on a 2 x 2 x 2 grid, but even then I see this usage. I understand that Julia cannot match the memory footprint of compiled C++ code, but what are good practices to significantly reduce the base memory use? If I could reduce it to ~100 MB, I would have enough memory available to do a proper scaling simulation.
Peeking through your code, it appears you are using shared-memory multithreading, which should only require one Julia process per node rather than one per core. How are you allocating the required resources and launching the job on your cluster? If you’re using Slurm, you’ll want an invocation along these lines: https://www.carc.usc.edu/user-information/user-guides/hpc-basics/slurm-templates
My MPI code is in the mpi branch. The code supports both MPI and multithreading, and I am testing the optimal balance between the two. So far, one MPI task per core seems to be the fastest (as is nearly always the case in the CFD codes I have worked with), but Julia is pretty memory hungry.
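To see where the balance lands for a given job layout, I use a small startup check along these lines (a minimal sketch, not part of MicroHH.jl) that prints how many MPI ranks and how many Julia threads per rank a run actually gets:

```julia
using MPI

# Print the MPI-rank / Julia-thread layout once, from rank 0, so the
# hybrid configuration of a job can be verified before the run starts.
function report_layout()
    MPI.Initialized() || MPI.Init()
    comm = MPI.COMM_WORLD
    if MPI.Comm_rank(comm) == 0
        println("MPI ranks: ", MPI.Comm_size(comm),
                ", Julia threads per rank: ", Threads.nthreads())
    end
end

report_layout()
```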
Julia’s upcoming minor release (v1.8) should allow you to build a lightweight runtime without the BLAS buffers and compiler infrastructure that have historically been bundled with every Julia process, which should bring your per-process memory consumption down to the ~30 MB range, IIRC. That said, shared-memory multithreading should be faster, and performance tuning is best informed by profiling.
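In the meantime, a custom system image may already take some pressure off, since each task then loads precompiled code instead of compiling MicroHH at startup (a sketch under the assumption that PackageCompiler works for your setup; I haven't measured how much it helps the resident memory itself):

```julia
using PackageCompiler

# Bake MicroHH and its dependencies into a system image, compiled once
# ahead of time instead of separately in every MPI task.
create_sysimage([:MicroHH]; sysimage_path="microhh.so")
```

Each task would then be launched with `julia -J microhh.so` pointing at your run script.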
Separately, I noticed a few untyped struct fields in your code. If possible, you should annotate those fields so the compiler can do its job, rather than having to unbox the fields at runtime.
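For example (the struct and field names here are made up for illustration, not taken from MicroHH.jl):

```julia
# Untyped field: the element type is unknown to the compiler, so accesses
# are boxed and dispatched at runtime.
struct FieldUntyped
    data
end

# Parametrically typed field: the layout is known at compile time, so the
# compiler can generate specialized, allocation-free access code.
struct Field{TF <: AbstractFloat}
    data::Vector{TF}
end
```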
I looked a bit further into my simulation. I allocate ~260 MiB in total, while Julia's total memory usage is 1.16 GB. If I could free some of the remaining ~900 MiB, I could run my parallel jobs more efficiently.
```
julia> varinfo(imported=true, recursive=true, sortby=:size)
  name                          size summary
  –––––––––––––––––––– ––––––––––––– ––––––––––––––––––––––––––––––––––––––––––––
  Base                               Module
  Core                               Module
  Main                               Module
  m                      257.079 MiB Model{Float32}
  MicroHH                  1.439 MiB Module
  InteractiveUtils       253.916 KiB Module
  settings                 7.112 KiB 1-element Vector{Dict{String, Dict{String}}}
  settings_d01             7.065 KiB Dict{String, Dict{String}} with 5 entries
  settings_grid            4.609 KiB Dict{String, Any} with 7 entries
  z                        4.039 KiB 1024-element Vector{Float32}
  settings_boundary        594 bytes Dict{String, String} with 4 entries
  settings_timeloop        535 bytes Dict{String, Float64} with 5 entries
  settings_fields          497 bytes Dict{String, Real} with 2 entries
  settings_multidomain     364 bytes Dict{String, Bool} with 1 entry
  float_type               128 bytes DataType
  n_domains                  8 bytes Int64
  zsize                      4 bytes Float32
  ans                         1 byte  Bool
  in_progress               1 byte  Bool
  eval                       0 bytes eval (generic function with 1 method)
  include                    0 bytes include (generic function with 2 methods)
  make_grid                  0 bytes make_grid (generic function with 1 method)
```
Memory usage is close to 385 MiB if I only do `using MicroHH`, which is the module that contains our code. It goes to 684 MiB if I initialize all data, and to ~1 GB if I run all solvers once.
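A minimal sketch of the kind of stage-by-stage check behind these numbers (the actual model setup and solver calls are elided):

```julia
# Peak resident set size of this process so far, in MiB.
rss_mib() = Sys.maxrss() / 2^20

println("baseline:       ", rss_mib())  # fresh Julia session
using MicroHH
println("after using:    ", rss_mib())  # code loaded and compiled
# ... initialize the model and allocate the 3D fields here ...
println("after init:     ", rss_mib())
# ... run all solvers once here ...
println("after one step: ", rss_mib())
```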