How to track total memory usage of Julia process over time

What is a good tool to track the memory usage of the Julia process over time? Ideally, I would like to find out how much memory is used by Julia at any given time, especially at peak times during, e.g., compilation, but also during the run of an application.

My motivation in this case is not to find out which part of the code allocates how much for performance reasons, but what the overall memory footprint of the Julia process is. The rationale is that when doing parallel runs with MPI.jl, we sometimes run into out-of-memory issues.

Our particular example: On a single node with 256 GiB of main memory we get an OOM error when running a simulation with Trixi.jl on 128 MPI ranks, even though the total application memory use is (conservatively) estimated to be < 50 GiB. Thus it seems that the memory overhead of the Julia process itself is more than 1 GiB (maybe due to memory required during the initial compilation pass?), but we would like to find out for sure.


The allocation profiler may be useful here to visualise the memory footprint of your application during runtime. There’s a good JuliaCon talk on it: Hunting down allocations with Julia 1.8's Allocation Profiler | JuliaCon 2022

I don’t think this will be exactly what you want, and I’m interested to see what others suggest.

As a quick question: if you are using a single node, is it possible to use multithreading instead of MPI? I think Trixi supports that. I suspect this will have a much lower memory footprint and be less likely to run out of memory. I wouldn’t be surprised if each process uses at least 500-1000 MB depending on the size of the packages loaded; this isn’t uncommon.

How about logging the high water mark as reported by the ps command? Run it in a loop with sleep of a second or so?
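
A minimal sketch of such a loop (assuming a Unix-like system where ps -o rss= reports the resident set size in KiB):

pid = getpid()
while true
  # Query the RSS of this process via `ps` and convert from KiB to MiB
  rss_mib = parse(Int, strip(readchomp(`ps -o rss= -p $pid`))) / 2^10
  @info "Current RSS (MiB)" rss_mib
  sleep(1)
end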

Thanks for the link, it was insightful to watch! However, as you already pointed out, it is not the solution I am looking for (I am not trying to track down memory allocations and where they occur in my code, but rather the overall memory footprint Julia has).

You are right - when running on such many-core CPU nodes, it does not make sense to use a pure MPI parallelization. However, in the past we’ve had some issues with MPI+multithreading on some machines and for certain MPI implementations, thus at the moment hybrid parallelization is not guaranteed to work (for us). Pure MPI parallelism is thus an important fallback (even though it comes with its own issues, of course).

Yes, doing something like this could be a fallback solution. However, it will probably not work in a multi-node setup on a cluster (at least not easily).

FYI, I had some very insightful discussions with @vchuravy on Slack, some of which I would like to share here (and to record it for future reference):

  1. He suggested a neat trick using a self-resurrecting finalizer to automatically print the maximum RSS (resident set size as reported by Julia’s Sys.maxrss(), which internally uses the C function getrusage):
julia> function setup_memuse_tracker()
          tracker = Ref(0)
          function mem_use(tracker)
            # Re-register the finalizer so it fires again after the next GC ("self-resurrecting")
            finalizer(mem_use, tracker)
            @info "Current memory used" Sys.maxrss()
            nothing
          end

          # Register the finalizer for the first time; it runs whenever `tracker` is garbage collected
          finalizer(mem_use, tracker)
          nothing
       end
setup_memuse_tracker (generic function with 1 method)

julia> setup_memuse_tracker()

julia> GC.gc()
┌ Info: Current memory used
└   Sys.maxrss() = 0x000000001280b000
  2. He suggested looking at the procfs file system, e.g., /proc/<pid>/statm. This is similar to the suggestion of using ps by @PetrKryslUCSD, but forgoes the need for an external program (a minimal sketch follows below this list).
  3. He further suggested looking at the output of Base.gc_num(), or alternatively at the values of Base.gc_live_bytes() and/or Base.jit_total_bytes().
  4. Finally, he suggested taking a look at the tool Arm Forge, which combines the Arm DDT and Arm MAP profiling tools. However, these are commercial and one has to pay for them.
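
For item 2, a minimal sketch of reading /proc/<pid>/statm directly could look like this (Linux only; statm reports sizes in pages, in the order size, resident, shared, text, lib, data, dirty, see proc(5)):

function statm_resident_mib(pid=getpid())
  # Read the whitespace-separated page counts and convert the resident set size to MiB
  fields = split(read("/proc/$pid/statm", String))
  pagesize = ccall(:getpagesize, Cint, ())        # typically 4096 bytes
  return parse(Int, fields[2]) * pagesize / 2^20  # resident set size in MiB
end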

Based on @vchuravy’s suggestions above, I gave it a shot and came up with two implementations in Julia.

The first one, meminfo_julia(), only uses built-in functions and thus should be relatively portable, while the second one, meminfo_procfs(), queries the procfs filesystem directly:

using Printf: @printf

function meminfo_julia()
  # @printf "GC total:  %9.3f MiB\n" Base.gc_total_bytes(Base.gc_num())/2^20
  # Total bytes (above) usually underreports, thus I suggest using live bytes (below)
  @printf "GC live:   %9.3f MiB\n" Base.gc_live_bytes()/2^20
  @printf "JIT:       %9.3f MiB\n" Base.jit_total_bytes()/2^20
  @printf "Max. RSS:  %9.3f MiB\n" Sys.maxrss()/2^20
end

function meminfo_procfs(pid=getpid())
  smaps = "/proc/$pid/smaps_rollup"
  if !isfile(smaps)
    error("`$smaps` not found. Maybe you are using an OS without procfs support or with an old kernel.")
  end

  rss = pss = shared = private = 0
  for line in eachline(smaps)
    s = split(line)
    if s[1] == "Rss:"
      rss += parse(Int64, s[2])
    elseif s[1] == "Pss:"
      pss += parse(Int64, s[2])
    elseif s[1] == "Shared_Clean:" || s[1] == "Shared_Dirty:"
      shared += parse(Int64, s[2])
    elseif s[1] == "Private_Clean:" || s[1] == "Private_Dirty:"
      private += parse(Int64, s[2])
    end
  end

  @printf "RSS:       %9.3f MiB\n" rss/2^10
  @printf "┝ shared:  %9.3f MiB\n" shared/2^10
  @printf "┕ private: %9.3f MiB\n" private/2^10
  @printf "PSS:       %9.3f MiB\n" pss/2^10
end

The output from both is as follows:

julia> meminfo_julia()
GC total:     29.837 MiB
GC live:      34.361 MiB
JIT:           0.017 MiB
Max. RSS:    183.168 MiB

julia> meminfo_procfs()
RSS:         190.602 MiB
┝ shared:      3.215 MiB
┕ private:   187.387 MiB
PSS:         187.696 MiB

It seems that the numbers obtained from Julia directly miss quite a bit of memory, likely because they do not take into account the size of the Julia runtime itself plus the shared libraries loaded by Julia. This can be verified by, e.g., running a second julia process on the same node and then querying meminfo_procfs() again:

julia> meminfo_procfs()
RSS:         190.605 MiB
┝ shared:     53.977 MiB
┕ private:   136.629 MiB
PSS:         162.266 MiB

In this case, ~50 MiB get shifted from private to shared RSS, i.e., this is approximately the memory required for shared libraries loaded by Julia itself (and not used by any other running program). During the first invocation, this is counted as private (since there is only one program using them); in the second instance it is shared (since two Julia processes are using them now). However, that still leaves ~100 MiB not accounted for (surely this is not the Julia executable alone?).

Thus, when looking at these two particular solutions, it seems like both approaches can give you valuable information from within Julia itself. One advantage of meminfo_julia is that it breaks down memory usage by category, with the downside that total memory use (RSS) is only reported as a maximum over time (i.e., non-decreasing). On the other hand, meminfo_procfs can get you real-time information on total (external) memory use, even for processes other than the currently running one. At the same time, it has a much higher performance impact itself (~3 ms for meminfo_procfs vs. ~2.8 µs for meminfo_julia; numbers with I/O disabled).
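
One way to obtain such timings (a sketch, assuming BenchmarkTools.jl is installed and taking "I/O disabled" to mean redirecting the printed output to devnull):

using BenchmarkTools

# Time both functions while discarding their printed output
@btime redirect_stdout(meminfo_julia, devnull)
@btime redirect_stdout(meminfo_procfs, devnull)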

EDIT: Fix implementation of meminfo_julia() based on remark by @mkoculak below.


It seems the method of @vchuravy (pure genius!) is going to tell us the high-water mark at each GC collection. Isn’t that superior to something that you have to call yourself?

Actually, I’m getting an error:

error in running finalizer: ErrorException("task switch not allowed from inside gc finalizer")   

Could that be the @info "Current memory used" Sys.maxrss()?

Hm, it works for me with Julia v1.8.3 on Linux and macOS. Which OS and Julia version are you using?

Shouldn’t the first line in meminfo_julia() have Base.gc_total_bytes()?

Yes, you are right. In fact, it should read Base.gc_total_bytes(Base.gc_num()), since gc_total_bytes expects an argument of type GC_Num. However, I found that this usually underreported memory usage, thus I left it out after some initial tests. The implementation above was not properly cleaned up, thanks for catching this - I fixed it in the post above.

Yeah, I also figured out the addition of Base.gc_num(), but I rather had the impression that it is overestimating the memory footprint:

GC total:   1160.298 MiB
GC live:      65.111 MiB
JIT:           0.872 MiB
Max. RSS:    343.137 MiB

Although it seems to match the combined memory I see in the task manager for the two Julia processes I have running (one being the LSP in VSCode, I assume).

Hm, this seems … fishy to me. As far as I understand, the maximum RSS should be a strict upper bound on the total memory use unless your system is already swapping to disk. Thus it seems to me that the value for GC total is overreporting something…
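
One possible explanation, assuming GC total counts cumulative allocation rather than currently live memory: a program that repeatedly allocates and frees would drive GC total up without RSS growing much. A quick sketch to check this:

before = Base.gc_total_bytes(Base.gc_num())
for _ in 1:100
  zeros(10^7)  # ~76 MiB each, most of which is garbage-collected along the way
end
# If this difference (in GiB) grows far beyond Max. RSS, GC total is cumulative
# and thus not an upper bound on the actual memory footprint
(Base.gc_total_bytes(Base.gc_num()) - before) / 2^30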

Yeah, so starting a new Julia REPL straight from PowerShell gives me:

GC total:     23.568 MiB
GC live:      27.080 MiB
JIT:           0.013 MiB
Max. RSS:    168.758 MiB

with the task manager reporting ~125 MB, so there is some interaction between GC total and the Julia extension/LSP (I have no idea how the LSP instance of Julia is connected to the REPL spawned by the VSCode extension).

Are you sure about that? If you’re thinking about Max. RSS, it’s not actually the current RSS, but whatever the maximum was at some point in the past, i.e. it can only go up. The RSS from meminfo_procfs also seems to only go up until some memory pressure is applied. I didn’t see RSS go down unless I put memory pressure on some other process, and then it only went down by 1 - 180.680/292.375 = 38%. I was trying to allocate as much memory as I could in a different Julia process (I think I got pretty close to the maximum). Going too high, I got:

julia> A = fill(2, 5000000000)
ERROR: OutOfMemoryError()

slightly lower:
julia> A = fill(2, 4000000000)
Killed

I’m not sure what the kernel does; it could always take memory from you, at least by swapping out the unused part at the end of your memory allocation. I’m not sure if it can look at holes in your heap. I guess it must infer those, i.e. the fragmentation, from your memory usage.

Just after starting Julia:

julia> Sys.maxrss()/1000/1000  # likely because starting up, processing (the default) sysimage uses a lot of mem.:
181.940224

julia> Base.gc_live_bytes()/2^20
9.989258766174316

julia> GC.gc()

julia> Sys.maxrss()/1000/1000  # interesting "mem use" actually went up, but I guess actual max (which is not possible to query?) only temporarily:
216.727552

julia> Base.gc_live_bytes()/2^20
2.956634521484375

RSS is the Resident Set Size and is used to show how much memory is allocated to that process and is in RAM. It does not include memory that is swapped out. […] It does include all stack and heap memory.

VSZ is the Virtual Memory Size. It includes all memory that the process can access, including memory that is swapped out, memory that is allocated, but not used, and memory that is from shared libraries.

Sys.maxrss() uses getrusage (on Linux) according to its help; it actually does ccall(:jl_maxrss, Csize_t, ()), and I see:

In most systems, processes has only two valid values:

RUSAGE_SELF

Just the current process.

RUSAGE_CHILDREN

All child processes (direct and indirect) that have already terminated. 

I suppose RUSAGE_SELF is used, but RUSAGE_CHILDREN might also be good info (on a cluster). Also, this is what I thought was impossible to do (all threads in a process share the memory space, so while they could track their own allocations, I thought the kernel couldn’t know about that; maybe I misunderstand what this is about):

   RUSAGE_THREAD (since Linux 2.6.26)
          Return resource usage statistics for the calling thread.
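
For completeness, a rough sketch of calling getrusage directly via ccall, assuming the 64-bit Linux/glibc layout of struct rusage (two timevals, i.e. 4 Clongs, followed by 14 Clongs, with ru_maxrss the first of them):

const RUSAGE_SELF     =  0
const RUSAGE_CHILDREN = -1

function maxrss_kib(who::Integer)
  ru = zeros(Clong, 18)  # large enough for struct rusage with the layout assumed above
  ccall(:getrusage, Cint, (Cint, Ptr{Clong}), who, ru) == 0 || error("getrusage failed")
  return ru[5]           # ru_maxrss, reported in KiB on Linux
end

maxrss_kib(RUSAGE_SELF)      # should roughly match Sys.maxrss()/2^10
maxrss_kib(RUSAGE_CHILDREN)  # max RSS over already-terminated child processes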

What is this? It has no help string:

gc_total_bytes(gc_num::GC_Num) =
    gc_num.allocd + gc_num.deferred_alloc + gc_num.total_allocd

Is that the total ever allocated?

struct GC_Num
[..]
   total_allocd    ::Int64 # GC internal
[..]
end

gc_num() = ccall(:jl_gc_num, GC_Num, ())

# This type is to represent differences in the counters, so fields may be negative
struct GC_Diff

RssAnon		size of resident anonymous memory

I would like to know what anon(ymous) memory is; it would be good if someone could answer (it’s hard to google for this…). For Julia, PSS is almost as high as RSS, and PSS is split three ways: most of it is Pss_Anon:, which happens to be the same amount as Anonymous: and Private_Dirty:. The next largest part is Pss_File: (slightly higher than Private_Clean:). Maybe it could/should be subtracted from RSS?

Looking at pid of julia:

$ cat /proc/2234972/smaps_rollup