Performance of garbage collection

I am trying to optimize the performance of training a neural network model (using the Flux.jl and Mill.jl packages). I found that a significant amount of time is being spent on garbage collection (when @time-ing individual parts of the script, I often see GC time around 80-90%). As a proof of concept, I overrode Array constructors with static allocation (in the spirit of Alloc.jl, but without using IRTools), which led to a 10-15x speed-up when computing gradients.
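For illustration, here is a minimal sketch of the buffer-reuse idea (this is not the actual override, just a hypothetical pool keyed by element type and size that hands back previously released arrays instead of allocating new ones):

```julia
# Hypothetical buffer pool: reuse released Arrays instead of allocating
# fresh ones, in the spirit of Alloc.jl (greatly simplified).
const POOL = Dict{Tuple{DataType,Dims},Vector{Array}}()

function acquire(T::DataType, dims::Dims)
    stack = get!(POOL, (T, dims), Array[])
    # Reuse a pooled buffer when available, otherwise allocate a new one.
    isempty(stack) ? Array{T}(undef, dims) : pop!(stack)
end

# Return a buffer to the pool so the next acquire of this shape reuses it.
release!(a::Array) = push!(get!(POOL, (eltype(a), size(a)), Array[]), a)

buf = acquire(Float64, (4, 4))
fill!(buf, 0.0)
release!(buf)
buf2 = acquire(Float64, (4, 4))   # same buffer back; no new allocation
```

A real implementation would also need to handle thread safety and buffers escaping their intended lifetime, which is exactly what makes this approach fragile in general code.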

Specifically, I am currently profiling deserialization of minibatches, where a minibatch consists of several wide and deep tree structures. I noticed that when I run the deserialization in a relatively fresh Julia session, the time required for deserialization is roughly constant:

julia> for i in 1:20
         mb = @time deserialize("mb.jls")
       end
  1.010540 seconds (4.67 M allocations: 593.807 MiB, 12.95% gc time)
  0.987930 seconds (4.67 M allocations: 593.807 MiB, 10.46% gc time)
  1.057710 seconds (4.67 M allocations: 593.807 MiB, 16.85% gc time)
  1.188085 seconds (4.67 M allocations: 593.807 MiB, 25.05% gc time)
  1.280059 seconds (4.67 M allocations: 593.807 MiB, 29.30% gc time)
  1.220317 seconds (4.67 M allocations: 593.807 MiB, 27.63% gc time)

However, when I load additional data into memory, garbage collection seems to be much more aggressive, which has negative implications for the runtime:

julia> for i in 1:20
         mb = @time deserialize("mb.jls")
       end
  6.243606 seconds (4.67 M allocations: 593.807 MiB, 85.56% gc time)
  0.920360 seconds (4.67 M allocations: 593.807 MiB)
  5.957967 seconds (4.67 M allocations: 593.807 MiB, 84.44% gc time)
  0.949225 seconds (4.67 M allocations: 593.807 MiB)
  5.960878 seconds (4.67 M allocations: 593.807 MiB, 84.32% gc time)
  0.949110 seconds (4.67 M allocations: 593.807 MiB)

It seems that in the second case, the garbage collection time is significantly higher than the time needed to read and construct all the data structures. Note that in both cases, there is plenty of free memory in the system (> 200 GB), so garbage collection should not be necessary. The code was run on Julia 1.3.1.

Could you please give me some hints on how to debug and avoid such situations? Also, are there any options to tweak garbage collection (e.g., similar to the -Xmx and -Xms parameters) that would make garbage collection less aggressive? (In Details about Julia's Garbage Collector, Reference Counting? the answer was negative, but I am not sure whether something has changed in the meantime.)
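For what it's worth, one way to make such timings comparable across runs is to force a full collection before each iteration, so each @time reflects only the GC pressure generated by the deserialization itself. A self-contained sketch (using a throwaway file as a stand-in for mb.jls):

```julia
using Serialization

# Stand-in file so the sketch is runnable (mb.jls in the original post).
path = tempname()
serialize(path, rand(10^5))

for i in 1:3
    GC.gc(true)                 # force a full collection before timing
    @time deserialize(path)
end

mb = deserialize(path)
rm(path)
```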

Thank you in advance!


I am also having the same issue, for a similar reason. I am loading and preprocessing a large amount of data on a high-memory machine in a complicated, multistep process. Despite a low allocation rate, the data loading is hampered by nasty multi-second GC pauses. Ironically, it is likely the intermittent preallocation of huge arrays throughout the computation (done specifically to avoid allocations during various subcomputations) that is triggering the full GC runs.

It’s a 256 GB machine, with total allocations of about 1.5 TB spread over the 25-minute computation, but the maximum resident size is only 80 GB (most of which is a 50 GB cache of 25 × 2 GB arrays used to circumvent some recomputations). So a full GC does not need to run often, but instead it’s eating about 3 s out of every 7 s.

Alas, Julia’s GC params are hard-coded. Some ability to tune GC would be very helpful!


Yeah, this is a known issue. One thing that might be worth trying is disabling the GC when you know you aren’t likely to run out of memory.
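A minimal sketch of that approach, wrapping an allocation-heavy region with GC.enable (the helper name and workload are made up for illustration):

```julia
# Sketch: run an allocation-heavy region with the GC switched off, then
# re-enable it and collect once. Only safe when memory headroom is ample.
function with_gc_disabled(f)
    GC.enable(false)
    try
        return f()
    finally
        GC.enable(true)
        GC.gc()                 # pay the collection cost once, at a point we choose
    end
end

result = with_gc_disabled() do
    sum(abs2, rand(10^6))       # stand-in for the real workload
end
```

GC.enable returns the previous state, so a more careful version could save and restore it to support nesting.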

Is your code multithreaded, by any chance?

Yes, it’s very heavily performance tuned, using threads. That’s basically how I can see the GC pauses: when the computation only eats 1 CPU as seen in htop.

The small individual pieces of the computation preallocate and don’t trigger GC, but then there is a hierarchy of in-between steps of copying large chunks of data, or re-preallocating for the next subcomputation, and I think that’s what’s causing trouble.

Surprisingly, that doesn’t seem to help.

Normal GC:
1545.059279 seconds (247.87 M allocations: 1.442 TiB, 43.71% gc time)

GC selectively off for computation subparts:
1715.983475 seconds (247.65 M allocations: 1.442 TiB, 45.75% gc time)
