Strange GC issue on Julia 1.12-rc1

I am happy that I can now use multi-threading on Julia 1.12-rc1. I had to make some changes to my code, but now it works.

But the garbage collector behaves differently. I am using two different settings.
First setting:

julia --project -t 11 --gcthreads=4,1 

This works well on Julia 1.11, but my code crashes with these settings on Julia 1.12-rc1.

Second setting:

julia --project -t 11,1 -i --gcthreads=4,0

This works well on Julia 1.12, but is very slow on Julia 1.11 (158 s instead of 82 s).
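
To double-check what these flags actually give you at runtime, you can query the standard Threads API (generic Julia, nothing specific to my code):

@show Threads.nthreads()               # compute threads in the default pool (-t 11)
@show Threads.nthreads(:interactive)   # interactive threads (the ",1" in -t 11,1)
@show Threads.ngcthreads()             # GC threads: mark threads plus concurrent sweep threads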

What could be the reason?

You can reproduce these results by running:

git clone https://github.com/ufechner7/FLORIDyn.jl.git
cd FLORIDyn.jl
git checkout 14fcb0e5ad3216aef2715ccc325a58d7a449e4b1
./bin/update_packages
./bin/run_julia

and then in Julia:

include("examples/main_video.jl")

The script needs about 82 s on a Ryzen 7950X, both on Julia 1.11 and Julia 1.12-rc1. The startup time of Julia 1.12-rc1 is about 3 s higher. The run_julia script automatically switches between the two GC options, based on the Julia version. If you want to reproduce the crash, change the script first. The crash happens on my machine after about half of the simulation (at about 500 s of simulation time, 40 s of wall-clock time).

On v1.12.0-rc1 I get a precompilation error:

ERROR: LoadError: UndefVarError: `tmpM` not defined in `FLORIDyn`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
  [1] runFLORIDyn(plt::Nothing, set::FLORIDyn.Settings, wf::FLORIDyn.WindFarm, wind::FLORIDyn.Wind, sim::FLORIDyn.Sim, con::FLORIDyn.Con, vis::FLORIDyn.Vis, floridyn::FLORIDyn.FloriDyn, floris::FLORIDyn.Floris, pff::Nothing; msr::FLORIDyn.MSR)
    @ FLORIDyn ~/FLORIDyn.jl/src/floridyn_cl/floridyn_cl.jl:1036
  [2] runFLORIDyn
    @ ~/FLORIDyn.jl/src/floridyn_cl/floridyn_cl.jl:1004 [inlined]
  [3] runFLORIDyn(plt::Nothing, set::FLORIDyn.Settings, wf::FLORIDyn.WindFarm, wind::FLORIDyn.Wind, sim::FLORIDyn.Sim, con::FLORIDyn.Con, vis::FLORIDyn.Vis, floridyn::FLORIDyn.FloriDyn, floris::FLORIDyn.Floris)
    @ FLORIDyn ~/FLORIDyn.jl/src/floridyn_cl/floridyn_cl.jl:1004
...

It’s probably this line:

a = @allocated tmpM, wf = setUpTmpWFAndRun(set, wf, floris, wind)

Something about the @allocated macro has changed, so a begin ... end is required:

a = @allocated begin tmpM, wf = setUpTmpWFAndRun(set, wf, floris, wind) end

I’ll open an issue on GitHub for it. (EDIT: The `@allocated` macro differs between v1.11 and v1.12.0-rc1 · Issue #59256 · JuliaLang/julia · GitHub)

Otherwise, the run I did on v1.11 indicates that the code is very allocation-intensive. This is known to cause performance problems for parallel programs, and any minor change in the garbage collector can have a major impact.

EDIT: The begin ... end will not allow tmpM and wf to be set.

Even better, put the @allocated in front of the function call instead of the assignment? The latter looks weird anyway.

The problem is to set tmpM and wf, and simultaneously obtain the number of bytes allocated. In 1.11, @allocated expanded to a sequence of expressions without an extra scope. In 1.12 the expression is put inside a closure, so assignments made inside it no longer reach the enclosing scope.
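
One possible workaround (just a sketch, not tested against FLORIDyn): keep the assignment of the outer variables out of the measured expression by writing the result into a Ref, which also works when the expression is wrapped in a closure:

result = Ref{Any}()
a = @allocated (result[] = setUpTmpWFAndRun(set, wf, floris, wind))
tmpM, wf = result[]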

Please, let’s not discuss the @allocated issue in this thread. I created a separate thread for that issue: @allocated does not work on Julia 1.12rc1. To reproduce the GC issue, do

git checkout 14fcb0e5ad3216aef2715ccc325a58d7a449e4b1

Ah, I somehow had failed to check out the right version. It runs now.

To avoid cluttering my screen, I switched off intermediate display of results (vis.show_plots = false in main_video.jl).

I am not able to reproduce the crash with --gcthreads=4,1 on 1.12, so I have no idea what this is.

As for the difference in performance, I notice that the run spends 20-40% of its time in GC. As far as I know, this is the actual time spent in GC, not including the time tasks spend waiting to “join” the GC (all tasks must pause while the GC runs), so the real GC overhead can easily be 30-50% or more. I am not familiar with the details of the GC; it may have been changed or tuned differently between 1.11 and 1.12. Since the code relies heavily on the GC, any tuning change (like changing gcthreads from 4,1 to 4,0) may have a large impact.
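
If you want to inspect this yourself, the standard tools give a rough picture (this is just the generic Julia API, nothing specific to FLORIDyn):

GC.enable_logging(true)                    # print a line to stderr for each collection, with its pause time
@time include("examples/main_video.jl")    # the @time summary includes the "% gc time" figure
GC.enable_logging(false)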

What if it crashes because of GC triggered by plotting-related memory?

The plotting happens in a different process. If you run the program multi-threaded using ./bin/run_julia, ControlPlots is not even loaded in the primary process.

I know. My code is a good test case for the GC, I think. I am trying to reduce the allocations, but each time I tried to pre-allocate more vectors/arrays, the results of the simulation changed, so my progress on this front is limited. I need better unit tests.

Ah, I see. The code is quite complicated. I think I would first have tried Bumper.jl to reduce the number of temporary array allocations within the innermost loops and various low-level functions.
I.e., instead of e.g.

for iT in 1:(nT-1)
    ...
    fw = .!nw
    ...
end

you do

@no_escape for iT in 1:(nT-1)
    ...
    fw = @alloc(Bool, length(nw))
    fw .= .!nw
    ...
end

This change is local to the loop, so it’s fairly easy to ensure there is no semantic change. If needed, it can be done one array at a time. However, arrays allocated in this way cannot be used outside the @no_escape block. The memory is not registered with the GC; it simply goes away when the @no_escape block finishes and is reused for other @allocs. Arrays which are returned from a function can’t be allocated in this way, but they can be allocated by the caller with @alloc and passed to the function for filling in, if possible.

It’s a special array type UnsafeArray, so you can’t call a function which explicitly needs Array, but if it can take a DenseArray or AbstractArray it’ll work fine. I.e. most array functions from Base and broadcasting will work just like before.

There is potentially a huge speedup from reducing allocations in this way, in particular in parallel code; usually much more than what “time spent in GC” indicates. There’s an example in Thread-Safe Storage · OhMyThreads.jl

EDIT: The @no_escape should be inside the for loop:

for ...
    @no_escape begin
    ...
    end
end

Otherwise, the @allocs will accumulate over the whole for loop, potentially running out of memory.
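
Putting the pieces together, here is a minimal self-contained sketch of that pattern (hypothetical names, not taken from FLORIDyn):

using Bumper

function count_false(masks::Vector{Vector{Bool}})
    total = 0
    for nw in masks
        @no_escape begin
            fw = @alloc(Bool, length(nw))   # bump-allocated; freed when the block ends
            fw .= .!nw
            total += count(fw)
        end
    end
    return total
end

count_false([rand(Bool, 1000) for _ in 1:10])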
