GC occurs at the worst time in tight loop (Garbage Collection)

I have Julia code which I’ve compiled into a C library. We intend to run this on a real-time system to test some hardware. When I benchmark the function call in Julia using BenchmarkTools, the @benchmark results look fairly good: 6 µs ± 12 µs execution time, with 4% ± 2.2% of it spent in garbage collection. But the maximum time can be 592 µs, with 98.17% GC! I’m not sure if that is just the first run or something… very odd.

I’ve done a good bit of whittling at the 4.12 kB / 68 allocations to try to get rid of GC altogether. My approach is to keep module-level const Ref{}'s that hold onto reusable mutable structures and databases, which are then passed into the functions. I also have some composite types which I don’t think should allocate but still do, so I’m trying to understand that as well.
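
Roughly the pattern, as a sketch with placeholder names (not my real code):

module MyLib   # placeholder module, just to illustrate the Ref{} pattern

struct LookupDB
    data::Array{Float64,4}   # stand-in for the 4D table of composite types
end

# Module-level const Refs hold the reusable state so the hot functions can
# reach it without allocating anything new per call.
const database   = Ref{LookupDB}()
const workbuffer = Ref{Vector{Float64}}()

function init!(db::LookupDB, n::Integer)
    database[]   = db
    workbuffer[] = zeros(n)
    return nothing
end

end # module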

My main question: one of the data sources being accessed is a look-up table stored as a 4D array of non-mutable composite types. Even though this data is never written to in the high-rate loop, garbage collection always runs when it is accessed, according to the profiler flame chart from view_profile().

It is being triggered by line 14 in essentials.jl

eval(:(getindex(A::Array, i1::Int, i2::Int, I::Int...) = (@inline; arrayref($(Expr(:boundscheck)), A, i1, i2, I...))))

Is this expected? I access the table 4 to 8 times in one execution, and according to the profiler, garbage collection shows up on every access, right in a row, every time.

The access on my end looks something like this:

for i = 1:N
    val[i] = myfunc(data[i,j,k,f], data[i,j+1,k,f], data[i,j+1,k+1,f], data[i,j,k+1,f], ...)
end

Why is that? When I had those direct data accesses wrapped in an inline function, I could see garbage collection happening on each one of them equally. Now that I have changed to this format the total time is the same, so I think the same 4 GC runs are still happening on each function call.

Do array accesses always trigger the garbage collection? Is the Reference hiding something and the garbage collector is confused? Is this expected?

The 5 µs average isn’t a big deal, so maybe it isn’t a problem for the GC to do this every single time an array is read, but that 500 µs hit is what concerns me.

Should I restructure my database so that the first dimension, the one being looped over, is stored as a static array? The last dimension never changes inside this function; should I move it out, making a vector of 2D objects that contain StaticArrays of composite types? The database doesn’t change after initial loading.
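
To make concrete what I’m picturing (the types and sizes below are placeholders, not my real data):

using StaticArrays

struct TablePoint          # placeholder for my non-mutable composite element
    a::Float64
    b::Float64
end

# Today: one Array{TablePoint,4}, indexed as data[i, j, k, f].
# Idea: split off the f dimension (it never changes inside the function) and
# make the small looped i dimension static, so each access pulls one
# fixed-size, inline-stored chunk.
const NI = 4                                            # placeholder size of the looped dimension
blocks = [Matrix{SVector{NI,TablePoint}}(undef, 8, 8) for f in 1:3]
# access becomes blocks[f][j, k][i] instead of data[i, j, k, f]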

Any ideas to mitigate that 100x increase spike in the GC?

Also, any info on why composite types allocate on the heap instead of the stack would be great, or how to profile in a way that shows why, would be appreciated.

Best regards,
Allan Baker

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 12 on 12 virtual cores
Environment:
  JULIA_DIR = C:\Users\username\AppData\Local\Programs\Julia-1.9.2
  JULIA_PKG_DEVDIR = D:\JULIA\DEVELOPMENT
  JULIA_PKG_USE_CLI_GIT = true
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

One other thing to note, in case it matters: the data array is actually part of a (non-mutable) struct, and the struct object is what is placed into the Ref{}. The array inside the struct is technically still mutable, but I don’t know any other way to do it. It’s big, so it needs to be on the heap.

There are some functions for manually controlling the garbage collector.

https://docs.julialang.org/en/v1/base/base/#Base.GC.gc
https://docs.julialang.org/en/v1/base/base/#Base.GC.enable

3 Likes

Have you checked with JET.jl whether any runtime dispatches happen?

1 Like

any info on why composite types allocate off the heap instead of the stack, would be great

I don’t know; do you mean structs (or Unions)? Anyway, you may want to look into Bumper.jl. Allocations with it are special, and not a problem if it fits your use case.

No, array reads should never do that, and neither should writes, for regular dense arrays. Maybe for sparse or some specialized array types? If this is happening, there is some extra logic going on that you’re not seeing.

I misread at first thinking you asked if allocations always trigger GC, and to answer that:

Always potentially, I believe: the GC logic is then invoked, but it may return very quickly with no actual collection. Conversely, a collection never happens if you do not allocate (at least in the current implementation, and I don’t think that has changed; unless some other thread is allocating behind your back). That is why you want to get rid of all allocations in the main loop, at least for (hard) real-time.

It seems you already know how, and there are macros in Julia that help, but you may also want to know about an upcoming package:

] add https://github.com/BloodWorkXGaming/TestNoAllocations.jl

It may change its name (and add functionality) since I suggested that, but you could use it as is. I’ve not tested it; it was about to be registered and I blocked it, not because I believe there’s anything wrong with it, but rather just because of the name.
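
For a quick ad-hoc check without any package, the built-in macros already go a long way (step! and state below are placeholders for your own hot-path call and its preallocated state):

step!(state)                             # warm up so compilation allocations are not counted
@assert @allocated(step!(state)) == 0    # zero bytes allocated => it can never trigger a collection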

I see one other suggestion above that may (or may not) help; I can imagine it sometimes helping…:

I think disabling the GC should be a last resort. I don’t know of any good use for doing that in real-time code (or otherwise). I would rather use Bumper.jl, if it applies, and if it doesn’t, I doubt disabling the GC is a valid option for you. See the docs:

Disabling garbage collection should be used only with caution, as it can cause memory use to grow without bound.

And then it will eventually fail, and before that you will be swapping slowly, which is even worse for real-time. The GC will also be slower when you do enable it again, since it needs to catch up, and if you never re-enable it you will run out of memory. If you only disable it for a bounded stretch, e.g. before your loop starts, then no GC would have happened anyway as long as you do not allocate in that loop, so disabling it would not have mattered.

That might be your problem. EDIT: immutables are actually the good case (and the default for structs), and mutable structs are not; what follows is explained correctly except that I had it backwards…:

If you mean you have an array of non-mutable structs (the default kind), then implicitly your array has pointers to them, and writing to the array will allocate and will give the illusion that you’ve thrown away the old data (the GC actually does it later), i.e. it triggers allocation, and thus GC, since you implicitly asked for that.

When I wrote that you can write to an array without GC, there’s a caveat. Replacing the pointer in the array, i.e. writing to the array itself, doesn’t trigger GC. This would be more obvious in languages like C or Zig. You usually do not see malloc in Julia, since it’s implied, as in this case, for the new thing you’re pointing to.

You can construct (an array of) [immutable] structs, and if from some point on you do nothing other than read from it, that kind of access is ok for real-time.

OR you can use mutable structs and they will be inlined in your array and can be overwritten. That’s what I had in mind when I said accesses are ok.

Now immutability is great for correctness (e.g. in Clojure), but it DOES put more pressure on the GC. Mutability IS bad for correctness, and shouldn’t be the default in a language, but it has its uses, e.g. for (hard) real-time.

For soft real-time you can tolerate some GC/allocation pauses (actually heap allocations are a problem, not just deallocations and GC).

You can trigger a (full) GC intentionally if there are times when it applies or when the system is less sensitive. You can also use Libc.malloc and Libc.free directly (i.e. with no GC involvement), if that applies; likely more so for soft real-time.
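
For the Libc route, a minimal sketch (the size is arbitrary):

n   = 8
buf = Libc.malloc(n * sizeof(Float64))   # raw C buffer, invisible to the GC
p   = convert(Ptr{Float64}, buf)
unsafe_store!(p, 3.14, 1)                # write the first element
x   = unsafe_load(p, 1)                  # read it back
Libc.free(buf)                           # you own the lifetime; free it yourself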

The only “allocation gotcha” I know of when accessing arrays is inadvertently creating an array in the process… for example, whenever you have a slice like i:i+3 involved. This is usually solved by using either @view or SA (from StaticArrays).
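
For example (the slice version allocates a fresh Array on every call; the other two should not, once compiled):

using StaticArrays

f_slice(x, i)  = sum(x[i:i+3])                                   # x[i:i+3] materializes a new Array
f_view(x, i)   = sum(@view x[i:i+3])                             # SubArray view, no copy
f_static(x, i) = sum(SVector{4}(x[i], x[i+1], x[i+2], x[i+3]))   # fixed-size, stack-friendly

x = rand(100)
f_slice(x, 1); f_view(x, 1); f_static(x, 1)                      # warm up (compile) first
@allocated f_slice(x, 1)                                         # > 0 on every call
@allocated f_view(x, 1)                                          # typically 0 once compiled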

As @Palli has mentioned, the state of the art in stack-like allocation for Julia is Bumper.jl. You can check out my benchmark in the discussion on the repo. It should be just as fast as stack allocation in C, at least on Mason’s computer (I still have to re-run my test on the latest Julia version and figure out if there could be performance differences between AMD and Intel), but it is very fast and avoids the GC.
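
If I remember the API correctly (check the README, this is just a sketch), usage looks roughly like:

using Bumper

function scaled_sum(x)
    @no_escape begin                      # everything @alloc'd here is freed when the block exits
        y = @alloc(eltype(x), length(x))  # bump-allocated scratch buffer, not GC-tracked
        y .= 2 .* x
        sum(y)                            # the result must not escape the block
    end
end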

1 Like

What I am doing if I cannot avoid allocations completely is to call the GC like this:

        t_sim = @elapsed KiteModels.next_step!(kps4, integrator, v_ro=v_ro, dt=dt)
        if t_sim < 0.3*dt
            t_gc_tot += @elapsed GC.gc(false)
        end

Full example: https://github.com/aenarete/KiteSimulators.jl/blob/main/examples/autopilot.jl

It means I have a main loop that has to be executed exactly every 50 ms. If more than 70% of that time is left after doing my work, then I do a partial garbage collection.
(The time that a simulation step needs varies a lot, so sometimes I have time for some garbage collection…)

2 Likes

This is some very good information. I had not heard of Bumper.jl.

Another thing I notice: in the Base.@ccallable functions where I pass in a Ptr{Cdouble} so an array of values can be returned from the function, I create a Julia array equivalent with unsafe_wrap, which I thought was not supposed to allocate any memory, yet it does. The memory should already be allocated by the calling function. In the code below, the two unsafe_wrap lines both allocated 64 bytes; the arrays passed in were 8 doubles long, so 64 bytes each.

This seems contrary to everything I’ve seen written. Do I need to update to 1.9.3? Is this a 1.9.2 bug or something? I got these numbers by starting Julia from the command line with allocation tracking enabled for user code.

Base.@ccallable function julia_get_mag_phase(cmag::Ptr{Cdouble}, cphase::Ptr{Cdouble}, len::Csize_t)::Cint
    if isCalculationValid[]
        try
            outlen = length(outputRef_db_mag[])
            if outlen == len
                mag = unsafe_wrap(Array, cmag, (len,))
                phase = unsafe_wrap(Array, cphase, (len,))
                copy!(mag, outputRef_db_mag[])
                copy!(phase, outputRef_deg_phase[])
            else
                @error "Pointer to Array of length $outlen is required! Function called with len=$len."
                return Cint(1)
            end
        catch err
            @error "Problem copying arrays from Julia to C: $(err.msg)"
            Base.invokelatest(Base.display_error, Base.catch_stack())
            return Cint(1)
        end
        return Cint(0)
    else
        @error "Results need recalculating first."
        return Cint(1)
    end
end

No, these are not expected if you are only accessing the indices. That may point to type instabilities, if the return type of the getindex call cannot be inferred.

2 Likes

Composite types that contain mutable fields (without a fixed size) are likely heap-allocated, or at least will contain a reference to a heap-allocated object. If their complete size is known at compile time, they will probably be fully stack-allocated.
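
A quick way to check is isbitstype (illustrative types):

struct Point                 # all fields are plain bits with a fixed size
    x::Float64
    y::Float64
end

struct Track                 # holds a Vector, whose size is not known statically
    id::Int
    samples::Vector{Float64}
end

isbitstype(Point)   # true  -> can live on the stack and inline in arrays
isbitstype(Track)   # false -> carries a reference to heap-allocated data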

That’s why I suggested JET.

1 Like

We still have to allocate the wrapper Array around your pointer that contains the information about dims.

This may be better after the Memory PR lands.
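
In the meantime, for a pure copy-out like the one above, one way to sidestep the wrapper is to go straight through the pointers (a sketch, not taken from your code):

function copy_out!(dest::Ptr{Cdouble}, src::Vector{Float64}, len::Csize_t)
    length(src) == len || return Cint(1)
    # GC.@preserve keeps src rooted while its raw pointer is in use.
    GC.@preserve src unsafe_copyto!(dest, pointer(src), Int(len))
    return Cint(0)
end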

2 Likes

That’s backwards. Immutable struct data is stored inline in the array — the elements are not individually heap-allocated. Mutable structs, on the other hand, are heap-allocated and are stored as pointers in an array (this is necessary for mutable semantics).
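
You can see the difference in the raw element storage (illustrative types, sizes on a 64-bit machine):

struct P; x::Float64; y::Float64; end           # immutable, isbits
mutable struct M; x::Float64; y::Float64; end   # mutable, heap-allocated

v_imm = [P(1.0, 2.0) for _ in 1:10]
v_mut = [M(1.0, 2.0) for _ in 1:10]

sizeof(v_imm)   # 160: ten 16-byte structs stored inline
sizeof(v_mut)   # 80:  ten 8-byte pointers, each M in its own heap object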

4 Likes

Will it make unsafe_wrap allocation free?
In general, is there a way to get something like unsafe_wrap without allocation, and without manually defining an interface using something like unsafe_store!?

1 Like

Is that right? The array itself is mutable, meaning that a[1] can be changed. How could it change that value if it were an immutable struct? Instead I’d expect a[1] to be a pointer, and if you want to change the value, you allocate a new immutable struct and point the array entry at it.

On the other hand, if the array is of mutable structs, then it’s no big deal to have that data stored inline in the array; if you change it, that’s ok since it’s mutable.

In other words it makes sense to me that arrays of mutable structs can be allocated inline but arrays of immutable structs wind up as pointers to immutable values on the heap.

All primitive types and compositions of primitive types are equivalent here (with respect to both mutability and memory layout/size being statically known). As the name says, they’re “just bits”.

Primitive types are also immutable, and yet a variable holding a primitive type can be reassigned any other value (as long as it’s not const):

julia> a = 2  # primitive
2

julia> a = missing  # immutable `struct`, even if not primitive
missing

julia> a = 3  # primitive again
3

Yes, but if a is an array of immutable structs, how does

a[1] = Mystruct(1,"foo",2.0)

get implemented, other than by a secretly being an array of pointers, the new Mystruct being heap-allocated, and a[1] then being updated to point to the new struct?

Otherwise Julia would have to prove that there’s nothing like

b = Ref(a[1])

so that it could mutate the immutable struct in place without the mutation “leaking”.

Again, primitive types are equivalent here, and yet no one thinks that Vector{Int8} holds a pointer for each Int8 value.

Just a note: this isn’t “plain data”/“isbits”, because String is mutable. This is actually explicitly noted in the unreleased docs.

When you call a[1], you’re copying (at least logically) the element of a at index 1, because the elements are immutable/plain data. Thus Ref(a[1]) is equivalent to:

v = a[1]  # copy
Ref(v)    # no relation to `a`
1 Like

Logically, yes, but when the struct holds 37 megabytes of data (for example it contains a string with War and Peace in it), one hopes not actually. This is the difference between isbits and not isbits. Though I suppose the struct probably just has a pointer to the string… My assumption was that, to make for an efficient implementation, you’d use a pointer, which is easy to copy, so you could do a for loop over i.

for i in 1:10
    d = a[i]
    stuff = summarize(d)
    println(stuff)
end

Without actually copying all the data in a[i]. For example suppose it has 1000 fields and each field is a tuple of 500 numbers, or whatever.

If the struct isbits and is small, or the array is of floats or whatever, it makes perfect sense to just copy the value. If the value is thousands of bytes, it makes more sense to use a pointer I’d think.