GC occurs at the worst time in tight loop (Garbage Collection)

That’s not really surprising. Let’s think through what is happening.

We are in a top-level/REPL context. That means we are in the interpreter.
The interpreter works over boxed values.

julia> @code_llvm x+y
;  @ REPL[2]:1 within `+`
define void @"julia_+_172"([3 x double]* noalias nocapture noundef nonnull sret([3 x double]) align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %2) #0 {
top:
; ┌ @ Base.jl:37 within `getproperty`
   %3 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 2
   %4 = getelementptr inbounds [3 x double], [3 x double]* %2, i64 0, i64 2
; └
;  @ REPL[2]:1 within `+` @ float.jl:409
  %unbox4 = load double, double* %3, align 8
  %unbox5 = load double, double* %4, align 8
  %5 = fadd double %unbox4, %unbox5
  %6 = bitcast [3 x double]* %1 to <2 x double>*
  %7 = load <2 x double>, <2 x double>* %6, align 8
  %8 = bitcast [3 x double]* %2 to <2 x double>*
  %9 = load <2 x double>, <2 x double>* %8, align 8
  %10 = fadd <2 x double> %7, %9
;  @ REPL[2]:1 within `+`
  %11 = bitcast [3 x double]* %0 to <2 x double>*
  store <2 x double> %10, <2 x double>* %11, align 8
  %newstruct.sroa.3.0..sroa_idx7 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 2
  store double %5, double* %newstruct.sroa.3.0..sroa_idx7, align 8
  ret void
}

No allocation in sight! But the interpreter really can’t work with raw values, and this function works with raw values. Julia generates two functions with two different ABIs:

julia> @code_llvm dump_module=true x+y
; ModuleID = '+'
source_filename = "+"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

;  @ REPL[2]:1 within `+`
define void @"julia_+_189"([3 x double]* noalias nocapture noundef nonnull sret([3 x double]) align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %2) #0 {
top:
; ┌ @ Base.jl:37 within `getproperty`
   %3 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 2
   %4 = getelementptr inbounds [3 x double], [3 x double]* %2, i64 0, i64 2
; └
;  @ REPL[2]:1 within `+` @ float.jl:409
  %unbox4 = load double, double* %3, align 8
  %unbox5 = load double, double* %4, align 8
  %5 = fadd double %unbox4, %unbox5
  %6 = bitcast [3 x double]* %1 to <2 x double>*
  %7 = load <2 x double>, <2 x double>* %6, align 8
  %8 = bitcast [3 x double]* %2 to <2 x double>*
  %9 = load <2 x double>, <2 x double>* %8, align 8
  %10 = fadd <2 x double> %7, %9
;  @ REPL[2]:1 within `+`
  %11 = bitcast [3 x double]* %0 to <2 x double>*
  store <2 x double> %10, <2 x double>* %11, align 8
  %newstruct.sroa.3.0..sroa_idx7 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 2
  store double %5, double* %newstruct.sroa.3.0..sroa_idx7, align 8
  ret void
}

; Function Attrs: noinline optnone
define nonnull {}* @"jfptr_+_190"({}* %function, {}** noalias nocapture noundef readonly %args, i32 %nargs) #1 {
top:
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"()
  %tls_ppgcstack = getelementptr i8, i8* %thread_ptr, i64 -8
  %0 = bitcast i8* %tls_ppgcstack to {}****
  %tls_pgcstack = load {}***, {}**** %0, align 8
  %sret = alloca [3 x double], align 8
  %1 = getelementptr inbounds {}*, {}** %args, i32 0
  %2 = load {}*, {}** %1, align 8
  %3 = bitcast {}* %2 to [3 x double]*
  %4 = getelementptr inbounds {}*, {}** %args, i32 1
  %5 = load {}*, {}** %4, align 8
  %6 = bitcast {}* %5 to [3 x double]*
  call void @"julia_+_189"([3 x double]* noalias nocapture noundef sret([3 x double]) %sret, [3 x double]* nocapture readonly %3, [3 x double]* nocapture readonly %6)
  %7 = bitcast {}*** %tls_pgcstack to {}**
  %current_task = getelementptr inbounds {}*, {}** %7, i64 -14
  %ptls_field = getelementptr inbounds {}*, {}** %current_task, i64 16
  %ptls_load = load {}*, {}** %ptls_field, align 8
  %ptls = bitcast {}* %ptls_load to {}**
  %8 = bitcast {}** %ptls to i8*
  %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc(i8* %8, i32 1184, i32 32) #7
  %9 = bitcast {}* %box to i64*
  %10 = getelementptr inbounds i64, i64* %9, i64 -1
  store atomic i64 140457949127632, i64* %10 unordered, align 8
  %11 = bitcast {}* %box to i8*
  %12 = bitcast [3 x double]* %sret to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %11, i8* %12, i64 24, i1 false)
  ret {}* %box
}

The job of jfptr_+_190 is to translate from boxed values to raw values and back, and indeed for the return value it needs to call

%box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc(i8* %8, i32 1184, i32 32) #7

to create a box of 32 bytes.

That’s the 32 bytes @allocated reports. Why 32? It’s an 8-byte type tag + 3×8 bytes of data.
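As a quick sanity check (x here is the same three-Float64 struct from the REPL session above):

julia> sizeof(typeof(x))
24

Add the 8-byte type tag that every heap-allocated object carries and you get the 32 bytes reported.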

julia> function dostuff(x1,x2)
               y = x1+x2
               return y.a+y.b+y.c
       end
julia> @allocated dostuff(x, y)
16

8-byte tag + 8-byte value (the Float64 that dostuff returns).

Julia generally has two calling conventions (actually a few more, but only these two matter here): one “fast”, working directly with raw values, and one “slow”, working with boxed values.

This second one is the one I referred to in I can't help but think debugging memory allocations shouldn't be this hard - #18 by vchuravy

If we don’t have enough information locally (because one of the arguments is type-unstable) we will generally fall back to a slower calling convention and that can cause allocations at the call site (and for the return value).
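Here is a minimal sketch of that effect (Vec3, addthem, and the globals below are hypothetical stand-ins, not code from this thread): an argument read from a non-const global has unknown type at the call site, so the call goes through the boxed convention, while the fully inferred call does not.

struct Vec3
    a::Float64
    b::Float64
    c::Float64
end
Base.:+(p::Vec3, q::Vec3) = Vec3(p.a + q.a, p.b + q.b, p.c + q.c)

const cg = Vec3(1.0, 2.0, 3.0)   # const global: type known to the compiler
ug = Vec3(4.0, 5.0, 6.0)         # non-const global: type unknown until runtime

addthem(p, q) = (s = p + q; s.a + s.b + s.c)   # returns a plain Float64

stable()   = addthem(cg, cg)   # fully inferred call
unstable() = addthem(cg, ug)   # dynamic dispatch through the boxed convention

Measured with BenchmarkTools, @ballocated stable() should report 0, while @ballocated unstable() should report a nonzero number of bytes per call.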


Ok. I was wondering if something like that was the problem with my MWE.
I’ll stick it in a module and do another test to see if I can recreate the problem when I get back to my computer.

I’m wondering if the size of the static struct stored in my array is the problem. Each one is basically 3 complex numbers and a float, stored as Float64s. I wonder if it would do something different if I used Float32s.

#allocation test

module allocation_test

    struct mystruct1{T<:Real}
        a::T
        b::T
        c::T
    end

    @inline Base.:+(a::mystruct1{T}, b::mystruct1{T}) where T<:Real = mystruct1{T}(a.a+b.a, a.b+b.b, a.c+b.c)

    const f = Ref{Array{mystruct1{Float64}}}()

    f[] = [mystruct1{Float64}(i,j,k) for i=1:10, j=1:10, k=1:3];

    function test1(x::mystruct1{T}, y::mystruct1{T}) where T<:Real
        x+y
    end

    function test2(x::mystruct1{T}, y::mystruct1{T}) where T<:Real
        x.a*y.a + x.b*y.b + x.c*y.c
    end

end


julia> @allocated g.test1(g.f[][1,1,1],g.f[][2,2,2])
96

julia> @code_llvm g.test1(g.f[][1,1,1],g.f[][2,2,2])
;  @ C:\mypath\allocation_test.jl:18 within `test1`
; Function Attrs: uwtable
define void @julia_test1_3144([3 x double]* noalias nocapture noundef nonnull sret([3 x double]) align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %2) #0 {
top:
;  @ C:\mypath\allocation_test.jl:19 within `test1`
; ┌ @ C:\mypath\allocation_test.jl:11 within `+` @ float.jl:408
   %3 = bitcast [3 x double]* %1 to <2 x double>*
   %4 = load <2 x double>, <2 x double>* %3, align 8
   %5 = bitcast [3 x double]* %2 to <2 x double>*
   %6 = load <2 x double>, <2 x double>* %5, align 8
   %7 = fadd <2 x double> %4, %6
; │ @ C:\mypath\allocation_test.jl:11 within `+`
; │┌ @ Base.jl:37 within `getproperty`
    %8 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 2
    %9 = getelementptr inbounds [3 x double], [3 x double]* %2, i64 0, i64 2
; │└
; │ @ C:\mypath\allocation_test.jl:11 within `+` @ float.jl:408
   %10 = load double, double* %8, align 8
   %11 = load double, double* %9, align 8
   %12 = fadd double %10, %11
; └
  %13 = bitcast [3 x double]* %0 to <2 x double>*
  store <2 x double> %7, <2 x double>* %13, align 8
  %.sroa.3.0..sroa_idx2 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 2
  store double %12, double* %.sroa.3.0..sroa_idx2, align 8
  ret void
}
julia> @code_llvm g.test2(g.f[][1,1,1],g.f[][2,2,2])
;  @ C:\mypath\allocation_test.jl:22 within `test2`
; Function Attrs: uwtable
define double @julia_test2_3227([3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1) #0 {
top:
;  @ C:\mypath\allocation_test.jl:23 within `test2`
; ┌ @ Base.jl:37 within `getproperty`
   %2 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 0
   %3 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 0
; └
; ┌ @ float.jl:410 within `*`
   %4 = load double, double* %2, align 8
   %5 = load double, double* %3, align 8
   %6 = fmul double %4, %5
; └
; ┌ @ Base.jl:37 within `getproperty`
   %7 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 1
   %8 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 1
; └
; ┌ @ float.jl:410 within `*`
   %9 = bitcast double* %7 to <2 x double>*
   %10 = load <2 x double>, <2 x double>* %9, align 8
   %11 = bitcast double* %8 to <2 x double>*
   %12 = load <2 x double>, <2 x double>* %11, align 8
   %13 = fmul <2 x double> %10, %12
; └
; ┌ @ operators.jl:578 within `+` @ float.jl:408
   %14 = extractelement <2 x double> %13, i64 0
   %15 = fadd double %6, %14
   %16 = extractelement <2 x double> %13, i64 1
   %17 = fadd double %15, %16
; └
  ret double %17
}

julia> @allocated g.test2(g.f[][1,1,1],g.f[][2,2,2])
80

So why is this not zero allocation?

Array{mystruct1{Float64}} is not a concrete type, because the number of dimensions of the array is not specified.
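You can check this directly:

julia> isconcretetype(Array{Float64})
false

julia> isconcretetype(Array{Float64,3})
true

The same holds for Array{mystruct1{Float64}} versus Array{mystruct1{Float64},3}: without the dimension parameter, the compiler cannot know the concrete type of f[].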


Oooooo… that is interesting. I can put the dimensions in the type parameters, right?

I do seem to get allocations in my MWE as well, and it doesn’t have an Array that is part of another structure.

So maybe this is true, but not the issue: I pass the array into the function without the object, and in my MWE I don’t even pass the array in.

With dump_module=true set:

julia> @code_llvm dump_module=true g.test2(g.f[][1,1,1],g.f[][2,2,2])
; ModuleID = 'test2'
source_filename = "test2"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-w64-windows-gnu-elf"

;  @ C:\mypath\allocation_test.jl:22 within `test2`
; Function Attrs: uwtable
define double @julia_test2_3533([3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1) #0 {
top:
;  @ C:\mypath\allocation_test.jl:23 within `test2`
; ┌ @ Base.jl:37 within `getproperty`
   %2 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 0
   %3 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 0
; └
; ┌ @ float.jl:410 within `*`
   %4 = load double, double* %2, align 8
   %5 = load double, double* %3, align 8
   %6 = fmul double %4, %5
; └
; ┌ @ Base.jl:37 within `getproperty`
   %7 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 1
   %8 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 1
; └
; ┌ @ float.jl:410 within `*`
   %9 = bitcast double* %7 to <2 x double>*
   %10 = load <2 x double>, <2 x double>* %9, align 8
   %11 = bitcast double* %8 to <2 x double>*
   %12 = load <2 x double>, <2 x double>* %11, align 8
   %13 = fmul <2 x double> %10, %12
; └
; ┌ @ operators.jl:578 within `+` @ float.jl:408
   %14 = extractelement <2 x double> %13, i64 0
   %15 = fadd double %6, %14
   %16 = extractelement <2 x double> %13, i64 1
   %17 = fadd double %15, %16
; └
  ret double %17
}

; Function Attrs: uwtable
define nonnull {}* @jfptr_test2_3534({}* %0, {}** noalias nocapture noundef readonly %1, i32 %2) #0 {
top:
  %3 = call {}*** inttoptr (i64 140729193168320 to {}*** ()*)() #4
  %4 = bitcast {}** %1 to [3 x double]**
  %5 = load [3 x double]*, [3 x double]** %4, align 8
  %6 = getelementptr inbounds {}*, {}** %1, i64 1
  %7 = bitcast {}** %6 to [3 x double]**
  %8 = load [3 x double]*, [3 x double]** %7, align 8
  %9 = call double @julia_test2_3533([3 x double]* nocapture readonly %5, [3 x double]* nocapture readonly %8) #0
  %ptls_field2 = getelementptr inbounds {}**, {}*** %3, i64 2
  %10 = bitcast {}*** %ptls_field2 to i8**
  %ptls_load34 = load i8*, i8** %10, align 8
  %11 = call noalias nonnull {}* @ijl_gc_pool_alloc(i8* %ptls_load34, i32 1392, i32 16) #1
  %12 = bitcast {}* %11 to i64*
  %13 = getelementptr inbounds i64, i64* %12, i64 -1
  store atomic i64 140728335325888, i64* %13 unordered, align 8
  %14 = bitcast {}* %11 to double*
  store double %9, double* %14, align 8
  ret {}* %11
}

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @julia.gc_alloc_obj({}**, i64, {}*) #1

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #2

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #2

; Function Attrs: inaccessiblemem_or_argmemonly
declare void @ijl_gc_queue_root({}*) #3

; Function Attrs: inaccessiblemem_or_argmemonly
declare void @jl_gc_queue_binding({}*) #3

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @ijl_gc_pool_alloc(i8*, i32, i32) #1

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @ijl_gc_big_alloc(i8*, i64) #1

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @ijl_gc_alloc_typed(i8*, i64, i8*) #1

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @julia.gc_alloc_bytes(i8*, i64) #1

attributes #0 = { uwtable "frame-pointer"="all" }
attributes #1 = { allocsize(1) }
attributes #2 = { argmemonly nofree nosync nounwind willreturn }
attributes #3 = { inaccessiblemem_or_argmemonly }
attributes #4 = { nounwind readnone }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}


These are just benchmarking artifacts from the fact that the macro must return the values to the REPL:

julia> import .allocation_test as g

julia> @allocated g.test1(g.f[][1,1,1],g.f[][2,2,2])
96

julia> using BenchmarkTools

julia> @ballocated g.test1($(g.f[][1,1,1]),$(g.f[][2,2,2]))
0

You should use Array{T,3}, for example.


That is tricky. I’m going to try the Array{T,4} mod since my MWE now works.

Winner winner chicken dinner!


Thanks, everyone, for your contributions! I have learned about two dozen new things in this thread. You all get 5-star reviews from me.

Now if you manage to write all of those up in a blog post xD
Kidding aside, glad that you figured it out.

1 Like

The definition of the Array was not in my high-rate loop, and that loop was the only thing I checked with JET. I didn’t check the data-initialization functions, so JET didn’t discover the type instability of the Array declared with no dimensions. I think I will run it on the init functions as well and see what else shows up.
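A sketch of what that extra check could look like (data and init_data below are hypothetical stand-ins for the real initialization code); JET’s @report_opt is the mode that flags runtime dispatch, which is what an un-dimensioned Array causes:

using JET

const data = Ref{Array{Float64}}()   # abstract container type, like the MWE above

function init_data()
    data[] = zeros(10, 10, 3)
    return sum(data[])               # data[] has abstract type, so this call dispatches at runtime
end

@report_opt init_data()              # should flag the runtime dispatch that a plain @report_call would not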

Tagging on. One thing we notice is that the GC keeps kicking in periodically even though there are no longer any allocations happening. When we force a GC to clear out anything left over from loading and dereferencing the data, the collections finish more quickly, but the GC still kicks in periodically.

We attempted to disable the GC, but it looks like it still periodically takes time checking something. Is there a way to make this stop explicitly?

This is happening when calling shared C libraries on Linux that we built with PackageCompiler.

Try running with GC.enable_logging(true)

The work the GC does is proportional to the live heap, not to the garbage, so if it gets triggered it will do a heap scan.
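For reference, a minimal sketch of watching the collector with plain Base functions (run_workload is a hypothetical stand-in for your loop):

GC.enable_logging(true)    # print a line for each collection: pause time and how much was collected
run_workload()
GC.enable_logging(false)

GC.gc(false)   # incremental (young-generation) collection
GC.gc(true)    # full collection: marks the entire live heap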

How did you measure?

It is being measured on the C side. I changed some function names, and these are just the relevant lines. When I test on the Julia side calling the same functions, I don’t see a problem.

I guess you’re suggesting setting that log flag so it is enabled in the Julia runtime that the C library calls into.

When we trigger the GC manually and frequently, we don’t have big run-time spikes and we can meet timing, but it is strange that that is needed. Disabling GC does not seem to stop the spikes, and I think it makes things a little worse.

Could the Julia C shared libraries that facilitate the C calls be generating some allocations?

        clock_gettime(CLOCK_REALTIME, &tic);

        julia_setup(1000.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0);
        ret=julia_calculate();
        ret = julia_get_mag_phase(mag,phase,arraylen);

        clock_gettime(CLOCK_REALTIME, &toc);

        deltat = toc.tv_sec + toc.tv_nsec*1e-9-(tic.tv_sec + tic.tv_nsec*1e-9);
        totalt += deltat;

Also, our integration engineer just told me that on the real-time software they run a memory lock, which he also put in his test program. When it is in there, it makes GC much worse; when it isn’t, GC is faster.

    // Prevent page faults by pulling data into memory now
    // if (mlockall(MCL_CURRENT | MCL_FUTURE))
    // {
    //     throw std::runtime_error("Can't lock memory, do you have sufficient privileges?");
    // }

Did you give Julia a hint?

--heap-size-hint

So this is interesting, but at this point it is very hard to help debug without an MWE. If GC is truly disabled you should see no variance.
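For reference, the explicit disable/re-enable pattern uses plain Base functions; if collection is truly off during the timed section, GC pauses cannot be the source of the spikes:

prev = GC.enable(false)   # returns the previous setting; allocation still works, collection is deferred
# ... time-critical section (ideally allocation-free) ...
GC.enable(prev)
GC.gc()                   # collect at a point where a pause is acceptable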


We did not specify a hint. These are C shared libraries compiled with PackageCompiler.jl, and I’m not sure how to set a hint for those. BUT there is a setup phase which requires loading JLD2 files, or even CSV files if that option is chosen. Those would be expected to allocate on the heap, along with the database.

The tight realtime loop expects zero allocations.

Yeah, I’m wondering if I have a library problem between my development environment on Windows and the CentOS Linux that the C libs run on on the real-time machines. I have an Ubuntu Linux install on the same machine as the Windows one, so I’m going to do more testing there. The first step is to see whether I can duplicate the problem in the Julia-only code vs. the shared-library code. Perhaps I’ll need to ccall back into the compiled code and make sure it is all the same.

In the C structure, jl_options_t, the last field is heap_size_hint.

This controls when greedy garbage collection starts to occur, which seems directly related to your issue.

I’m unclear if it would be better to raise or lower this value in your case.

This value influences when the GC starts kicking in. By default it is set to 90% of available memory. You can obtain the value it was set to by evaluating the following:

Int(ccall(:jl_gc_get_max_memory, UInt64, ()))/ 1024^3