GC occurs at the worst time in tight loop (Garbage Collection)

I think you could simplify to:

@noinline function report_error(outlen, len)
    @error "Pointer to Array of length $outlen is required! Function called with len=$len."
end 

@noinline function report_recalculating()
    @error "Results need recalculating first."
end

Base.@ccallable function julia_get_mag_phase(cmag::Ptr{Cdouble},cphase::Ptr{Cdouble},len::Csize_t)::Cint
    if isCalculationValid[]
        db_mag = outputRef_db_mag[]
        deg_phase = outputRef_deg_phase[]
        outlen = length(db_mag)
        if outlen != len
            report_error(outlen, len)
            return Cint(1)
        end
        @inbounds for ii=1:outlen
            unsafe_store!(cmag, db_mag[ii], ii)
            unsafe_store!(cphase, deg_phase[ii], ii)
        end
        return Cint(0)
    else
        report_recalculating()
        return Cint(3)
    end
end

Alternatively, if you are not doing double buffering or something similar on the C side:

You could “just” Base.unsafe_wrap(Array, cmag, len; own=false) and use that Array for outputRef_db_mag.

Then your ccallable collapses down to isCalculationValid :wink: and you don’t do the data movement.
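A minimal sketch of that zero-copy idea, assuming the C side hands over a pointer and a length. Here `cbuf` is a hypothetical stand-in for the C-owned buffer so the snippet runs on its own; in the real code the pointer would come in through the ccallable.

```julia
# Hypothetical stand-in for memory owned by the C caller.
cbuf = Vector{Cdouble}(undef, 4)

GC.@preserve cbuf begin
    cmag = pointer(cbuf)   # Ptr{Cdouble}, as the C side would pass it
    # Wrap the existing memory as a Julia Array without copying.
    # own=false means Julia never tries to free it.
    db_mag = unsafe_wrap(Array, cmag, length(cbuf); own=false)
    for ii in eachindex(db_mag)
        db_mag[ii] = 10.0 * ii   # writes land directly in the C buffer
    end
end
```

The wrapped Array is only valid while the C buffer is alive, so this only works if the C side keeps that memory around between calls.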

This has some really cool insight in it. Thanks!

I need to think a bit on this. I’m not sure I understand it, but what I have is not allocating at this point… there… it goes back to the data access I originally proposed.

Ok. State of where I am.
I got rid of all allocations except for the original function call in the original message that accessed the 4D array from the struct. I tried making that struct mutable and it did not matter.

I reworked my code to get rid of the ambiguities at the line.
The allocations are coming from accessing the obj.data array.

This is a read operation. There are no other allocations in any of the functions, and the amount of memory being allocated adds up to all of the calls to these variables.

There was some conversation earlier in this thread, but I couldn’t find the conclusion in all of the discussion.

Here is the trace for the allocations… where I have edited some of the function and variable names just to keep it generic.

0         for ii=1:number
     1024             a=obj.data[ii,j0,k0,ifreq]
     1024             b=obj.data[ii,j1,k0,ifreq]
     1024             c=obj.data[ii,j1,k1,ifreq]
     1024             d=obj.data[ii,j0,k1,ifreq]
        0             result[ii]=fnc_inlined(a,
        -                             b,
        -                             c,
        -                             d,
        -                             t,u)

If you already told me why this makes sense, can you tell me again?

While I ponder what is going on here… I have disabled GC right before this line and re-enabled right after. This removes all garbage collection from the @benchmark’s even for a function that calls this 1e6 times.

To make it sustainable, I guess I need to create a little C-call function that refreshes the GC at an idle time, or after the read of the data.
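A rough sketch of that pattern, assuming plain Julia functions named `with_gc_paused` and `collect_at_idle` (both names are made up here). Either could be exposed to C with Base.@ccallable in the same way as the function earlier in the thread.

```julia
# Run f() with the GC switched off, then restore the previous GC state.
function with_gc_paused(f)
    was_enabled = GC.enable(false)   # GC.enable returns the previous state
    try
        return f()
    finally
        GC.enable(was_enabled)
    end
end

# Call this from an idle point to pay the collection cost at a chosen time.
function collect_at_idle()
    GC.gc(false)   # false requests an incremental (not full) collection
    return Cint(0)
end

result = with_gc_paused(() -> sum(1:10))
collect_at_idle()
```

Any allocations made while the GC is off simply accumulate, so this only stays sustainable if the idle-time collection actually gets called.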

I’m wondering: if I made a, b, c, and d Refs, would that cure all my allocations? It seems a bit hacky.

Thanks again for the tremendous support.

Try using the allocation profiler instead of --track-allocation.

Do you have a standalone MWE for that? It would make it easier to show how I analyze where allocations are coming from.
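For reference, a small sketch of the allocation profiler from the Profile standard library (Profile.Allocs, available from Julia 1.8 on). The workload `make_vectors` is a hypothetical function that allocates on purpose; it is not from the code under discussion.

```julia
using Profile

# Hypothetical workload that allocates on purpose.
make_vectors(n) = [zeros(8) for _ in 1:n]

make_vectors(1)   # warm up so compilation is not profiled
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1 make_vectors(100)
results = Profile.Allocs.fetch()

# Each sampled allocation records its type, its size in bytes, and a stack trace.
for a in first(results.allocs, 5)
    println(a.type, "  ", a.size, " bytes")
end
```

Unlike the per-line byte counts, each record carries a stack trace, which makes it easier to see whether an allocation comes from the callee or from boxing at the call site.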


Here is my attempt at a minimal working example. It is strange. When I have used immutable structures in the past, I did not think there was any allocation, and I have lots of functions that seem to do this. But when I make an example, it seems to have problems similar to what I am seeing. Now the + seems to be allocating on the heap. Does anyone know where my knowledge gap is?

#allocation test

struct mystruct1{T<:Real}
	a::T
	b::T
	c::T
	end

	@inline Base.:+(a::mystruct1{T},b::mystruct1{T}) where T<:Real=mystruct1{T}(a.a+b.a,a.b+b.b,a.c+b.c)    

	f=[mystruct1{Float64}(i,j,k) for i=1:10,j=1:10,k=1:3];

    x=mystruct1{Float64}(1.0,2.0,3.0)
    y=mystruct1{Float64}(4.0,5.0,6.0)
    
    x+y

    @show @allocated x+y 

    c=Array{mystruct1{Float64}}(undef,1)

    c[1]=x+y
    @show @allocated c[1]=x+y

    @show @allocated c[1]=f[1,1,1]+f[1,2,3]

    


    a=10.0
    b=20.0
    g=[1.0]

    @show @allocated a+b

    @show @allocated g[1]=a+b

    @inline function dostuff(x1,x2)
        x1+x2
    end

    dostuff(a,b)
    dostuff(x,y)
    dostuff(f[1,1,1],f[1,2,3])

    @show @allocated dostuff(a,b)
   @show @allocated dostuff(f[1,1,1],f[1,2,3])

Results of include

include("allocation_test.jl")
#= allocation_test.jl:18 =# @allocated(x + y) = 32
#= allocation_test.jl:23 =# @allocated(c[1] = x + y) = 32
#= allocation_test.jl:25 =# @allocated(c[1] = f[1, 1, 1] + f[1, 2, 3]) = 96
#= allocation_test.jl:34 =# @allocated(a + b) = 16
#= allocation_test.jl:36 =# @allocated(g[1] = a + b) = 16
#= allocation_test.jl:46 =# @allocated(dostuff(a, b)) = 16
#= allocation_test.jl:47 =# @allocated(dostuff(f[1, 1, 1], f[1, 2, 3])) = 96

That’s not really surprising. Let’s think through what is happening.

We are in a top-level/REPL context. That means we are in the interpreter.
The interpreter works over boxed values.

julia> @code_llvm x+y
;  @ REPL[2]:1 within `+`
define void @"julia_+_172"([3 x double]* noalias nocapture noundef nonnull sret([3 x double]) align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %2) #0 {
top:
; ┌ @ Base.jl:37 within `getproperty`
   %3 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 2
   %4 = getelementptr inbounds [3 x double], [3 x double]* %2, i64 0, i64 2
; └
;  @ REPL[2]:1 within `+` @ float.jl:409
  %unbox4 = load double, double* %3, align 8
  %unbox5 = load double, double* %4, align 8
  %5 = fadd double %unbox4, %unbox5
  %6 = bitcast [3 x double]* %1 to <2 x double>*
  %7 = load <2 x double>, <2 x double>* %6, align 8
  %8 = bitcast [3 x double]* %2 to <2 x double>*
  %9 = load <2 x double>, <2 x double>* %8, align 8
  %10 = fadd <2 x double> %7, %9
;  @ REPL[2]:1 within `+`
  %11 = bitcast [3 x double]* %0 to <2 x double>*
  store <2 x double> %10, <2 x double>* %11, align 8
  %newstruct.sroa.3.0..sroa_idx7 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 2
  store double %5, double* %newstruct.sroa.3.0..sroa_idx7, align 8
  ret void
}

No allocation in sight! But the interpreter really can’t work with raw values, and this function is working with raw values. Julia generates two functions with two different ABIs:

julia> @code_llvm dump_module=true x+y
; ModuleID = '+'
source_filename = "+"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

;  @ REPL[2]:1 within `+`
define void @"julia_+_189"([3 x double]* noalias nocapture noundef nonnull sret([3 x double]) align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %2) #0 {
top:
; ┌ @ Base.jl:37 within `getproperty`
   %3 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 2
   %4 = getelementptr inbounds [3 x double], [3 x double]* %2, i64 0, i64 2
; └
;  @ REPL[2]:1 within `+` @ float.jl:409
  %unbox4 = load double, double* %3, align 8
  %unbox5 = load double, double* %4, align 8
  %5 = fadd double %unbox4, %unbox5
  %6 = bitcast [3 x double]* %1 to <2 x double>*
  %7 = load <2 x double>, <2 x double>* %6, align 8
  %8 = bitcast [3 x double]* %2 to <2 x double>*
  %9 = load <2 x double>, <2 x double>* %8, align 8
  %10 = fadd <2 x double> %7, %9
;  @ REPL[2]:1 within `+`
  %11 = bitcast [3 x double]* %0 to <2 x double>*
  store <2 x double> %10, <2 x double>* %11, align 8
  %newstruct.sroa.3.0..sroa_idx7 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 2
  store double %5, double* %newstruct.sroa.3.0..sroa_idx7, align 8
  ret void
}

; Function Attrs: noinline optnone
define nonnull {}* @"jfptr_+_190"({}* %function, {}** noalias nocapture noundef readonly %args, i32 %nargs) #1 {
top:
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"()
  %tls_ppgcstack = getelementptr i8, i8* %thread_ptr, i64 -8
  %0 = bitcast i8* %tls_ppgcstack to {}****
  %tls_pgcstack = load {}***, {}**** %0, align 8
  %sret = alloca [3 x double], align 8
  %1 = getelementptr inbounds {}*, {}** %args, i32 0
  %2 = load {}*, {}** %1, align 8
  %3 = bitcast {}* %2 to [3 x double]*
  %4 = getelementptr inbounds {}*, {}** %args, i32 1
  %5 = load {}*, {}** %4, align 8
  %6 = bitcast {}* %5 to [3 x double]*
  call void @"julia_+_189"([3 x double]* noalias nocapture noundef sret([3 x double]) %sret, [3 x double]* nocapture readonly %3, [3 x double]* nocapture readonly %6)
  %7 = bitcast {}*** %tls_pgcstack to {}**
  %current_task = getelementptr inbounds {}*, {}** %7, i64 -14
  %ptls_field = getelementptr inbounds {}*, {}** %current_task, i64 16
  %ptls_load = load {}*, {}** %ptls_field, align 8
  %ptls = bitcast {}* %ptls_load to {}**
  %8 = bitcast {}** %ptls to i8*
  %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc(i8* %8, i32 1184, i32 32) #7
  %9 = bitcast {}* %box to i64*
  %10 = getelementptr inbounds i64, i64* %9, i64 -1
  store atomic i64 140457949127632, i64* %10 unordered, align 8
  %11 = bitcast {}* %box to i8*
  %12 = bitcast [3 x double]* %sret to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %11, i8* %12, i64 24, i1 false)
  ret {}* %box
}

The job of jfptr_+_190 is to translate from boxed values to raw values and back,
and indeed for the return argument it needs to call %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc(i8* %8, i32 1184, i32 32) #7 to create a box of 32 bytes.

That’s the 32 bytes @allocated reports. Why 32? It’s an 8-byte tag plus 3×8 bytes of data.

julia> function dostuff(x1,x2)
               y = x1+x2
               return y.a+y.b+y.c
       end
julia> @allocated dostuff(x, y)
16

Here it is an 8-byte tag plus an 8-byte value.

Julia generally has two calling conventions (actually a few more, but only these two matter here): a “fast” one that works directly with raw values, and a “slow” one that works with boxed values.

This second one is the one I referred to in “I can't help but think debugging memory allocations shouldn't be this hard” (#18 by vchuravy).

If we don’t have enough information locally (because one of the arguments is type-unstable) we will generally fall back to a slower calling convention and that can cause allocations at the call site (and for the return value).
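To see the difference in practice, here is a sketch that measures from inside a compiled function, so the call stays on the fast path. `Vec3` and `measure` are hypothetical names that mirror the MWE; they are not from the original code.

```julia
struct Vec3
    a::Float64
    b::Float64
    c::Float64
end
Base.:+(p::Vec3, q::Vec3) = Vec3(p.a + q.a, p.b + q.b, p.c + q.c)

# Inside a function with concrete argument types, no boxing is needed.
function measure(p::Vec3, q::Vec3)
    p + q   # warm up
    return @allocated p + q
end

# Size of a boxed Vec3 on a 64-bit machine: payload plus one pointer-sized tag.
box_bytes = sizeof(Vec3) + sizeof(Ptr{Cvoid})
inner = measure(Vec3(1, 2, 3), Vec3(4, 5, 6))
println("inside a function: ", inner, " bytes; one box would be ", box_bytes, " bytes")
```

The same `p + q` measured with @allocated at top level reports the box instead, which is where the 32 bytes above come from.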


Ok. I was wondering if something like that was the problem with my MWE.
I’ll stick it in a module and do another test to see if I can recreate the problem when I get back to my computer.

I’m wondering if the size of my static structure in the array is the problem. Each one is basically 3 complex numbers and a float. I’m storing them as Float64s. I wonder if it would do something different if I used Float32s.

#allocation test

module allocation_test

    struct mystruct1{T<:Real}
        a::T
        b::T
        c::T
        end

	@inline Base.:+(a::mystruct1{T},b::mystruct1{T}) where T<:Real=mystruct1{T}(a.a+b.a,a.b+b.b,a.c+b.c)    

    const f=Ref{Array{mystruct1{Float64}}}()

	f[]=[mystruct1{Float64}(i,j,k) for i=1:10,j=1:10,k=1:3];

    
    function test1(x::mystruct1{T},y::mystruct1{T}) where T<:Real
        x+y
    end

    function test2(x::mystruct1{T},y::mystruct1{T}) where T<:Real
        x.a*y.a + x.b *y.b + x.c * y.c
    end


end


julia> @allocated g.test1(g.f[][1,1,1],g.f[][2,2,2])
96

julia> @code_llvm g.test1(g.f[][1,1,1],g.f[][2,2,2])
;  @ C:\mypath\allocation_test.jl:18 within `test1`
; Function Attrs: uwtable
define void @julia_test1_3144([3 x double]* noalias nocapture noundef nonnull sret([3 x double]) align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %2) #0 {
top:
;  @ C:\mypath\allocation_test.jl:19 within `test1`
; ┌ @ C:\mypath\allocation_test.jl:11 within `+` @ float.jl:408
   %3 = bitcast [3 x double]* %1 to <2 x double>*
   %4 = load <2 x double>, <2 x double>* %3, align 8
   %5 = bitcast [3 x double]* %2 to <2 x double>*
   %6 = load <2 x double>, <2 x double>* %5, align 8
   %7 = fadd <2 x double> %4, %6
; │ @ C:\mypath\allocation_test.jl:11 within `+`
; │┌ @ Base.jl:37 within `getproperty`
    %8 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 2
    %9 = getelementptr inbounds [3 x double], [3 x double]* %2, i64 0, i64 2
; │└
; │ @ C:\mypath\allocation_test.jl:11 within `+` @ float.jl:408
   %10 = load double, double* %8, align 8
   %11 = load double, double* %9, align 8
   %12 = fadd double %10, %11
; └
  %13 = bitcast [3 x double]* %0 to <2 x double>*
  store <2 x double> %7, <2 x double>* %13, align 8
  %.sroa.3.0..sroa_idx2 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 2
  store double %12, double* %.sroa.3.0..sroa_idx2, align 8
  ret void
}
julia> @code_llvm g.test2(g.f[][1,1,1],g.f[][2,2,2])
;  @ C:\mypath\allocation_test.jl:22 within `test2`
; Function Attrs: uwtable
define double @julia_test2_3227([3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1) #0 {
top:
;  @ C:\mypath\allocation_test.jl:23 within `test2`
; ┌ @ Base.jl:37 within `getproperty`
   %2 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 0
   %3 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 0
; └
; ┌ @ float.jl:410 within `*`
   %4 = load double, double* %2, align 8
   %5 = load double, double* %3, align 8
   %6 = fmul double %4, %5
; └
; ┌ @ Base.jl:37 within `getproperty`
   %7 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 1
   %8 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 1
; └
; ┌ @ float.jl:410 within `*`
   %9 = bitcast double* %7 to <2 x double>*
   %10 = load <2 x double>, <2 x double>* %9, align 8
   %11 = bitcast double* %8 to <2 x double>*
   %12 = load <2 x double>, <2 x double>* %11, align 8
   %13 = fmul <2 x double> %10, %12
; └
; ┌ @ operators.jl:578 within `+` @ float.jl:408
   %14 = extractelement <2 x double> %13, i64 0
   %15 = fadd double %6, %14
   %16 = extractelement <2 x double> %13, i64 1
   %17 = fadd double %15, %16
; └
  ret double %17
}

julia> @allocated g.test2(g.f[][1,1,1],g.f[][2,2,2])
80

So why is this not zero allocation?

Array{mystruct1{Float64}} is not a concrete type, because the dimension of the array is not specified.


Oooooo… that is interesting. I can put the dimensions in the type parameters, right?

I do seem to get allocations in my MWE as well, and it doesn’t have an Array that is part of another structure.

So maybe this is true, but not the issue. I pass the array in to the function without the object. And in my MWE, I don’t even pass the array in.

Here is the output with dump_module=true set.

julia> @code_llvm dump_module=true g.test2(g.f[][1,1,1],g.f[][2,2,2])
; ModuleID = 'test2'
source_filename = "test2"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-w64-windows-gnu-elf"

;  @ C:\mypath\allocation_test.jl:22 within `test2`
; Function Attrs: uwtable
define double @julia_test2_3533([3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1) #0 {
top:
;  @ C:\mypath\allocation_test.jl:23 within `test2`
; ┌ @ Base.jl:37 within `getproperty`
   %2 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 0
   %3 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 0
; └
; ┌ @ float.jl:410 within `*`
   %4 = load double, double* %2, align 8
   %5 = load double, double* %3, align 8
   %6 = fmul double %4, %5
; └
; ┌ @ Base.jl:37 within `getproperty`
   %7 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 1
   %8 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 1
; └
; ┌ @ float.jl:410 within `*`
   %9 = bitcast double* %7 to <2 x double>*
   %10 = load <2 x double>, <2 x double>* %9, align 8
   %11 = bitcast double* %8 to <2 x double>*
   %12 = load <2 x double>, <2 x double>* %11, align 8
   %13 = fmul <2 x double> %10, %12
; └
; ┌ @ operators.jl:578 within `+` @ float.jl:408
   %14 = extractelement <2 x double> %13, i64 0
   %15 = fadd double %6, %14
   %16 = extractelement <2 x double> %13, i64 1
   %17 = fadd double %15, %16
; └
  ret double %17
}

; Function Attrs: uwtable
define nonnull {}* @jfptr_test2_3534({}* %0, {}** noalias nocapture noundef readonly %1, i32 %2) #0 {
top:
  %3 = call {}*** inttoptr (i64 140729193168320 to {}*** ()*)() #4
  %4 = bitcast {}** %1 to [3 x double]**
  %5 = load [3 x double]*, [3 x double]** %4, align 8
  %6 = getelementptr inbounds {}*, {}** %1, i64 1
  %7 = bitcast {}** %6 to [3 x double]**
  %8 = load [3 x double]*, [3 x double]** %7, align 8
  %9 = call double @julia_test2_3533([3 x double]* nocapture readonly %5, [3 x double]* nocapture readonly %8) #0
  %ptls_field2 = getelementptr inbounds {}**, {}*** %3, i64 2
  %10 = bitcast {}*** %ptls_field2 to i8**
  %ptls_load34 = load i8*, i8** %10, align 8
  %11 = call noalias nonnull {}* @ijl_gc_pool_alloc(i8* %ptls_load34, i32 1392, i32 16) #1
  %12 = bitcast {}* %11 to i64*
  %13 = getelementptr inbounds i64, i64* %12, i64 -1
  store atomic i64 140728335325888, i64* %13 unordered, align 8
  %14 = bitcast {}* %11 to double*
  store double %9, double* %14, align 8
  ret {}* %11
}

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @julia.gc_alloc_obj({}**, i64, {}*) #1

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #2

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #2

; Function Attrs: inaccessiblemem_or_argmemonly
declare void @ijl_gc_queue_root({}*) #3

; Function Attrs: inaccessiblemem_or_argmemonly
declare void @jl_gc_queue_binding({}*) #3

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @ijl_gc_pool_alloc(i8*, i32, i32) #1

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @ijl_gc_big_alloc(i8*, i64) #1

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @ijl_gc_alloc_typed(i8*, i64, i8*) #1

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @julia.gc_alloc_bytes(i8*, i64) #1

attributes #0 = { uwtable "frame-pointer"="all" }
attributes #1 = { allocsize(1) }
attributes #2 = { argmemonly nofree nosync nounwind willreturn }
attributes #3 = { inaccessiblemem_or_argmemonly }
attributes #4 = { nounwind readnone }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}

julia> 

These are just benchmarking artifacts from the fact that the macro must return the values to the REPL:

julia> import .allocation_test as g

julia> @allocated g.test1(g.f[][1,1,1],g.f[][2,2,2])
96

julia> using BenchmarkTools

julia> @ballocated g.test1($(g.f[][1,1,1]),$(g.f[][2,2,2]))
0

You should use Array{T,3}, for example.
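A quick way to check this, as a sketch: `grid` and `corner_sum` are hypothetical names, and Float64 stands in for the struct element type from the MWE.

```julia
# Array{T} leaves the number of dimensions open, so it is not concrete.
@show isconcretetype(Array{Float64})
# Array{T,3} fixes both the element type and the dimension.
@show isconcretetype(Array{Float64,3})

# Hypothetical container mirroring the Ref in the MWE, with the dimension spelled out.
grid = Ref{Array{Float64,3}}(zeros(10, 10, 3))

# With a concrete element type in the Ref, the compiler knows what r[] returns.
corner_sum(r) = r[][1, 1, 1] + r[][10, 10, 3]
corner_sum(grid)
```

The same applies to struct fields: a field declared as Array{T} without the dimension is abstractly typed, so reads through it can fall back to dynamic dispatch.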


That is tricky. I’m going to try the Array{T,4} mod since my MWE now works.

Winner winner chicken dinner!


Thanks, everyone, for your contributions! I have learned about 2 dozen new things on this thread. You all get 5-star reviews from me.

Now if you manage to write all of those up in a blog post xD
Kidding aside, glad that you figured it out.


The definition of the Array was not in my high-rate loop. That was the only thing I checked with JET. I didn’t check the data initialization functions, so it didn’t discover the type instability of the Array with no dimensions. I think I will try to run it on the init functions as well and see what else shows up.

Tagging on. One thing we notice is that the GC keeps kicking in periodically even though there are no longer allocations happening. When we force a GC to clear out anything left over from loading and dereferencing data, the GC happens more quickly, but it still checks periodically.

We attempted to disable the GC, but it still seems to periodically take time checking something. Is there a way to make this stop explicitly?

This is happening when calling shared C-function libraries on Linux that we have used the package compiler to build.