GC occurs at the worst time in tight loop (Garbage Collection)

That’s not really surprising. Let’s think through what is happening.

We are in a top-level/REPL context, which means we are in the interpreter.
The interpreter works over boxed values.

julia> @code_llvm x+y
;  @ REPL[2]:1 within `+`
define void @"julia_+_172"([3 x double]* noalias nocapture noundef nonnull sret([3 x double]) align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %2) #0 {
top:
; ┌ @ Base.jl:37 within `getproperty`
   %3 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 2
   %4 = getelementptr inbounds [3 x double], [3 x double]* %2, i64 0, i64 2
; └
;  @ REPL[2]:1 within `+` @ float.jl:409
  %unbox4 = load double, double* %3, align 8
  %unbox5 = load double, double* %4, align 8
  %5 = fadd double %unbox4, %unbox5
  %6 = bitcast [3 x double]* %1 to <2 x double>*
  %7 = load <2 x double>, <2 x double>* %6, align 8
  %8 = bitcast [3 x double]* %2 to <2 x double>*
  %9 = load <2 x double>, <2 x double>* %8, align 8
  %10 = fadd <2 x double> %7, %9
;  @ REPL[2]:1 within `+`
  %11 = bitcast [3 x double]* %0 to <2 x double>*
  store <2 x double> %10, <2 x double>* %11, align 8
  %newstruct.sroa.3.0..sroa_idx7 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 2
  store double %5, double* %newstruct.sroa.3.0..sroa_idx7, align 8
  ret void
}

No allocation in sight! But the interpreter really can’t work with raw values, and this function works with raw values. So Julia generates two functions with two different ABIs:

julia> @code_llvm dump_module=true x+y
; ModuleID = '+'
source_filename = "+"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

;  @ REPL[2]:1 within `+`
define void @"julia_+_189"([3 x double]* noalias nocapture noundef nonnull sret([3 x double]) align 8 dereferenceable(24) %0, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %1, [3 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(24) %2) #0 {
top:
; ┌ @ Base.jl:37 within `getproperty`
   %3 = getelementptr inbounds [3 x double], [3 x double]* %1, i64 0, i64 2
   %4 = getelementptr inbounds [3 x double], [3 x double]* %2, i64 0, i64 2
; └
;  @ REPL[2]:1 within `+` @ float.jl:409
  %unbox4 = load double, double* %3, align 8
  %unbox5 = load double, double* %4, align 8
  %5 = fadd double %unbox4, %unbox5
  %6 = bitcast [3 x double]* %1 to <2 x double>*
  %7 = load <2 x double>, <2 x double>* %6, align 8
  %8 = bitcast [3 x double]* %2 to <2 x double>*
  %9 = load <2 x double>, <2 x double>* %8, align 8
  %10 = fadd <2 x double> %7, %9
;  @ REPL[2]:1 within `+`
  %11 = bitcast [3 x double]* %0 to <2 x double>*
  store <2 x double> %10, <2 x double>* %11, align 8
  %newstruct.sroa.3.0..sroa_idx7 = getelementptr inbounds [3 x double], [3 x double]* %0, i64 0, i64 2
  store double %5, double* %newstruct.sroa.3.0..sroa_idx7, align 8
  ret void
}

; Function Attrs: noinline optnone
define nonnull {}* @"jfptr_+_190"({}* %function, {}** noalias nocapture noundef readonly %args, i32 %nargs) #1 {
top:
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"()
  %tls_ppgcstack = getelementptr i8, i8* %thread_ptr, i64 -8
  %0 = bitcast i8* %tls_ppgcstack to {}****
  %tls_pgcstack = load {}***, {}**** %0, align 8
  %sret = alloca [3 x double], align 8
  %1 = getelementptr inbounds {}*, {}** %args, i32 0
  %2 = load {}*, {}** %1, align 8
  %3 = bitcast {}* %2 to [3 x double]*
  %4 = getelementptr inbounds {}*, {}** %args, i32 1
  %5 = load {}*, {}** %4, align 8
  %6 = bitcast {}* %5 to [3 x double]*
  call void @"julia_+_189"([3 x double]* noalias nocapture noundef sret([3 x double]) %sret, [3 x double]* nocapture readonly %3, [3 x double]* nocapture readonly %6)
  %7 = bitcast {}*** %tls_pgcstack to {}**
  %current_task = getelementptr inbounds {}*, {}** %7, i64 -14
  %ptls_field = getelementptr inbounds {}*, {}** %current_task, i64 16
  %ptls_load = load {}*, {}** %ptls_field, align 8
  %ptls = bitcast {}* %ptls_load to {}**
  %8 = bitcast {}** %ptls to i8*
  %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc(i8* %8, i32 1184, i32 32) #7
  %9 = bitcast {}* %box to i64*
  %10 = getelementptr inbounds i64, i64* %9, i64 -1
  store atomic i64 140457949127632, i64* %10 unordered, align 8
  %11 = bitcast {}* %box to i8*
  %12 = bitcast [3 x double]* %sret to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %11, i8* %12, i64 24, i1 false)
  ret {}* %box
}

The job of jfptr_+_190 is to translate from boxed values to raw values and back,
and indeed for the return value it needs to call %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc(i8* %8, i32 1184, i32 32) #7 to create a box of 32 bytes.

That’s the 32 bytes that @allocated reports. Why 32? It’s an 8-byte tag + 3×8 bytes of data.

julia> function dostuff(x1,x2)
               y = x1+x2
               return y.a+y.b+y.c
       end
julia> @allocated dostuff(x, y)
16

Here it is an 8-byte tag + an 8-byte value, since only the Float64 result has to be boxed.
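The arithmetic behind both numbers can be checked directly. This is only a sketch, assuming a 64-bit machine. `Vec3` and its `+` method are hypothetical stand-ins for the type of `x` and `y`, whose definition is not shown in this post. The three `Float64` fields `a`, `b`, `c` are inferred from the `[3 x double]` in the IR and the field accesses in `dostuff`. The 8-byte tag is taken from the explanation above rather than queried from the runtime.

```julia
# Hypothetical stand-in for the struct used in this thread.
struct Vec3
    a::Float64
    b::Float64
    c::Float64
end

# Assumed definition of `+`; the original method is not shown in this post.
Base.:+(p::Vec3, q::Vec3) = Vec3(p.a + q.a, p.b + q.b, p.c + q.c)

const TAG_BYTES = 8  # type tag stored in front of a boxed value (64-bit assumption)

boxed_vec3_bytes  = TAG_BYTES + sizeof(Vec3)     # 8 + 3*8
boxed_float_bytes = TAG_BYTES + sizeof(Float64)  # 8 + 8

# Measuring inside a function keeps the call on the fast, unboxed path,
# so no box for the result should be needed.
function allocated_in_function(p::Vec3, q::Vec3)
    return @allocated p + q
end

x = Vec3(1.0, 2.0, 3.0)
y = Vec3(4.0, 5.0, 6.0)
allocated_in_function(x, y)           # first call, triggers compilation

println(boxed_vec3_bytes)             # 32
println(boxed_float_bytes)            # 16
println(allocated_in_function(x, y))  # expected to be 0 in compiled code
```

Moving the measurement into a function is the usual way to see what the compiled code does, rather than what the top-level call does.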

Julia generally has two calling conventions (actually a few more, but only these two matter here): a “fast” one that works directly with raw values, and a “slow” one that works with boxed values.

This second one is the one I referred to in “I can't help but think debugging memory allocations shouldn't be this hard” (post #18 by vchuravy).

If we don’t have enough information locally (because one of the arguments is type-unstable), we will generally fall back to a slower calling convention, and that can cause allocations at the call site (and for the return value).
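To illustrate that fallback, here is a small sketch with made-up functions (`sum_any`, `sum_floats`, and `measure` are illustrative names, not anything from this thread). The two loops are identical, but in the first the element type is hidden behind `Vector{Any}`, so each `+` has to go through dynamic dispatch and hand back a boxed result. The measurement is wrapped in a function so that the top-level boxing of the return value discussed above does not show up in the numbers.

```julia
# Element type hidden from the compiler: every `total + v` is a dynamic call.
function sum_any(values::Vector{Any})
    total = 0.0
    for v in values
        total += v   # result comes back boxed
    end
    return total
end

# Same loop with a concrete element type: raw Float64 arithmetic.
function sum_floats(values::Vector{Float64})
    total = 0.0
    for v in values
        total += v   # nothing to box
    end
    return total
end

# Measure from inside compiled code rather than from top level.
measure(f, arg) = @allocated f(arg)

floats  = rand(1_000)
untyped = Any[v for v in floats]

# Warm up so that compilation is not counted.
measure(sum_any, untyped)
measure(sum_floats, floats)

unstable_bytes = measure(sum_any, untyped)
stable_bytes   = measure(sum_floats, floats)
println((unstable_bytes, stable_bytes))
```

The exact byte count for the unstable version depends on the Julia version, so only the contrast between the two numbers should be read into it.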
