Request information about GC implementation

Hi All,

Could you please tell me: do you use the LLVM GC framework support to implement the garbage collector?

Also, I saw in some documentation that there is a method called GC.gc() to invoke the GC manually. Could you please point me to the source code of that method? (I tried to find it using grep.)

Thank you,
Kavindu

julia> @which GC.gc()
gc() in Base.GC at gcutils.jl:94

Is the GC used in Julia an accurate GC or a conservative GC?

Accurate, generational, non-concurrent (i.e. stop-the-world), non-relocating. It does not use the LLVM GC framework.


How does Julia identify the heap references on the current stack (i.e. the objects that will be marked in the mark phase)?

In the Julia code this is called "gc rooting". The general idea is that there is a second stack ("the shadow stack") that contains GC roots, i.e. local object references that must be visible to the GC and cannot be proven otherwise visible. This is implemented via a mix of C API (when you write C code that interfaces with the Julia runtime), custom LLVM passes, and code lowering.

To see this in action, consider

julia> function foo(r)
       @inbounds rr = r[1]
       push!(rr, 1)
       end;

Here, the push! will call into the runtime, and codegen/LLVM has no understanding of what this foreign call will actually do. Hence, we need to keep rr alive (r is kept alive by our caller, but it could be that the foreign function mutates r to remove the reference to rr and then calls the garbage collector).

julia> @code_llvm foo([[1]])
;  @ REPL[23]:1 within `foo'
define nonnull {}* @japi1_foo_309({}* %0, {}** %1, i32 %2) #0 {
top:
  %gcframe2 = alloca [3 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
  %3 = bitcast [3 x {}*]* %gcframe2 to i8*
  call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %3, i8 0, i32 24, i1 false)
  %4 = alloca {}**, align 8
  store volatile {}** %1, {}*** %4, align 8
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #6
  %ptls_i8 = getelementptr i8, i8* %thread_ptr, i64 -32768
  %5 = bitcast [3 x {}*]* %gcframe2 to i64*
  store i64 4, i64* %5, align 16
  %6 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
  %7 = bitcast i8* %ptls_i8 to i64*
  %8 = load i64, i64* %7, align 8
  %9 = bitcast {}** %6 to i64*
  store i64 %8, i64* %9, align 8
  %10 = bitcast i8* %ptls_i8 to {}***
  store {}** %gcframe2.sub, {}*** %10, align 8
  %11 = bitcast {}** %1 to {}****
  %12 = load {}***, {}**** %11, align 8
;  @ REPL[23]:2 within `foo'
; ┌ @ array.jl:801 within `getindex'
   %13 = load {}**, {}*** %12, align 8
   %14 = load atomic {}*, {}** %13 unordered, align 8
   %.not = icmp eq {}* %14, null
   br i1 %.not, label %fail, label %pass

fail:                                             ; preds = %top
   call void @jl_throw({}* inttoptr (i64 140358161122320 to {}*))
   unreachable

pass:                                             ; preds = %top
   %15 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
   store {}* %14, {}** %15, align 16
; └
;  @ REPL[23]:3 within `foo'
; ┌ @ array.jl:929 within `push!'
; │┌ @ array.jl:884 within `_growend!'
    call void inttoptr (i64 140358397421246 to void ({}*, i64)*)({}* nonnull %14, i64 1)
; └└
; ┌ @ array.jl:930 within `push!'
; │┌ @ abstractarray.jl:347 within `lastindex'
; ││┌ @ abstractarray.jl:312 within `eachindex'
; │││┌ @ abstractarray.jl:109 within `axes1'
; ││││┌ @ abstractarray.jl:89 within `axes'
; │││││┌ @ array.jl:133 within `size'
        %16 = bitcast {}* %14 to {}**
        %17 = getelementptr inbounds {}*, {}** %16, i64 3
        %18 = bitcast {}** %17 to i64*
        %19 = load i64, i64* %18, align 8
; │└└└└└
; │┌ @ array.jl:839 within `setindex!'
    %20 = add nsw i64 %19, -1
    %21 = bitcast {}* %14 to { i8*, i64, i16, i16, i32 }*
    %22 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %21, i64 0, i32 1
    %23 = load i64, i64* %22, align 8
    %24 = icmp ult i64 %20, %23
    br i1 %24, label %idxend, label %oob

oob:                                              ; preds = %pass
    %25 = alloca i64, align 8
    store i64 %19, i64* %25, align 8
    call void @jl_bounds_error_ints({}* %14, i64* nonnull %25, i64 1)
    unreachable

idxend:                                           ; preds = %pass
    %26 = bitcast {}* %14 to i64**
    %27 = load i64*, i64** %26, align 8
    %28 = getelementptr inbounds i64, i64* %27, i64 %20
    store i64 1, i64* %28, align 8
    %29 = load i64, i64* %9, align 8
    store i64 %29, i64* %7, align 8
; └└
  ret {}* %14
}

The Julia GC has no read barriers, but it needs a write barrier (to support the case where an old object is mutated to contain a reference to a young object):

julia> bar(a, b) = begin a[]=b; nothing; end;

julia> @code_llvm bar(Ref([1]), [1])
;  @ REPL[30]:1 within `bar'
define nonnull {}* @japi1_bar_498({}* %0, {}** %1, i32 %2) #0 {
top:
  %3 = alloca {}**, align 8
  store volatile {}** %1, {}*** %3, align 8
  %4 = load {}*, {}** %1, align 8
  %5 = getelementptr inbounds {}*, {}** %1, i64 1
  %6 = load {}*, {}** %5, align 8
; ┌ @ refvalue.jl:57 within `setindex!'
; │┌ @ Base.jl:34 within `setproperty!'
    %7 = bitcast {}* %4 to {}**
    store atomic {}* %6, {}** %7 unordered, align 8
    %8 = bitcast {}* %4 to i64*
    %9 = getelementptr inbounds i64, i64* %8, i64 -1
    %10 = load atomic i64, i64* %9 unordered, align 8
    %11 = and i64 %10, 3
    %12 = icmp eq i64 %11, 3
    br i1 %12, label %13, label %20

13:                                               ; preds = %top
    %14 = bitcast {}* %6 to i64*
    %15 = getelementptr inbounds i64, i64* %14, i64 -1
    %16 = load atomic i64, i64* %15 unordered, align 8
    %17 = and i64 %16, 1
    %18 = icmp eq i64 %17, 0
    br i1 %18, label %19, label %20

19:                                               ; preds = %13
    call void @jl_gc_queue_root({}* nonnull %4)
    br label %20

20:                                               ; preds = %19, %13, %top
; └└
  ret {}* inttoptr (i64 140358107785408 to {}*)
}

Some effort has been spent on improving loop hoisting of the write barrier, but I am in no position to properly evaluate how well we do on this.

There is a known performance pitfall with respect to multithreading and GC: GC can only be started at safepoints (allocations, foreign function calls, explicit safepoints). The shadow stack only needs to be in a correct state at safepoints (this helps performance of loops). Tight optimized loops contain no safepoints.

You see the problem: One thread enters a tight highly optimized long loop that doesn't hit a safepoint for several seconds, until it has finished munching through gigabytes of computation; in the meantime all other threads stall, waiting for an opportunity where all threads hit a safepoint and GC can start.

edit: improved write barrier example

For future work, it would be super cool to switch to using stack unwinding instead, possibly hook into the register spilling logic, and generally get rid of safepoints (i.e. use an interrupt to stop the world). However, this is unlikely to happen in the next couple of years. Another interesting idea is what the JVM does: implement safepoints by a load instruction that traps if we are supposed to yield to GC (i.e. load from a global address; when we want to stop the world, mark the page as non-readable and wait until all threads have hit our fault handler). This has the advantage that safepoints become very cheap and can be sprinkled into a lot more places.


Is one of the options similar to the Go GC implementation, which is designed to ensure consistently low latency?

Can you point me to the source code that generates this second stack and puts heap references on it?