In the julia code this is called βgc rootingβ. The general idea is that there is a second stack (βthe shadow stackβ) that contains gc roots, i.e. local object references that must be visible to the gc and cannot be proven otherwise visible. This is implemented via a mix of C-API (when you write C code that interfaces with the julia runtime), custom llvm passes, and code lowering.
To see this in action, consider
julia> function foo(r)
@inbounds rr = r[1]
push!(rr, 1)
end;
Here, the push!
will call into the runtime, and codegen/llvm has no understanding what this foreign call will actually do. Hence, we need to keep rr
alive (r
is kept alive by out caller β but it could be that the foreign function mutates r
to remove the reference to rr
and then calls the garbage collector).
julia> @code_llvm foo([[1]])
; @ REPL[23]:1 within `foo'
define nonnull {}* @japi1_foo_309({}* %0, {}** %1, i32 %2) #0 {
top:
%gcframe2 = alloca [3 x {}*], align 16
%gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
%3 = bitcast [3 x {}*]* %gcframe2 to i8*
call void @llvm.memset.p0i8.i32(i8* nonnull align 16 dereferenceable(24) %3, i8 0, i32 24, i1 false)
%4 = alloca {}**, align 8
store volatile {}** %1, {}*** %4, align 8
%thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #6
%ptls_i8 = getelementptr i8, i8* %thread_ptr, i64 -32768
%5 = bitcast [3 x {}*]* %gcframe2 to i64*
store i64 4, i64* %5, align 16
%6 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
%7 = bitcast i8* %ptls_i8 to i64*
%8 = load i64, i64* %7, align 8
%9 = bitcast {}** %6 to i64*
store i64 %8, i64* %9, align 8
%10 = bitcast i8* %ptls_i8 to {}***
store {}** %gcframe2.sub, {}*** %10, align 8
%11 = bitcast {}** %1 to {}****
%12 = load {}***, {}**** %11, align 8
; @ REPL[23]:2 within `foo'
; β @ array.jl:801 within `getindex'
%13 = load {}**, {}*** %12, align 8
%14 = load atomic {}*, {}** %13 unordered, align 8
%.not = icmp eq {}* %14, null
br i1 %.not, label %fail, label %pass
fail: ; preds = %top
call void @jl_throw({}* inttoptr (i64 140358161122320 to {}*))
unreachable
pass: ; preds = %top
%15 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
store {}* %14, {}** %15, align 16
; β
; @ REPL[23]:3 within `foo'
; β @ array.jl:929 within `push!'
; ββ @ array.jl:884 within `_growend!'
call void inttoptr (i64 140358397421246 to void ({}*, i64)*)({}* nonnull %14, i64 1)
; ββ
; β @ array.jl:930 within `push!'
; ββ @ abstractarray.jl:347 within `lastindex'
; βββ @ abstractarray.jl:312 within `eachindex'
; ββββ @ abstractarray.jl:109 within `axes1'
; βββββ @ abstractarray.jl:89 within `axes'
; ββββββ @ array.jl:133 within `size'
%16 = bitcast {}* %14 to {}**
%17 = getelementptr inbounds {}*, {}** %16, i64 3
%18 = bitcast {}** %17 to i64*
%19 = load i64, i64* %18, align 8
; ββββββ
; ββ @ array.jl:839 within `setindex!'
%20 = add nsw i64 %19, -1
%21 = bitcast {}* %14 to { i8*, i64, i16, i16, i32 }*
%22 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %21, i64 0, i32 1
%23 = load i64, i64* %22, align 8
%24 = icmp ult i64 %20, %23
br i1 %24, label %idxend, label %oob
oob: ; preds = %pass
%25 = alloca i64, align 8
store i64 %19, i64* %25, align 8
call void @jl_bounds_error_ints({}* %14, i64* nonnull %25, i64 1)
unreachable
idxend: ; preds = %pass
%26 = bitcast {}* %14 to i64**
%27 = load i64*, i64** %26, align 8
%28 = getelementptr inbounds i64, i64* %27, i64 %20
store i64 1, i64* %28, align 8
%29 = load i64, i64* %9, align 8
store i64 %29, i64* %7, align 8
; ββ
ret {}* %14
}
The julia GC has no read barriers, but it needs a write barrier (to support the case where an old object is mutated to contain a reference to a young object):
julia> bar(a, b) = begin a[]=b; nothing; end;
julia> @code_llvm bar(Ref([1]), [1])
; @ REPL[30]:1 within `bar'
define nonnull {}* @japi1_bar_498({}* %0, {}** %1, i32 %2) #0 {
top:
%3 = alloca {}**, align 8
store volatile {}** %1, {}*** %3, align 8
%4 = load {}*, {}** %1, align 8
%5 = getelementptr inbounds {}*, {}** %1, i64 1
%6 = load {}*, {}** %5, align 8
; β @ refvalue.jl:57 within `setindex!'
; ββ @ Base.jl:34 within `setproperty!'
%7 = bitcast {}* %4 to {}**
store atomic {}* %6, {}** %7 unordered, align 8
%8 = bitcast {}* %4 to i64*
%9 = getelementptr inbounds i64, i64* %8, i64 -1
%10 = load atomic i64, i64* %9 unordered, align 8
%11 = and i64 %10, 3
%12 = icmp eq i64 %11, 3
br i1 %12, label %13, label %20
13: ; preds = %top
%14 = bitcast {}* %6 to i64*
%15 = getelementptr inbounds i64, i64* %14, i64 -1
%16 = load atomic i64, i64* %15 unordered, align 8
%17 = and i64 %16, 1
%18 = icmp eq i64 %17, 0
br i1 %18, label %19, label %20
19: ; preds = %13
call void @jl_gc_queue_root({}* nonnull %4)
br label %20
20: ; preds = %19, %13, %top
; ββ
ret {}* inttoptr (i64 140358107785408 to {}*)
}
Some effort has been spent to improve loop hoisting of the write barrier, but I am in no position to properly evaluate how good we are with this.
There is a known performance pitfall with respect to multithreading and GC: GC can only be started at safepoints (allocations, foreign function calls, explicit safepoints). The shadow stack only needs to be in a correct state at safepoints (this helps performance of loops). Tight optimized loops contain no safepoints.
You see the problem: One thread enters a tight highly optimized long loop that doesnβt hit a safepoint for several seconds, until it has finished munching through gigabytes of computation; in the meantime all other threads stall, waiting for an opportunity where all threads hit a safepoint and GC can start.
edit: improved write barrier example
For future work, it would be super cool to switch to using stack unwinding instead, possibly hook into the register spilling logic, and generally get rid of safepoints (i.e. use an interrupt to stop the world). However, this is unlikely to happen in the next couple of years. Another interesting idea is what the JVM does: implement safepoints by a load instruction that traps if we are supposed to yield to GC (i.e. load from a global address; when we want to stop the world, mark the page as non-readable and wait until all threads have hit our fault handler). This has the advantage that safepoints become very cheap and can be sprinkled into a lot more places.