Tell me about escape analysis and optimizing heap allocations

Trying to read into the recent work on escape analysis a bit more, but I’m having trouble answering a couple of basic questions because my knowledge of this area is very limited. Hopefully people in the know can provide accessible explanations.

  1. One of the stated goals is moving allocations from the heap to the stack. But posts here from 2018-2020, well before the current escape analysis work, already demonstrated isolated cases of non-escaping mutable structs not being allocated on the heap. So what exactly was the state of heap-to-stack(-or-elsewhere) optimizations before and now, and what improvements were or will be enabled by newer escape analysis?

  2. Something I expected to be an obvious goal of finding non-escaping heap allocations is inserting a deallocation at the end of the scope when stack allocation is not feasible for some reason (e.g. a size not known at compile time). But I don’t really see this discussed beyond a few scattered posts here, not even in articles about escape analysis in other languages (Go, Java). It’s extra strange to me because finalizer elision is a stated goal, and I can’t imagine why freeing the memory isn’t included right after running the finalizer.
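For context on question 1, here is a hedged sketch (the struct and function names are made up) of the kind of isolated case those older posts demonstrated: a mutable struct that provably never escapes its function, which the compiler can keep off the heap even without the newer escape analysis work:

```julia
# Hypothetical example: a mutable struct that never escapes its function.
mutable struct Counter
    n::Int
end

function sum_to(k)
    c = Counter(0)        # candidate for stack allocation: c never escapes
    for i in 1:k
        c.n += i
    end
    return c.n            # only the Int escapes, not the Counter itself
end
```

On recent Julia versions, `@allocated sum_to(100)` is typically 0 here, which is the kind of isolated heap-to-stack result the pre-escape-analysis posts showed.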


I did an exploration here where I tried to see if I could get a finalizer that frees a pointer to be inlined.

I made some progress: if I do nothing with the smart pointer, it gets deallocated.
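The kind of smart pointer I mean looks roughly like this (a hedged sketch with made-up names, not the exact code from the exploration): a mutable wrapper around a `Libc.malloc`'d pointer whose finalizer performs the `free`:

```julia
# Hypothetical sketch of a smart pointer whose finalizer frees a raw pointer.
mutable struct MallocInt
    ptr::Ptr{Int}
    function MallocInt(v::Int)
        p = convert(Ptr{Int}, Libc.malloc(sizeof(Int)))
        unsafe_store!(p, v)
        obj = new(p)
        finalizer(obj) do o
            Libc.free(o.ptr)   # deallocation happens inside the finalizer
        end
        return obj
    end
end

Base.getindex(m::MallocInt) = unsafe_load(m.ptr)

MallocInt(3)[]   # if the compiler can prove the MallocInt does not escape,
                 # it may be able to inline this finalizer (i.e. the free)
```

If finalizer inlining fires on a non-escaping instance, the `free` effectively becomes a deallocation inserted at the end of the scope, which is exactly the behavior question 2 above is asking about.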


Now I’m a bit more confused, because the linked eager-finalization issue does make it seem like finalization and deallocation happen together in the escape analysis. I was under the impression that they are separate actions (you can manually finalize an object that still has live references, so the GC cannot collect it yet), and that deallocation therefore wasn’t a design priority.
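To illustrate the distinction I mean: `finalize` can be called manually on an object that still has live references, and the object remains usable afterwards, so running the finalizer really is a separate action from freeing the memory. A small sketch:

```julia
# Manual finalization does not deallocate the object.
finalized = Ref(false)
x = Ref(1)
finalizer(_ -> (finalized[] = true), x)

finalize(x)      # runs the finalizer immediately; finalized[] is now true
x[]              # ...but x is still live and usable, its memory is not freed
```

So when the escape-analysis work talks about eager finalization, I'd expect the memory to be reclaimed by the GC as usual afterwards, unless deallocation is explicitly part of the optimization.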

There are several optimizations happening now in Julia 1.9.

julia> function bar(x)
           r = Ref(5)
           r[] += x
           finalizer(x->nothing, r)
           return r[]
       end
bar (generic function with 1 method)

Under Julia 1.8.x, this function would allocate memory.

julia> @code_llvm bar(5)
;  @ REPL[5]:1 within `bar`
define i64 @julia_bar_320(i64 signext %0) #0 {
top:
  %gcframe4 = alloca [3 x {}*], align 16
  %gcframe4.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 0
  %1 = bitcast [3 x {}*]* %gcframe4 to i8*
  call void @llvm.memset.p0i8.i32(i8* noundef nonnull align 16 dereferenceable(24) %1, i8 0, i32 24, i1 false)
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #4
  %ppgcstack_i8 = getelementptr i8, i8* %thread_ptr, i64 -8
  %ppgcstack = bitcast i8* %ppgcstack_i8 to {}****
  %pgcstack = load {}***, {}**** %ppgcstack, align 8
;  @ REPL[5]:2 within `bar`
; ┌ @ refpointer.jl:134 within `Ref`
; │┌ @ refvalue.jl:10 within `RefValue` @ refvalue.jl:8
    %2 = bitcast [3 x {}*]* %gcframe4 to i64*
    store i64 4, i64* %2, align 16
    %3 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 1
    %4 = bitcast {}** %3 to {}***
    %5 = load {}**, {}*** %pgcstack, align 8
    store {}** %5, {}*** %4, align 8
    %6 = bitcast {}*** %pgcstack to {}***
    store {}** %gcframe4.sub, {}*** %6, align 8
    %ptls_field5 = getelementptr inbounds {}**, {}*** %pgcstack, i64 2
    %7 = bitcast {}*** %ptls_field5 to i8**
    %ptls_load67 = load i8*, i8** %7, align 8
    %8 = call noalias nonnull {}* @ijl_gc_pool_alloc(i8* %ptls_load67, i32 1392, i32 16) #5
    %9 = bitcast {}* %8 to i64*
    %10 = getelementptr inbounds i64, i64* %9, i64 -1
    store atomic i64 140139148322928, i64* %10 unordered, align 8
; └└
;  @ REPL[5]:3 within `bar`
; ┌ @ int.jl:87 within `+`
   %11 = add i64 %0, 5
; └
; ┌ @ refvalue.jl:57 within `setindex!`
; │┌ @ Base.jl:39 within `setproperty!`
    store i64 %11, i64* %9, align 8
    %12 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe4, i64 0, i64 2
    store {}* %8, {}** %12, align 16
; └└
;  @ REPL[5]:4 within `bar`
; ┌ @ gcutils.jl:48 within `finalizer`
; │┌ @ boot.jl:364 within `getptls`
    %13 = call i64 inttoptr (i64 140139484240096 to i64 ()*)()
; │└
   call void inttoptr (i64 140139484263040 to void (i64, {}*, {}*)*)(i64 %13, {}* nonnull %8, {}* inttoptr (i64 140139382781504 to {}*))
; └
;  @ REPL[5]:5 within `bar`
; ┌ @ refvalue.jl:56 within `getindex`
; │┌ @ Base.jl:38 within `getproperty`
    %14 = load i64, i64* %9, align 8
    %15 = load {}*, {}** %3, align 8
    %16 = bitcast {}*** %pgcstack to {}**
    store {}* %15, {}** %16, align 8
; └└
  ret i64 %14
}

julia> @time bar(5)
  0.000002 seconds (1 allocation: 16 bytes)
10

Under Julia 1.9.0-beta2 this does not allocate memory, and the generated code is now quite simple. In fact, the finalizer is just ignored.

julia> @code_llvm bar(5)
;  @ REPL[1]:1 within `bar`
define i64 @julia_bar_109(i64 signext %0) #0 {
top:
;  @ REPL[1]:3 within `bar`
; ┌ @ int.jl:87 within `+`
   %1 = add i64 %0, 5
; └
;  @ REPL[1]:5 within `bar`
  ret i64 %1
}

julia> @time bar(5)
  0.000001 seconds
10
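A hypothetical contrast, to check my understanding (this variant is made up, not from the posts above): presumably the optimization only fires because `r` provably does not escape. If the `Ref` escapes, e.g. by being returned, the allocation and the finalizer registration should remain:

```julia
# Variant of bar where the Ref escapes via the return value.
function bar_escape(x)
    r = Ref(5)
    r[] += x
    finalizer(_ -> nothing, r)
    return r     # r escapes, so the allocation and the finalizer
end              # registration cannot be elided
```

Here `@time bar_escape(5)` should still report an allocation even on 1.9, since the caller can observe `r`.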

Additionally, finalizers are now being inlined under some limited circumstances.

julia> global deallocate_count::Int = 0
0

julia> function foo()
           global deallocate_count
           for i in 1:100
               r = Ref(1)
               finalizer(r) do r
                   deallocate_count += 1
               end
           end
           return deallocate_count
       end
foo (generic function with 1 method)

julia> deallocate_count::Int = 0
0

julia> function foo()
           global deallocate_count
           for i in 1:100
               r = Ref(1)
               finalizer(r) do r
                   deallocate_count += 1
               end
           end
           return deallocate_count
       end
foo (generic function with 1 method)

julia> foo()
100

julia> foo()
200

julia> foo()
300

julia> foo()
400

julia> foo()
500

julia> @code_llvm foo()
;  @ REPL[29]:1 within `foo`
define i64 @julia_foo_354() #0 {
top:
;  @ REPL[29]:3 within `foo`
  br label %L2

L2:                                               ; preds = %14, %top
  %value_phi = phi i64 [ 1, %top ], [ %15, %14 ]
;  @ REPL[29]:5 within `foo`
; ┌ @ gcutils.jl:56 within `finalizer`
; │┌ @ REPL[29]:6 within `#29`
    %0 = load atomic i64*, i64** inttoptr (i64 140167192958296 to i64**) unordered, align 8
; ││┌ @ int.jl:87 within `+`
     %1 = load i64, i64* %0, align 8
     %2 = add i64 %1, 1
; ││└
    %3 = call nonnull {}* @ijl_box_int64(i64 signext %2)
    store atomic {}* %3, {}** inttoptr (i64 140167192958296 to {}**) release, align 8
    %4 = load atomic i64, i64* getelementptr inbounds (i64, i64* inttoptr (i64 140167192958288 to i64*), i64 -1) unordered, align 8
    %5 = and i64 %4, 3
    %6 = icmp eq i64 %5, 3
    br i1 %6, label %7, label %14

7:                                                ; preds = %L2
    %8 = bitcast {}* %3 to i64*
    %9 = getelementptr inbounds i64, i64* %8, i64 -1
    %10 = load atomic i64, i64* %9 unordered, align 8
    %11 = and i64 %10, 1
    %12 = icmp eq i64 %11, 0
    br i1 %12, label %13, label %14

13:                                               ; preds = %7
    call void @jl_gc_queue_binding({}* inttoptr (i64 140167192958288 to {}*))
    br label %14

14:                                               ; preds = %13, %7, %L2
; └└
;  @ REPL[29]:8 within `foo`
; ┌ @ range.jl:891 within `iterate`
; │┌ @ promotion.jl:499 within `==`
    %.not = icmp eq i64 %value_phi, 100
; │└
   %15 = add nuw nsw i64 %value_phi, 1
; └
  br i1 %.not, label %L16, label %L2

L16:                                              ; preds = %14
;  @ REPL[29]:9 within `foo`
  %16 = load atomic i64*, i64** inttoptr (i64 140167192958296 to i64**) unordered, align 8
  %17 = load i64, i64* %16, align 8
  ret i64 %17
}

My sense of the current situation is that Julia is just getting started with finalizer inlining. Escape analysis and finalizer inlining have so far been applied only to the easiest-to-prove cases.
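For instance (a made-up illustration, not a tested boundary of the analysis), letting the `Ref` escape the loop body should defeat the inlining, since the finalizer's run point can no longer be tied to a scope:

```julia
# Variant of foo where each Ref escapes into a global array.
deallocate_count::Int = 0            # same counter as in foo() above
const kept = Base.RefValue{Int}[]    # hypothetical escape route

function foo_escaping()
    global deallocate_count
    for i in 1:100
        r = Ref(1)
        finalizer(r) do _
            deallocate_count += 1
        end
        push!(kept, r)   # r escapes, so its finalizer cannot be inlined
    end                  # and must be left to the GC
    return length(kept)
end
```

Here I'd expect `deallocate_count` to stay at 0 until the GC actually collects the entries of `kept`, rather than being incremented eagerly inside the loop as in the inlined case above.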