How does the compiler decide which allocation is eliminated?

Consider three functions:

foo1()  = (Vector()             ; 0)
foo2()  = (Vector(undef, 2^50)  ; 0)
foo3()  = (Vector(undef, -1)    ; 0)

On the current master, creation of the Vector seems to be eliminated in foo1 and foo2, but not in foo3:

julia> @code_native debuginfo=:none dump_module=false foo1()
        .text
        endbr64
        mov     rax, qword ptr [rcx + 56]
        ret
        nop     dword ptr [rax]

julia> @code_native debuginfo=:none dump_module=false foo2()
        .text
        endbr64
        mov     rax, qword ptr [rcx + 56]
        ret
        nop     dword ptr [rax]

julia> @code_native debuginfo=:none dump_module=false foo3()
        .text
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        movabs  rax, offset jl_alloc_genericmemory
        movabs  rdi, offset jl_system_image_data
        mov     rsi, -1
        call    rax
        xor     eax, eax
        pop     rbp
        ret

How does the compiler decide which code to eliminate here? Thanks for the answer!

I bring up this example because both foo2 and foo3 (edit: should) have apparently observable effects (OOM and ArgumentError, respectively), yet the behaviour is different.

OutOfMemoryError is an error that isn’t modeled (i.e. the 2nd example runs fine if you have enough RAM). The difference between foo2 and foo3 is that foo3 always errors, while foo2 only fails when it happens to run on a computer without good enough specs.
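
One way to see this distinction for yourself (a hedged suggestion, not from the original posts) is to ask the compiler for the effects it infers for each function; the flag to look at is nothrow, which should hold for foo2 (since OOM is not part of the effects model) but not for foo3. Exact output format varies between Julia versions:

julia> Base.infer_effects(foo2)   # expect nothrow: the possible OOM is ignored

julia> Base.infer_effects(foo3)   # expect may-throw: the negative length is modeled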

I see. I guess that is reasonable for OutOfMemoryError. Are there other errors that are not modeled?

A follow-up question: do you know why the vector creation is not eliminated in foo4 here:

(edit: what I mean is, if the allocation can be eliminated without the memory store, how does adding the store prevent elimination?)

julia> foo4() = (x = Vector{Int}(undef, 1); @inbounds x[1] = 1; 0)
foo4 (generic function with 1 method)

julia> @code_native debuginfo=:none dump_module=false foo4()
        .text
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13 + 16]
        mov     rax, qword ptr [rax + 16]
        mov     rax, qword ptr [rax]
        movabs  rax, offset jl_alloc_genericmemory
        movabs  rdi, offset jl_system_image_data
        mov     esi, 1
        call    rax
        mov     rax, qword ptr [rax + 8]
        mov     qword ptr [rax], 1
        xor     eax, eax
        pop     rbp
        ret
        nop     dword ptr [rax]

The only other error Jeff and I can think of that isn’t modeled is StackOverflowError. (The reasoning is that ~any code can theoretically throw one, and stack overflows don’t exist within the Turing-machine model of a machine.)

For your follow-up, see make `memorynew` intrinsic by oscardssmith · Pull Request #55913 · JuliaLang/julia · GitHub, which does make foo4 removable (but isn’t quite able to remove foo5):

function foo5()
    v = Vector{Int}(undef, 2^50)
    v[1] = 5
    v[1]
end
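
On a build that includes that PR (I’m assuming its behaviour here, so treat this as a sketch), a quick sanity check for the removal is to measure allocations directly:

julia> foo4() = (x = Vector{Int}(undef, 1); @inbounds x[1] = 1; 0);

julia> foo4();                # warm up first so compilation isn't counted

julia> @allocated foo4()      # 0 once the allocation is eliminated, nonzero otherwise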

Perhaps also InterruptException, in some sense? Say,

function foo6()
    try
        # Expected to run for a long time
        for i in 1:2^30 end
    catch
        # We hope this runs when Ctrl + C is pressed
        println("Interrupted, printing useful stats: ...")
    end
end

julia> @code_native debuginfo=:none foo6()
        .text
        .file   "foo6"
        .section        .ltext,"axl",@progbits
        .globl  julia_foo6_2370                 # -- Begin function julia_foo6_2370
        .p2align        4, 0x90
        .type   julia_foo6_2370,@function
julia_foo6_2370:                        # @julia_foo6_2370
; Function Signature: foo6()
# %bb.0:                                # %top
        push    rbp
        mov     rbp, rsp
        xor     eax, eax
        pop     rbp
        ret
.Lfunc_end0:

For your followup, see make memorynew intrinsic by oscardssmith · Pull Request #55913 · JuliaLang/julia · GitHub which does make foo4 removable

That is amazing.

Perhaps also InterruptException

yeah, that too.

Where’s the definition of foo5? Curiously, in this thread I only see foo1, foo2, foo3, foo4, and foo6.

See What to do about asynchronous exceptions · Issue #52291 · JuliaLang/julia · GitHub for discussion on asynchronous exceptions.

@giordano How the compiler decides which allocation is eliminated? - #4 by Oscar_Smith has been edited so that the foo4 that was supposed to be foo5 now is.

julia> code_llvm((); debuginfo=:none) do
           v = Vector{Int}(undef, 251)
           v[1] = 5
           v[1]
       end
; Function Signature: var"#110"()
define i64 @"julia_#110_4587"() #0 {
L37:
  ret i64 5
}
julia> code_llvm((); debuginfo=:none) do
           v = Vector{Int}(undef, 252)
           v[1] = 5
           v[1]
       end
; Function Signature: var"#113"()
define i64 @"julia_#113_4592"() #0 {
L37:
  %pgcstack = call ptr inttoptr (i64 4333010700 to ptr)(i64 4333010736) #11
  %ptls_field = getelementptr inbounds i8, ptr %pgcstack, i64 16
  %ptls_load = load ptr, ptr %ptls_field, align 8
  %"Memory{Int64}[]" = call ptr @jl_alloc_genericmemory_unchecked(ptr %ptls_load, i64 2016, ptr nonnull @"+Core.GenericMemory#4594.jit")
  store i64 252, ptr %"Memory{Int64}[]", align 8
  fence syncscope("singlethread") release
  %memory_data_ptr = getelementptr inbounds { i64, ptr }, ptr %"Memory{Int64}[]", i64 0, i32 1
  %memory_data = load ptr, ptr %memory_data_ptr, align 8
  store i64 5, ptr %memory_data, align 8
  ret i64 5
}

What happens at size 252 that makes the allocation stick? It’s also a weird cutoff value: not an exact power of 2, nor adjacent to one (although it’s nearby, for some definition of closeness).

What happens is that the allocation moves from being a pool allocation to a malloc (the weird cutoff is because the number of bytes for the Memory is 16 + length*elsize, i.e. 16 + 8n here). Dealing with that case would have required me to write another ~50 lines of LLVM to model the other allocation path, and I want to get an initial version of this merged and go from there.
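
For concreteness, here is the size arithmetic as I understand it (a sketch; the exact pool cutoff is an implementation detail of the GC and may differ between builds):

# assumed layout: 16-byte Memory header plus 8 bytes per Int64 element
memory_bytes(n) = 16 + 8n

memory_bytes(251)   # 2024 bytes -- still small enough for the GC pool allocator
memory_bytes(252)   # 2032 bytes -- tips over into the malloc path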

A layman question: if I understand correctly, the PR sometimes evaluates the allocation eagerly. To eliminate the allocation and the store in foo4, should there be a more high-level mechanism? Say: since the array is local to that function and there are no loads from it, eliminate the stores (which are dead anyway), and then eliminate the allocation?

The hardest part about this is proving that the array is local (especially because BoundsErrors currently escape the array).
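
A small illustration of that last point (my own sketch, not from the PR): the BoundsError object holds a reference to the offending array, so on the error path the array is reachable from outside the function:

function boundserror_escape_demo()
    x = Vector{Int}(undef, 1)
    try
        x[2]                  # out of bounds: throws BoundsError(x, ...)
    catch err
        return err.a === x    # true -- the exception object captures the array itself
    end
end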

I see, thanks. I guess in-bounds access is in fact proven, but not yet used for optimization? Note the getelementptr inbounds:

julia> function foo8()
           b = Vector{Int}(undef, 3)
           b[2] = 3
           0
       end

julia> @code_llvm debuginfo=:none foo8()
; Function Signature: foo8()
define i64 @julia_foo8_4126() #0 {
L18:
  %"Memory{Int64}[]" = call ptr @jl_alloc_genericmemory(ptr nonnull @"+Core.GenericMemory#4128.jit", i64 3)
  %memory_data_ptr = getelementptr inbounds { i64, ptr }, ptr %"Memory{Int64}[]", i64 0, i32 1
  %memory_data = load ptr, ptr %memory_data_ptr, align 8
  %memoryref_data4 = getelementptr inbounds i8, ptr %memory_data, i64 8
  store i64 3, ptr %memoryref_data4, align 8
  ret i64 0
}

You’re just seeing LLVM performing the proof of in-boundsness. The reason the memorynew PR is needed is that LLVM doesn’t know that jl_alloc_genericmemory doesn’t leak the array it creates, since it’s an arbitrary C function.

I came across something similar with tuples today. Looks like the compiler skips the tuple creation in g1 but not in g2:

g1() = (ntuple(i -> 1, 10); 0)
g2() = (ntuple(i -> 1, 11); 0)

@code_llvm debuginfo = :none g1()
# define i64 @julia_g1_1199() #0 {
# top:
#   ret i64 0
# }
@code_llvm debuginfo = :none g2()
# define i64 @julia_g2_1211() #0 {
# top:
#   %0 = call nonnull {}* @j__ntuple_1213(i64 signext 11)
#   ret i64 0
# }

Take a look at the implementation of ntuple (julia/base/ntuple.jl at v1.11.2 · JuliaLang/julia · GitHub):

@inline function ntuple(f::F, n::Integer) where F
    # marked inline since this benefits from constant propagation of `n`
    t = n == 0  ? () :
        n == 1  ? (f(1),) :
        n == 2  ? (f(1), f(2)) :
        n == 3  ? (f(1), f(2), f(3)) :
        n == 4  ? (f(1), f(2), f(3), f(4)) :
        n == 5  ? (f(1), f(2), f(3), f(4), f(5)) :
        n == 6  ? (f(1), f(2), f(3), f(4), f(5), f(6)) :
        n == 7  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7)) :
        n == 8  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8)) :
        n == 9  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8), f(9)) :
        n == 10 ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8), f(9), f(10)) :
        _ntuple(f, n)
    return t
end

function _ntuple(f::F, n) where F
    @noinline
    (n >= 0) || throw(ArgumentError(LazyString("tuple length should be ≥ 0, got ", n)))
    ([f(i) for i = 1:n]...,)
end

The outer function will be inlined, and since n is a compile-time constant the branches are eliminated. For n <= 10 that only leaves a hardcoded tuple instantiation, which is clearly side-effect-free and can be eliminated. For n > 10 what’s left is a call to a type-unstable inner function that allocates an intermediate array and has a branch that throws. So I think your observation is explained by the completely different code you end up with in the two cases, rather than by general compiler heuristics regarding tuples.
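
If you want to confirm this yourself (a hedged suggestion; the printed IR differs between Julia versions), comparing the typed IR makes the difference visible:

julia> @code_typed g1()   # the n <= 10 branches constant-fold away; nothing effectful remains

julia> @code_typed g2()   # a call to the @noinline Base._ntuple survives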

Makes sense. I guess the compiler can’t get rid of the tuple creation because it could throw an error. If you lie and tell it that creating the tuple never throws an error, it seems to get rid of it and any accompanying intermediate array creation.

function f()
    Base.@assume_effects :nothrow g() = ntuple(i -> 1, 11)
    g()
    return 0
end
@code_llvm debuginfo = :none f()
# define i64 @julia_f_884() #0 {
# top:
#   ret i64 0
# }

If you use ntuple(f, Val(N)) instead, then it’ll compile away for any(?) N.

julia> g3() = (ntuple(i -> 1, Val(11)); 0);

julia> @code_llvm g3()
; Function Signature: g3()
;  @ REPL[27]:1 within `g3`
define i64 @julia_g3_20024() #0 {
top:
  ret i64 0
}
julia> g4() = (ntuple(i -> 1, Val(111)); 0);

julia> @code_llvm g4()
; Function Signature: g4()
;  @ REPL[29]:1 within `g4`
define i64 @julia_g4_20028() #0 {
top:
  ret i64 0
}