Trying to understand performance of closure from @code_llvm

tomohiro_soejima · October 31, 2020, 11:01am

Hi,

I am trying to better understand how closures work. I’ve read the section on captured variables in the manual, and tried inspecting the following functions.

function abmult(r::Int)
    if r < 0
        r = -r
    end
    f = x -> x * r
    return f
end

function abmult2(r::Int)
    f = x -> x * r
    return f
end

function abmult3(r::Int)
    if r < 0
        r = -r
    end
    f = let r = r
        x -> x * r
    end
    return f
end

mul1 = abmult(3)
mul2 = abmult2(3)
mul3 = abmult3(3)

@code_llvm mul1(5)
@code_llvm mul2(5)
@code_llvm mul3(5)

As explained in the manual, the @code_llvm output for mul1 is a complete mess, because the parser cannot tell that r is never reassigned after the closure is created. On the other hand, mul2 and mul3 produce identical @code_llvm output:

define i64 @"julia_#65_2320"([1 x i64]* nocapture nonnull readonly dereferenceable(8), i64) {
top:
  %2 = getelementptr inbounds [1 x i64], [1 x i64]* %0, i64 0, i64 0
  %3 = load i64, i64* %2, align 8
  %4 = mul i64 %3, %1
  ret i64 %4
}

This has a lot of keywords I do not understand (nocapture, nonnull, etc.). Naively, I would have thought the output would be identical to the following:

mul4 = x->x*5
@code_llvm debuginfo=:none mul4(3)


define i64 @"julia_#71_2323"(i64) {
top:
  %1 = mul i64 %0, 5
  ret i64 %1
}

since the captured variable can no longer change. On the other hand, despite the extra steps, my crude benchmarks could not detect a meaningful difference between mul2 and mul4:

using BenchmarkTools  # provides @btime

A = rand(1000)
@btime sum($mul1, $A)
@btime sum($mul2, $A)
@btime sum($mul3, $A)
@btime sum($mul4, $A)

  24.046 μs (2999 allocations: 46.86 KiB)
  57.928 ns (0 allocations: 0 bytes)
  57.923 ns (0 allocations: 0 bytes)
  56.711 ns (0 allocations: 0 bytes)
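
One way to see where the allocations for mul1 come from is to look at the fields each closure stores. The sketch below repeats two of the definitions above so that it runs on its own. Core.Box is an internal type, so treat the check as illustrative rather than as a stable API.

```julia
# Same definitions as earlier in this post, repeated so the snippet is self-contained.
function abmult(r::Int)
    if r < 0
        r = -r          # r is reassigned, so the closure below captures it in a box
    end
    f = x -> x * r
    return f
end

function abmult3(r::Int)
    if r < 0
        r = -r
    end
    f = let r = r       # fresh binding that is never reassigned
        x -> x * r
    end
    return f
end

mul1 = abmult(3)
mul3 = abmult3(3)

# Field types of the two closure objects.
println(fieldtypes(typeof(mul1)))   # expected to show Core.Box
println(fieldtypes(typeof(mul3)))   # expected to show the concrete integer type

# A boxed capture is not a plain-bits value; the let-bound capture is.
println(isbits(mul1), " ", isbits(mul3))
```

Both closures still return the same results; the difference is only in how r is stored, which is consistent with the extra allocations in the mul1 benchmark.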

My questions

1. In very broad strokes, what does the output of @code_llvm mul2(3) mean? What is it doing? (I am very unfamiliar with LLVM IR and only know the most basic instructions like ret and mul, so a highly dumbed-down version is very welcome.)
2. It looks like mul2 and mul3 are carrying around an extra variable, even though there is no way to modify it afterward. Why doesn’t it just get handled by constant propagation?
3. Should I expect any performance difference between mul2 and mul4? Should I expect any difference in more complicated cases?

Elrod · August 26, 2024, 8:15pm

Specializing * on a constant doesn’t help much for most values, so the difference between loading data from the closure itself (mul2) vs the data section (mul4) isn’t that meaningful.
Doing integer divisions is likely to result in a bigger difference.
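
A minimal sketch of the division case, assuming a helper named abdiv (hypothetical, modeled on abmult2 from the first post). With a literal divisor the compiler knows the divisor is nonzero and not -1, so the divide-error checks can be dropped and the backend can typically replace the division with cheaper instructions. With a captured divisor, neither is known when the closure body is compiled on its own.

```julia
using InteractiveUtils  # for @code_llvm outside the REPL

# Hypothetical helper: same shape as abmult2, but with integer division.
function abdiv(r::Int)
    f = x -> x ÷ r
    return f
end

div_closure = abdiv(5)      # divisor is loaded from the closure at run time
div_literal = x -> x ÷ 5    # divisor is a literal visible to the compiler

# Both compute the same thing; only the generated code differs.
@assert div_closure(17) == div_literal(17)

# Compare the generated IR for the two versions.
@code_llvm debuginfo=:none div_closure(17)
@code_llvm debuginfo=:none div_literal(17)
```

Benchmarking the two with sum over an integer array, as in the first post, is the way to check how much this matters on a given machine.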

Elrod · August 26, 2024, 8:17pm

How is the constant 3 supposed to propagate from mul2 = abmult2(3) to where you use it later?
mul2 would need a unique type that depends on 3 for that.
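
As a rough mental model (not the compiler's literal output), a closure like mul2 behaves like a callable struct with one Int field, which is why its code loads a value and then multiplies. The sketch below contrasts that with a type that carries the multiplier as a type parameter. MulField and MulStatic are hypothetical names introduced only for illustration.

```julia
function abmult2(r::Int)
    f = x -> x * r
    return f
end

# Rough hand-written equivalent of the closure: the multiplier is a field,
# so calling the object reads f.r and then multiplies.
struct MulField
    r::Int
end
(f::MulField)(x) = x * f.r

# Alternative where the multiplier is part of the type itself, so every
# value of R gets its own type and its own compiled method.
struct MulStatic{R} end
(::MulStatic{R})(x) where {R} = x * R

# All closures returned by abmult2 share one type; only the stored field differs.
println(typeof(abmult2(3)) === typeof(abmult2(4)))
println(fieldnames(typeof(abmult2(3))))

# The type-parameter version stores no data at all.
println(sizeof(MulField(3)), " ", sizeof(MulStatic{3}()))
println(MulField(3)(5), " ", MulStatic{3}()(5), " ", abmult2(3)(5))
```

The type-parameter version lets the compiler see the constant everywhere, at the cost of compiling a new specialization for each distinct multiplier, so it is a trade-off rather than a free win.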

In actual code, it’ll propagate when it makes sense.

julia> f(x) = abmult2(3)(x)
f (generic function with 1 method)

julia> @code_llvm f(5)
;  @ REPL[16]:1 within `f`
define i64 @julia_f_611(i64 signext %0) #0 {
top:
; ┌ @ REPL[2]:2 within `#3`
; │┌ @ int.jl:88 within `*`
    %1 = mul i64 %0, 3
    ret i64 %1
; └└
}

EDIT: oops, just saw the topic is 4 years old!
I’m not guilty of bumping it. Someone else edited the open post, bumping the thread.
