Trying to understand the performance of closures from @code_llvm

Hi,

I am trying to better understand how closures work. I've read the performance of captured variable section of the manual and tried inspecting the following functions.

function abmult(r::Int)
    if r < 0
        r = -r
    end
    f = x -> x * r
    return f
end

function abmult2(r::Int)
    f = x -> x * r
    return f
end

function abmult3(r::Int)
    if r < 0
        r = -r
    end
    f = let r = r
        x -> x * r
    end
    return f
end

mul1 = abmult(3)
mul2 = abmult2(3)
mul3 = abmult3(3)

@code_llvm mul1(5)
@code_llvm mul2(5)
@code_llvm mul3(5)

As explained in the manual, the @code_llvm output for mul1 is a complete mess: because r is reassigned inside abmult, the parser cannot prove that the captured variable stays constant, so it boxes r. On the other hand, mul2 and mul3 produce identical @code_llvm:

define i64 @"julia_#65_2320"([1 x i64]* nocapture nonnull readonly dereferenceable(8), i64) {
top:
  %2 = getelementptr inbounds [1 x i64], [1 x i64]* %0, i64 0, i64 0
  %3 = load i64, i64* %2, align 8
  %4 = mul i64 %3, %1
  ret i64 %4
}

This has a lot of keywords I do not understand (nocapture, nonnull, etc.). Naively, I would have thought the output would be identical to the following:

mul4 = x->x*5
@code_llvm debuginfo=:none mul4(3)


define i64 @"julia_#71_2323"(i64) {
top:
  %1 = mul i64 %0, 5
  ret i64 %1
}

since the captured variable can no longer change. On the other hand, despite the extra steps, my crude benchmarks could not detect a meaningful difference between mul2 and mul4:

using BenchmarkTools

A = rand(1000)
@btime sum($mul1, $A)
@btime sum($mul2, $A)
@btime sum($mul3, $A)
@btime sum($mul4, $A)

  24.046 μs (2999 allocations: 46.86 KiB)
  57.928 ns (0 allocations: 0 bytes)
  57.923 ns (0 allocations: 0 bytes)
  56.711 ns (0 allocations: 0 bytes)
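
(The allocations for mul1 presumably come from the boxed capture; if I understand the manual correctly, @code_warntype should flag it:)

@code_warntype abmult(3)   # I'd expect r to show up as Core.Box here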

My questions:

  1. In very broad strokes, what does the output of @code_llvm mul2(5) mean? What is it doing? (I am very unfamiliar with LLVM IR and only know the most basic instructions like ret and mul, so a highly dumbed-down explanation is very welcome.)
  2. It looks like mul2 and mul3 are carrying around an extra variable, even though there is no way to modify it afterward. Why doesn't it simply get folded away by constant propagation?
  3. Should I expect any performance difference between mul2 and mul4? Should I expect a difference in more complicated cases?

Specializing * on a constant doesn't help much for most values, so the difference between loading the multiplier from the closure (mul2) and having it available as a compile-time constant (mul4) isn't that meaningful.
Integer division is likely to show a bigger difference, because LLVM can replace division by a known constant with a cheaper multiply-and-shift sequence.
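
As a rough sketch of that point (divmaker, divclosure, and divconst are made-up names for illustration), comparing a divisor read from a closure with a literal divisor should make the difference visible:

divmaker(r::Int) = x -> x ÷ r        # divisor is read out of the closure at run time
divclosure = divmaker(3)
divconst = x -> x ÷ 3                # divisor is a compile-time constant

@code_llvm debuginfo=:none divclosure(100)  # I'd expect an actual division instruction here
@code_llvm debuginfo=:none divconst(100)    # I'd expect multiply/shift, no division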

How is the constant 3 supposed to propagate from mul2 = abmult2(3) to where you use it later?
mul2 would need a unique type that depends on the value 3 for that, but the closure's type only records that it captures an Int, not which Int.

In actual code, it'll propagate when it makes sense, i.e. when the compiler can see both the construction of the closure and the call to it:

julia> f(x) = abmult2(3)(x)
f (generic function with 1 method)

julia> @code_llvm f(5)
;  @ REPL[16]:1 within `f`
define i64 @julia_f_611(i64 signext %0) #0 {
top:
; ┌ @ REPL[2]:2 within `#3`
; │┌ @ int.jl:88 within `*`
    %1 = mul i64 %0, 3
    ret i64 %1
; └└
}
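
If you really did want the value carried in the type, a small functor would do it (MulBy is a made-up name, just a sketch of the idea):

struct MulBy{N} end                  # the multiplier N lives in the type itself
(::MulBy{N})(x) where {N} = x * N

mul5 = MulBy{3}()
@code_llvm debuginfo=:none mul5(5)   # should multiply by an immediate 3, with no load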

EDIT: oops, I just saw the topic is four years old!
I'm not guilty of bumping it; someone else edited the opening post, which bumped the thread.