Trying to understand performance of closure from @code_llvm

Hi,

I am trying to better understand how closures work. I’ve read the section of the manual on the performance of captured variables, and tried inspecting the following functions.

function abmult(r::Int)
    if r < 0
        r = -r
    end
    f = x -> x * r
    return f
end

function abmult2(r::Int)
    f = x -> x * r
    return f
end

function abmult3(r::Int)
    if r < 0
        r = -r
    end
    f = let r = r
        x -> x * r
    end
    return f
end

mul1 = abmult(3)
mul2 = abmult2(3)
mul3 = abmult3(3)

@code_llvm mul1(5)
@code_llvm mul2(5)
@code_llvm mul3(5)

As explained in the manual, @code_llvm of mul1 is a complete mess: r is reassigned before it is captured, so it ends up in a heap-allocated box. On the other hand, mul2 and mul3 have identical @code_llvm:

define i64 @"julia_#65_2320"([1 x i64]* nocapture nonnull readonly dereferenceable(8), i64) {
top:
  %2 = getelementptr inbounds [1 x i64], [1 x i64]* %0, i64 0, i64 0
  %3 = load i64, i64* %2, align 8
  %4 = mul i64 %3, %1
  ret i64 %4
}

This has a lot of keywords I do not understand (nocapture, nonnull, etc.). Naively, I would have thought the output would be identical to the following:

mul4 = x->x*5
@code_llvm debuginfo=:none mul4(3)


define i64 @"julia_#71_2323"(i64) {
top:
  %1 = mul i64 %0, 5
  ret i64 %1
}

since the captured variable can no longer change. On the other hand, despite the extra steps, my crude benchmarks could not detect a meaningful difference between mul2 and mul4:

using BenchmarkTools  # provides @btime

A = rand(1000)
@btime sum($mul1, $A)
@btime sum($mul2, $A)
@btime sum($mul3, $A)
@btime sum($mul4, $A)

  24.046 μs (2999 allocations: 46.86 KiB)
  57.928 ns (0 allocations: 0 bytes)
  57.923 ns (0 allocations: 0 bytes)
  56.711 ns (0 allocations: 0 bytes)

My questions

  1. In very broad strokes, what does the output of @code_llvm mul2(3) mean? What is it doing? (I am very unfamiliar with LLVM IR, and only know the most basic instructions like ret and mul, so a highly dumbed-down version is very welcome.)
  2. It looks like mul2 and mul3 are carrying around an extra variable, even though there is no way to modify it afterward. Why isn’t it simply handled by constant propagation?
  3. Should I expect any performance difference between mul2 and mul4? Should I expect any difference in more complicated cases?
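
To make question 2 more concrete, here is my current mental model, which may well be wrong: a closure like the one returned by abmult2 behaves roughly like a callable struct with one field per captured variable. The name Multiplier below is hypothetical and only for illustration; it is not what Julia actually generates.

```julia
# Hypothetical hand-written stand-in for the closure returned by abmult2.
struct Multiplier
    r::Int              # plays the role of the captured variable
end

# Make instances callable; this plays the role of `x -> x * r`.
(m::Multiplier)(x) = x * m.r

mul2_manual = Multiplier(3)
mul2_manual(5)          # 15
```

If this picture is right, the value of r is a runtime field of the object and not part of its type, which might be why the generated code loads it from memory instead of folding it into a constant.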