The output is more readable with Intel syntax and the debug info removed:
julia> @cn h(1)
.text
lea rax, [rdi + 2*rdi]
add rax, rax
add rax, 4
ret
nop dword ptr [rax]
We can see that it first calculates rdi + 2*rdi (== 3*rdi) and assigns it to rax, then adds rax to itself (same as *2), and finally adds 4.
This is exactly what we would have gotten from f(x) = 6x+4 (i.e., the compiler would split the *6 into lea and add).
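To check this yourself without the `@cn` shorthand, something like the following should print a comparable listing. This is a minimal sketch: it assumes `f` is defined as in the sentence above and uses the `syntax` and `debuginfo` keywords of `code_native` from the standard `InteractiveUtils` library. The exact instructions depend on your CPU and Julia/LLVM version.

```julia
using InteractiveUtils  # loaded automatically in the REPL, needed in scripts

# The reference function mentioned above.
f(x) = 6x + 4

# Capture the native code for f(::Int) as a string, using Intel syntax
# and dropping the debug-info comments.
asm = sprint(io -> code_native(io, f, (Int,); syntax = :intel, debuginfo = :none))

print(asm)
```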
This differs from what you reported, where the last two adds seem to have been replaced with
lea rax, [rax + rax + 4]
That does the same thing, but uses one fewer instruction.
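As a quick sanity check, both instruction sequences can be emulated step by step in Julia and compared against 6x+4. The function names `two_adds` and `single_lea` are just made up for this illustration; each line mirrors one instruction.

```julia
# Emulate the sequence with two adds.
function two_adds(rdi::Int64)
    rax = rdi + 2 * rdi   # lea rax, [rdi + 2*rdi]
    rax = rax + rax       # add rax, rax
    rax = rax + 4         # add rax, 4
    return rax
end

# Emulate the variant that folds both adds into a single lea.
function single_lea(rdi::Int64)
    rax = rdi + 2 * rdi   # lea rax, [rdi + 2*rdi]
    rax = rax + rax + 4   # lea rax, [rax + rax + 4]
    return rax
end

f(x) = 6x + 4

# Int64 arithmetic wraps on overflow just like the registers do,
# so the results agree even for typemax(Int64).
for x in Int64[-3, 0, 1, 7, typemax(Int64)]
    @assert two_adds(x) == single_lea(x) == f(x)
end
```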
I tried a couple of different Julia+LLVM versions and got the two adds each time, so I don’t think this difference is due to the LLVM version.
I assume this was on your Zen2 computer?
According to Agner Fog’s instruction tables, lea is faster on Zen2 than it is on Skylake(-X), so LLVM picked the version that is fastest for each of our CPUs.
Reciprocal throughputs (lower is better):
| Instruction | Zen2 | Skylake-X |
|---|---|---|
| lea-2 | 1/4 | 1/2 |
| lea-3 | 1/4 | 1 |
| add | 1/3 | 1/4 |
The N in lea-N is the number of arguments. So lea rax, [rax + rax + 4] would be lea-3, which has four times the throughput on Zen2 (1/4) compared with Skylake-X (1).
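To make the comparison concrete, here is a rough back-of-the-envelope calculation using only the numbers from the table. It naively sums reciprocal throughputs for the part that differs between the two versions (the first lea is common to both). That ignores latency, the dependency chain, and port contention, so treat it as an illustration rather than a performance model. The helper name `tail_cost` is made up for this sketch.

```julia
# Reciprocal throughputs copied from the table above, as exact rationals.
const RTHROUGHPUT = Dict(
    "Zen2"      => Dict("lea-2" => 1//4, "lea-3" => 1//4, "add" => 1//3),
    "Skylake-X" => Dict("lea-2" => 1//2, "lea-3" => 1//1, "add" => 1//4),
)

# Naive cost of the two alternatives for the last step(s) on a given CPU.
function tail_cost(cpu::AbstractString)
    t = RTHROUGHPUT[cpu]
    two_adds   = t["add"] + t["add"]  # add rax, rax ; add rax, 4
    single_lea = t["lea-3"]           # lea rax, [rax + rax + 4]
    return (two_adds = two_adds, single_lea = single_lea)
end

for cpu in ("Zen2", "Skylake-X")
    c = tail_cost(cpu)
    println(cpu, ": two adds = ", c.two_adds, ", one lea-3 = ", c.single_lea)
end
```

With these numbers the single lea comes out ahead on Zen2 (1/4 vs 2/3), while the two adds come out ahead on Skylake-X (1/2 vs 1), which matches what each of us saw.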