More readable if using Intel syntax and removing debug info:
julia> @cn h(1)
.text
lea rax, [rdi + 2*rdi]
add rax, rax
add rax, 4
ret
nop dword ptr [rax]
We can see that it first calculates `rdi + 2*rdi` (== `3*rdi`), assigning it to `rax`, which it then adds to itself (same as `*2`), and finally adds `4`.
This is exactly what we would have gotten from `f(x) = 6x+4` (i.e., the compiler would split the `*6` into `lea` and `add`).
This is different from what you reported, which seems to have replaced the last two `add`s with
lea rax, [rax + rax + 4]
which does the same thing, but uses one fewer instruction.
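To make the equivalence concrete, here is a minimal sketch that mirrors both instruction sequences in plain Julia, one statement per instruction. The function names `two_adds` and `one_lea` are made up for illustration, and the variable `rax` just stands in for the register.

```julia
# Sequence from the listing above: lea, add, add.
function two_adds(x::Int64)
    rax = x + 2x        # lea rax, [rdi + 2*rdi]
    rax += rax          # add rax, rax
    rax += 4            # add rax, 4
    return rax
end

# Reported variant: the two adds folded into a single lea.
function one_lea(x::Int64)
    rax = x + 2x            # lea rax, [rdi + 2*rdi]
    rax = rax + rax + 4     # lea rax, [rax + rax + 4]
    return rax
end

f(x) = 6x + 4

# Both sequences agree with 6x + 4 on a range of inputs.
@assert all(two_adds(x) == one_lea(x) == f(x) for x in -1000:1000)
```

Both versions compute the same value; they differ only in how many instructions the final step takes.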
I tried a couple of different Julia+LLVM versions and got the two `add`s each time, so I don’t think this difference is because of the LLVM version.
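For anyone who wants to compare on their own machine, a sketch along these lines should print a similar listing. It uses the standard `code_native` function rather than the `@cn` shorthand from the listing, and the definition of `h` here is only a hypothetical stand-in that reduces to `6x + 4`, since the real definition is not shown in this post.

```julia
using InteractiveUtils  # provides code_native outside the REPL

# Hypothetical stand-in for `h`; the actual definition is not shown here.
h(x) = 2 * (3x + 2)

# Print the native code for h(::Int) in Intel syntax, without debug-info comments.
code_native(stdout, h, (Int,); syntax = :intel, debuginfo = :none)
```

The exact instructions printed will depend on the CPU and on the Julia and LLVM versions in use.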
I assume this was on your Zen2 computer?
Checking Agner Fog’s instruction tables, `lea` is faster on Zen2 than it is on Skylake(-X), so LLVM picked the version that is fastest for each of our specific CPUs.
Reciprocal throughputs, in clock cycles per instruction (lower is better):
Instruction | Zen2 | Skylake-X |
---|---|---|
lea-2 | 1/4 | 1/2 |
lea-3 | 1/4 | 1 |
add | 1/3 | 1/4 |
The `N` in `lea-N` means how many arguments (terms in the address expression) it has. So `lea rax, [rax + rax + 4]` would be `lea-3`, which would be much faster on Zen2 than on Skylake-X.
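As a rough back-of-the-envelope comparison, the table values can be summed for the part of the code that differs (the two `add`s versus the single `lea-3`). This is only a crude sketch: adding reciprocal throughputs ignores latency, port contention, and the dependency between the two `add`s, and the names `RTHRU` and `cost` are made up for illustration.

```julia
# Reciprocal throughputs (cycles per instruction), copied from the table above.
const RTHRU = Dict(
    "Zen2"      => Dict("lea-3" => 1//4, "add" => 1//3),
    "Skylake-X" => Dict("lea-3" => 1//1, "add" => 1//4),
)

# Crude cost model (an assumption): sum the reciprocal throughputs.
cost(cpu, instrs) = sum(RTHRU[cpu][i] for i in instrs)

for cpu in ("Zen2", "Skylake-X")
    println(cpu, ": add+add = ", cost(cpu, ["add", "add"]),
            ", lea-3 = ", cost(cpu, ["lea-3"]))
end
```

Under this simple model the single `lea-3` comes out ahead on Zen2 (1/4 vs 2/3), while the two `add`s come out ahead on Skylake-X (1/2 vs 1), which matches the choice seen on each machine.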