CAS benchmarks (Symbolics.jl and Maxima)

It's more readable when using Intel syntax and removing debug info:

julia> @cn h(1)
        .text
        lea     rax, [rdi + 2*rdi]
        add     rax, rax
        add     rax, 4
        ret
        nop     dword ptr [rax]

We can see that it first calculates rdi + 2*rdi (== 3*rdi) and assigns it to rax, then adds rax to itself (same as *2), and finally adds 4.
This is exactly what we would have gotten from f(x) = 6x+4 (i.e., the compiler splits the *6 into an lea and an add).
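For anyone following along at home, here is a minimal sketch of how to reproduce this, assuming @cn is just a convenience wrapper around @code_native with Intel syntax and debug info stripped (the actual definition used in this thread may differ):

    using InteractiveUtils  # provides @code_native (loaded by default in the REPL)

    # Hypothetical shorthand: print native code in Intel syntax without debug info.
    macro cn(ex)
        esc(:(@code_native syntax=:intel debuginfo=:none $ex))
    end

    f(x) = 6x + 4   # should compile to essentially the same lea/add sequence as above
    @cn f(1)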
This is different from what you reported, where the last two adds seem to have been replaced with

        lea     rax, [rax + rax + 4]

which does the same thing but uses one fewer instruction.

I tried a couple of different Julia+LLVM versions and got the two adds each time, so I don’t think this difference is due to the LLVM version.
I assume this was on your Zen2 computer?
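(As a side note, you can check which microarchitecture Julia/LLVM detected for your machine via Sys.CPU_NAME; a Zen2 machine typically reports "znver2", and Skylake-X typically reports "skylake-avx512". For example, on Zen2:)

    julia> Sys.CPU_NAME
    "znver2"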

According to Agner Fog’s instruction tables, lea is faster on Zen2 than it is on Skylake(-X), so LLVM picked the version that is fastest for each of our specific CPUs.

Reciprocal throughputs (lower is better):

Instruction   Zen2   Skylake-X
lea-2         1/4    1/2
lea-3         1/4    1
add           1/3    1/4

The N in lea-N indicates the number of arguments, so lea rax, [rax + rax + 4] is lea-3, which is much faster on Zen2 than on Skylake-X.
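To make that concrete, summing the reciprocal throughputs of the two possible instruction sequences (a very crude cost model that ignores latency, dependency chains, and port pressure) shows why each choice makes sense:

    # Numbers from the table above; lower total is better.
    for (cpu, lea2, lea3, add) in (("Zen2", 1/4, 1/4, 1/3), ("Skylake-X", 1/2, 1, 1/4))
        println(cpu, ":  lea-2 + add + add = ", round(lea2 + add + add, digits = 2),
                     ",  lea-2 + lea-3 = ", round(lea2 + lea3, digits = 2))
    end

This gives roughly 0.92 vs 0.5 on Zen2 (so the lea-3 form wins there) and 1.0 vs 1.5 on Skylake-X (so the two adds win there).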
