Hmm, I think LLVM is just calling the system libraries, but with a faster calling convention, using a jmp instead of a call:
# julia> @code_native syntax=:intel debuginfo=:none llvmexp(1.2)
.text
movabs rax, offset exp
jmp rax
nop dword ptr [rax]
# julia> @code_native syntax=:intel debuginfo=:none cexp(1.2)
.text
push rax
movabs rax, offset exp
call rax
pop rax
ret
nop
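For reference, here is roughly how I'd expect those wrappers to be defined (the names `cexp`, `clog`, `llvmexp`, and `llvmlog` come from earlier in the thread; these particular definitions are my assumption, not the exact code):

```julia
# Assumed definitions for the benchmarked wrappers (a sketch, not the thread's
# exact code): the `c*` versions ccall the system libm directly, while the
# `llvm*` versions go through LLVM intrinsics, which LLVM then lowers to the
# tail-call (`jmp`) into libm shown in the assembly above.
cexp(x::Float64)    = @ccall exp(x::Float64)::Float64
clog(x::Float64)    = @ccall log(x::Float64)::Float64
llvmexp(x::Float64) = ccall("llvm.exp.f64", llvmcall, Float64, (Float64,), x)
llvmlog(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64,), x)
```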
My earlier tests were on a 7900X (Intel Skylake-X CPU) running Clear Linux.
So it seems Apple has faster system log and exp than Clear Linux.
I just ran them on a 7980XE CPU running Arch Linux (same generation of CPU, just a different model with more cores):
julia> @btime log($(Ref(1.2))[])
5.424 ns (0 allocations: 0 bytes)
0.1823215567939546
julia> @btime clog($(Ref(1.2))[])
8.146 ns (0 allocations: 0 bytes)
0.1823215567939546
julia> @btime llvmlog($(Ref(1.2))[])
7.690 ns (0 allocations: 0 bytes)
0.1823215567939546
julia> @btime exp($(Ref(1.2))[])
5.154 ns (0 allocations: 0 bytes)
3.3201169227365472
julia> @btime cexp($(Ref(1.2))[])
22.318 ns (0 allocations: 0 bytes)
3.3201169227365472
julia> @btime llvmexp($(Ref(1.2))[])
22.315 ns (0 allocations: 0 bytes)
3.3201169227365472
Julia timings are consistent across all three computers and OSes, but Arch’s exp is over 4x slower than Apple’s?
I think Clear Linux beats Arch here because on Clear Linux glibc detects hardware features at startup so it can run hardware-specific optimized versions. I’m guessing – from those benchmarks – that Arch does not do this.
Also guessing that Apple does something similar, but apparently better optimized.
It’s also a little funny how slow this ccall is.
On Linux, folks can try
using Libdl
const LIBMVEC = find_library(["libmvec.so"], ["/usr/lib64/", "/usr/lib", "/lib/x86_64-linux-gnu"])
run(pipeline(`nm -D $LIBMVEC`, `grep exp`)) # check function names
On Apple, I’d look through the AppleAccelerate libraries to see what you can find.
I get:
julia> run(pipeline(`nm -D $LIBMVEC`, `grep exp`))
U exp@GLIBC_2.29
U expf@GLIBC_2.27
0000000000002b80 i _ZGVbN2v_exp@@GLIBC_2.22
0000000000002c80 i _ZGVbN4v_expf@@GLIBC_2.22
0000000000002bb0 T _ZGVcN4v_exp@@GLIBC_2.22
0000000000002cb0 T _ZGVcN8v_expf@@GLIBC_2.22
0000000000002c00 i _ZGVdN4v_exp@@GLIBC_2.22
0000000000002d00 i _ZGVdN8v_expf@@GLIBC_2.22
0000000000002d30 i _ZGVeN16v_expf@@GLIBC_2.22
0000000000002c30 i _ZGVeN8v_exp@@GLIBC_2.22
Base.ProcessChain(Base.Process[Process(`nm -D /usr/lib64/libmvec.so`, ProcessExited(0)), Process(`grep exp`, ProcessExited(0))], Base.DevNull(), Base.DevNull(), Base.DevNull())
_ZGVeN8v_exp is for 8x Float64 and requires AVX-512, which this computer has.
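Those symbol names follow the x86-64 vector function ABI's mangling scheme, as I read it: `_ZGV`, then an ISA letter ('b' = SSE, 'c' = AVX, 'd' = AVX2, 'e' = AVX-512), 'N' for unmasked, the vector length, and one 'v' per vector argument. A toy helper to build these names:

```julia
# Builds a libmvec-style mangled name for an unmasked vector variant.
# ISA letters: 'b' = SSE, 'c' = AVX, 'd' = AVX2, 'e' = AVX-512 (my reading
# of the x86-64 vector function ABI; check the spec before relying on it).
mangle(isa::Char, vlen::Int, name::String) = "_ZGV$(isa)N$(vlen)v_$name"

mangle('e', 8, "exp")    # 8x Float64, AVX-512
mangle('b', 4, "expf")   # 4x Float32, SSE
```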
const SIMDVec{W,T} = NTuple{W,Core.VecElement{T}}
vexp(x::SIMDVec{8,Float64}) = @ccall LIBMVEC._ZGVeN8v_exp(x::SIMDVec{8,Float64})::SIMDVec{8,Float64}
t = ntuple(_ -> randn(), Val(8))
vet = map(Core.VecElement, t);
exp.(t)
map(x -> x.value, vexp(vet))
@btime exp.($(Ref(t))[])
@btime vexp($(Ref(vet))[])
I get
julia> exp.(t)
(7.063262403134325, 0.4176291014524549, 1.4764646883598358, 0.48878815987898877, 0.15684534932764455, 7.226086027587759, 2.3485301741066205, 5.409615070945328)
julia> map(x -> x.value, vexp(vet))
(7.063262403134324, 0.41762910145245496, 1.4764646883598358, 0.48878815987898866, 0.15684534932764457, 7.226086027587759, 2.348530174106621, 5.409615070945328)
julia> @btime exp.($(Ref(t))[])
44.465 ns (0 allocations: 0 bytes)
(7.063262403134325, 0.4176291014524549, 1.4764646883598358, 0.48878815987898877, 0.15684534932764455, 7.226086027587759, 2.3485301741066205, 5.409615070945328)
julia> @btime vexp($(Ref(vet))[])
5.029 ns (0 allocations: 0 bytes)
(VecElement{Float64}(7.063262403134324), VecElement{Float64}(0.41762910145245496), VecElement{Float64}(1.4764646883598358), VecElement{Float64}(0.48878815987898866), VecElement{Float64}(0.15684534932764457), VecElement{Float64}(7.226086027587759), VecElement{Float64}(2.348530174106621), VecElement{Float64}(5.409615070945328))
So 5.029 ns for GLIBC to calculate 8 exps when I specifically call the AVX-512 version, but (on Arch Linux) by default it’ll call some slow generic version that takes over 4 times longer to calculate just a single exponential. Ha.
GLIBC can have fast implementations all it wants, but they don’t do any good if they don’t get used. =/
Julia using its own libraries provides some consistency, in particular helping performance for some folks (like those on Arch), and also making sure everyone’s implementations are held to roughly the same accuracy standard.
Julia’s exp is more accurate than the SIMD version from GLIBC, for example, but otherwise follows a similar implementation approach (which I described, and @Oscar_Smith implemented + increased the accuracy of based on the description, without ever looking at the GPL source).
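If you want to check that accuracy claim yourself, one rough approach is to measure errors against a BigFloat reference (a sketch; the exact bound you'll see is my assumption, not a documented guarantee):

```julia
setprecision(BigFloat, 256)

# Error of a Float64 function `f` at `x`, in ULPs, measured against a
# correctly-rounded BigFloat reference.
function ulps(f, x::Float64)
    ref = Float64(f(big(x)))
    abs(f(x) - ref) / eps(ref)
end

xs = 10 .* rand(10_000) .- 5          # random points in (-5, 5)
maxerr = maximum(ulps(exp, x) for x in xs)
```

Running the same loop over a libmvec variant (via the `vexp` wrapper above) would give you a direct comparison.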
EDIT:
On a different computer running Ubuntu (but with a different CPU, an i7 1165G7):
julia> @btime log($(Ref(1.2))[])
3.863 ns (0 allocations: 0 bytes)
0.1823215567939546
julia> @btime clog($(Ref(1.2))[])
6.578 ns (0 allocations: 0 bytes)
0.1823215567939546
julia> @btime llvmlog($(Ref(1.2))[])
5.971 ns (0 allocations: 0 bytes)
0.1823215567939546
julia> @btime exp($(Ref(1.2))[])
3.856 ns (0 allocations: 0 bytes)
3.3201169227365472
julia> @btime cexp($(Ref(1.2))[])
18.564 ns (0 allocations: 0 bytes)
3.3201169227365472
julia> @btime llvmexp($(Ref(1.2))[])
18.492 ns (0 allocations: 0 bytes)
3.3201169227365472
julia> versioninfo()
Julia Version 1.7.0-DEV.526
Commit 6468dcb04e* (2021-02-13 02:44 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
JULIA_NUM_THREADS = 8
Thankfully it sounds like all Linux distros (not just Clear) should start benefiting from similar optimizations fairly soon.