Note that xmm instructions whose names have p as the second-to-last letter (e.g., addpd, mulps) are SIMD, although just 128 bits wide. Godbolt is great for looking at this.
Using your flags:
.L2:
movdqa xmm1, xmm2
paddd xmm2, XMMWORD PTR .LC4[rip]
movaps XMMWORD PTR [rsp+48], xmm3
add ebx, 1
movaps XMMWORD PTR [rsp+16], xmm1
cvtdq2pd xmm0, xmm1
movaps XMMWORD PTR [rsp+32], xmm2
call _ZGVbN2v_sin
movdqa xmm1, XMMWORD PTR [rsp+16]
movaps XMMWORD PTR [rsp], xmm0
pshufd xmm0, xmm1, 238
cvtdq2pd xmm0, xmm0
call _ZGVbN2v_sin
addpd xmm0, XMMWORD PTR [rsp]
movapd xmm3, XMMWORD PTR [rsp+48]
cmp ebx, 24999999
movdqa xmm2, XMMWORD PTR [rsp+32]
addpd xmm3, xmm0
jne .L2
These are “pd” or “packed double” operations. The call itself is to _ZGVbN2v_sin, the mangled glibc name for sin on vectors of 2 doubles. So the ~1.9x speedup you saw is about what we should expect.
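As a rough guide to reading these names (this follows the x86 vector function ABI naming scheme; take the breakdown as a sketch, not a spec):

# _ZGVbN2v_sin, decoded left to right:
#   _ZGV : vector-function prefix
#   b    : ISA class (b = SSE/128-bit; e = AVX-512, as in _ZGVeN8v_sin below)
#   N    : not masked (no mask argument)
#   2    : vector length (2 doubles here, 8 in the AVX-512 variant)
#   v    : one vector argument
#   _sin : the scalar function being vectorized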
For wider SIMD, you’re missing the appropriate -march flag. My CPU is a cascadelake, so I added -march=cascadelake:
.L2:
vmovdqa64 zmm1, zmm2
vpaddd zmm2, zmm2, ZMMWORD PTR .LC4[rip]
vmovapd ZMMWORD PTR [rbp-880], zmm3
vmovdqa64 ZMMWORD PTR [rbp-816], zmm2
vmovdqa64 ZMMWORD PTR [rbp-752], zmm1
vcvtdq2pd zmm0, ymm1
call _ZGVeN8v_sin
vmovdqa64 zmm1, ZMMWORD PTR [rbp-752]
vmovapd ZMMWORD PTR [rbp-688], zmm0
vextracti32x8 ymm1, zmm1, 0x1
vcvtdq2pd zmm0, ymm1
call _ZGVeN8v_sin
vaddpd zmm0, zmm0, ZMMWORD PTR [rbp-688]
vmovapd zmm3, ZMMWORD PTR [rbp-880]
inc ebx
cmp ebx, 6249999
vaddpd zmm3, zmm3, zmm0
vmovdqa64 zmm2, ZMMWORD PTR [rbp-816]
jne .L2
Now we get zmm (512-bit) registers, and a call to _ZGVeN8v_sin (vectors of 8 doubles). I also added gfortran 11 output to the Godbolt, as that is the version I am on.
gfortran --version
GNU Fortran (Clear Linux OS for Intel Architecture) 11.1.1 20210505 releases/gcc-11.1.0-64-ge9a8d6852c
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
For some reason, performance is bad:
> gfortran -Ofast -march=native -mprefer-vector-width=512 avxsin.f90 -S -o avxsin.s
> gfortran -Ofast -march=native -mprefer-vector-width=512 avxsin.f90
> ./a.out
Time 1.1339029999999999
1.7136493465704397
I confirmed I also had the zmm instructions, but it’s only about 2.4x faster than the scalar Julia version below, not the 7x+ I’d expect.
julia> function f(N)
s = 0.0
for i in 1:N
s += sin(i)
end
s
end
f (generic function with 1 method)
julia> @time r = f(100000000)
2.744642 seconds (1 allocation: 16 bytes)
1.7136493465703402
Changing up the -march flag:
> gfortran -Ofast avxsin.f90
> ./a.out
Time 1.7568840000000001
1.7136493465696847
> gfortran -Ofast -march=skylake avxsin.f90
> ./a.out
Time 1.2096160000000000
1.7136493465701701
> gfortran -Ofast -march=skylake-avx512 avxsin.f90
> ./a.out
Time 1.1435370000000000
1.7136493465701701
> gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 avxsin.f90
> ./a.out
Time 1.1281299999999999
1.7136493465704397
Yes, Julia still has high latency. Starting Julia by itself already takes a couple hundred milliseconds:
> time julia -O3 -C"native,-prefer-256-bit" -q --startup=no -E "1+1"
2
________________________________________________________
Executed in 114.51 millis fish external
usr time 77.36 millis 653.00 micros 76.71 millis
sys time 125.27 millis 0.00 micros 125.27 millis
So this does heavily shape people’s workflows: people primarily work out of long-lived Julia sessions, e.g. REPLs or notebooks. Popular tooling like Revise.jl reflects this.
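For illustration, a typical long-lived-session pattern looks something like this (the file mycode.jl and the function f inside it are hypothetical):

julia> using Revise            # load Revise before your own code
julia> includet("mycode.jl")   # Revise tracks the file and picks up edits
julia> f(100)                  # edit mycode.jl, call again, no restart needed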
Note that Julia starts with 1 thread by default, so your @avxt version was probably also single-threaded.
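For reference, you can check the thread count from the REPL, and opt in to more at startup:

julia> Threads.nthreads()   # 1 unless you asked for more
1

Starting with julia -t 4 (or setting JULIA_NUM_THREADS) gives @avxt threads to work with.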
The sin implementation doesn’t use a table, but there are a few issues.
One is that Julia’s special functions are normally not inlined. Inlining happens in Julia’s compiler, before handing things off to LLVM, and LLVM normally makes the vectorization decisions (e.g., when using @simd); when faced with a non-intrinsic call, and forbidden from inlining things itself, there’s nothing LLVM can do.
Another is that the implementations often aren’t SIMD-friendly: branches should be minimal. SLEEFPirates.jl provides branchless special functions that are also marked @inline so that @simd will vectorize them:
julia> using LoopVectorization, SLEEFPirates
julia> function f_avx(N)
s = 0.0
@avx for i in 1:N
s += sin(i)
end
s
end
f_avx (generic function with 1 method)
julia> function f_simd(N)
s = 0.0
@simd for i in 1:N
s += SLEEFPirates.sin(Float64(i))
end
s
end
f_simd (generic function with 1 method)
julia> function f_simd_fast(N)
s = 0.0
@simd for i in 1:N
s += SLEEFPirates.sin_fast(Float64(i))
end
s
end
f_simd_fast (generic function with 1 method)
julia> @time f_simd(100_000_000)
0.266950 seconds
1.7136493465702665
julia> @time f_simd_fast(100_000_000)
0.145078 seconds
1.713649346570416
julia> @time f_avx(100_000_000)
0.082160 seconds
1.7136493465698832
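If you want to confirm the vectorization rather than infer it from the timings, you can inspect the generated IR (assuming the definitions above):

julia> @code_llvm debuginfo=:none f_simd(100)

and look for vector types like <4 x double> or <8 x double> in the loop body.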
Now all of these are significantly faster than the Fortran code. One thing that helps them is the inlining: there are a lot of constants used for evaluating sin (polynomial kernels). When inlined, these constant loads can be hoisted out of the loop.
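To make the “constants” point concrete, here is a toy sketch (a truncated Taylor series, only accurate near zero; the real kernels use minimax coefficients and range reduction):

julia> @inline sin_poly(x) = x * evalpoly(x * x, (1.0, -1/6, 1/120, -1/5040))  # branchless, inlinable
sin_poly (generic function with 1 method)
julia> function f_poly(N)
           s = 0.0
           @simd for i in 1:N
               s += sin_poly(i / N)   # small arguments, where the toy kernel is valid
           end
           s
       end
f_poly (generic function with 1 method)

Because sin_poly is a single inlined, branch-free expression, those four coefficients can sit in registers (or as hoisted vector constants) for the whole loop instead of being reloaded on every call.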
But the gfortran version also didn’t show the expected speedup when moving to vectors of length 8.
As an aside, the ifort I downloaded with oneAPI was throwing errors (unable to find libm and friends), otherwise I’d have tried that too. The Intel compilers should be better about SIMD-ing special functions than gcc.
Among the things that LoopVectorization does is dispatch supported special functions to SIMD-friendly versions (spread across VectorizationBase.jl and SLEEFPirates.jl).
One thing needed is a vector math library to call. GCC uses glibc’s libmvec, Flang uses libpgmath (I haven’t actually double-checked that it contains SIMD implementations), and LLVM otherwise doesn’t use one unless you specify -mveclibabi= and link the appropriate library. Of course, these also require the appropriate flags.
Note that doing this won’t work for Julia, because Julia implements sin, exp, etc. in Julia, rather than by calling @llvm.sin & friends. However, this also makes inlining easier, which can really help some benchmarks.
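You can see this from the REPL; @which (available there by default via InteractiveUtils) points at the pure-Julia definition:

julia> @which sin(1.0)   # resolves to a method in Base.Math (base/special/trig.jl): ordinary Julia code, not an intrinsic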