Julia gets mentioned in an article about FORTRAN

Note that xmm instructions with p as the second-to-last letter of the mnemonic (e.g., addpd, addps) are SIMD ("packed") operations, although only 128 bits wide here. Godbolt is great for looking at this.
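You can also do this kind of inspection without leaving Julia. A rough sketch (exact output depends on your CPU and Julia version); in the loop body of the native code you should see p-suffixed packed ops on xmm/ymm/zmm registers:

julia> using InteractiveUtils  # already loaded in the REPL

julia> function sumsq(x)
           s = 0.0
           @simd for i in eachindex(x)
               s += x[i] * x[i]
           end
           s
       end
sumsq (generic function with 1 method)

julia> @code_native debuginfo=:none sumsq(rand(1000))  # look for vmulpd/vaddpd/vfmadd on ymm or zmm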
Using your flags:

.L2:
        movdqa  xmm1, xmm2
        paddd   xmm2, XMMWORD PTR .LC4[rip]
        movaps  XMMWORD PTR [rsp+48], xmm3
        add     ebx, 1
        movaps  XMMWORD PTR [rsp+16], xmm1
        cvtdq2pd        xmm0, xmm1
        movaps  XMMWORD PTR [rsp+32], xmm2
        call    _ZGVbN2v_sin
        movdqa  xmm1, XMMWORD PTR [rsp+16]
        movaps  XMMWORD PTR [rsp], xmm0
        pshufd  xmm0, xmm1, 238
        cvtdq2pd        xmm0, xmm0
        call    _ZGVbN2v_sin
        addpd   xmm0, XMMWORD PTR [rsp]
        movapd  xmm3, XMMWORD PTR [rsp+48]
        cmp     ebx, 24999999
        movdqa  xmm2, XMMWORD PTR [rsp+32]
        addpd   xmm3, xmm0
        jne     .L2

These are “pd”, or “packed double”, operations. The call is to _ZGVbN2v_sin, the mangled name of glibc’s (libmvec’s) 2-wide vectorized sin. So the 1.9x speedup you saw is about what we should expect.

For wider SIMD, you’re missing the appropriate -march flag. My CPU is a cascadelake, so I added -march=cascadelake:

.L2:
        vmovdqa64       zmm1, zmm2
        vpaddd  zmm2, zmm2, ZMMWORD PTR .LC4[rip]
        vmovapd ZMMWORD PTR [rbp-880], zmm3
        vmovdqa64       ZMMWORD PTR [rbp-816], zmm2
        vmovdqa64       ZMMWORD PTR [rbp-752], zmm1
        vcvtdq2pd       zmm0, ymm1
        call    _ZGVeN8v_sin
        vmovdqa64       zmm1, ZMMWORD PTR [rbp-752]
        vmovapd ZMMWORD PTR [rbp-688], zmm0
        vextracti32x8   ymm1, zmm1, 0x1
        vcvtdq2pd       zmm0, ymm1
        call    _ZGVeN8v_sin
        vaddpd  zmm0, zmm0, ZMMWORD PTR [rbp-688]
        vmovapd zmm3, ZMMWORD PTR [rbp-880]
        inc     ebx
        cmp     ebx, 6249999
        vaddpd  zmm3, zmm3, zmm0
        vmovdqa64       zmm2, ZMMWORD PTR [rbp-816]
        jne     .L2

Now we get zmm (512-bit) registers, and a call to _ZGVeN8v_sin, the 8-wide AVX-512 version. I also added gfortran 11 output to the Godbolt, as that is the version I am on.

> gfortran --version
GNU Fortran (Clear Linux OS for Intel Architecture) 11.1.1 20210505 releases/gcc-11.1.0-64-ge9a8d6852c
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
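
Incidentally, if you're not sure what to pass for -march, Julia will tell you what LLVM calls the host CPU (which generally matches GCC's -march names); on my machine:

julia> Sys.CPU_NAME
"cascadelake"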

For some reason, performance is bad:

> gfortran -Ofast -march=native -mprefer-vector-width=512 avxsin.f90 -S -o avxsin.s
> gfortran -Ofast -march=native -mprefer-vector-width=512 avxsin.f90
> ./a.out
 Time   1.1339029999999999
   1.7136493465704397

I confirmed that I also get the zmm instructions, but it’s only about 2.4x faster, not the 7x+ I’d expect.

julia> function f(N)
           s = 0.0
           for i in 1:N
               s += sin(i)
           end
           s
       end
f (generic function with 1 method)

julia> @time r = f(100000000)
  2.744642 seconds (1 allocation: 16 bytes)
1.7136493465703402

Changing up the march flag:

> gfortran -Ofast avxsin.f90
> ./a.out
 Time   1.7568840000000001
   1.7136493465696847
> gfortran -Ofast -march=skylake avxsin.f90
> ./a.out
 Time   1.2096160000000000
   1.7136493465701701
> gfortran -Ofast -march=skylake-avx512 avxsin.f90
> ./a.out
 Time   1.1435370000000000
   1.7136493465701701
> gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 avxsin.f90
> ./a.out
 Time   1.1281299999999999
   1.7136493465704397

Yes, Julia still has high latency. Starting Julia by itself already takes over a hundred milliseconds:

> time julia -O3 -C"native,-prefer-256-bit" -q --startup=no -E "1+1"
2

________________________________________________________
Executed in  114.51 millis    fish           external
   usr time   77.36 millis  653.00 micros   76.71 millis
   sys time  125.27 millis    0.00 micros  125.27 millis

so this does heavily shape people’s workflows: people primarily work out of long-lived Julia sessions, e.g. REPLs or notebooks, and popular tooling like Revise.jl reflects this.

Note that Julia starts with 1 thread by default, so your @avxt version was probably also single-threaded.
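
If you do want threads, you have to ask for them at startup (e.g. -t 4, or -t auto); a quick check that they are actually there:

> julia -t 4 -q

julia> Threads.nthreads()
4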

The sin implementation doesn’t use a table, but there are a few issues.

Julia’s special functions are normally not inlined. Inlining happens in Julia’s compiler, before things are handed off to LLVM, and LLVM normally makes the vectorization decisions (e.g., when using @simd); when faced with a non-intrinsic call, and forbidden from inlining it itself, there’s nothing LLVM can do.
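
You can see this directly: with Base’s sin inside an @simd loop, the optimized IR still contains one scalar call per iteration rather than <2 x double>/<4 x double> operations (a sketch; the exact IR varies across Julia versions):

julia> function f_base(N)
           s = 0.0
           @simd for i in 1:N
               s += sin(i)
           end
           s
       end
f_base (generic function with 1 method)

julia> @code_llvm debuginfo=:none f_base(10)  # the loop body calls the sin routine each iteration; no vector types appear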

Another issue is that the implementations often aren’t SIMD-friendly: branches should be minimal. SLEEFPirates.jl provides branchless special functions that are also marked @inline, so that @simd will vectorize them:

julia> using LoopVectorization, SLEEFPirates

julia> function f_avx(N)
           s = 0.0
           @avx for i in 1:N
               s += sin(i)
           end
           s
       end
f_avx (generic function with 1 method)

julia> function f_simd(N)
           s = 0.0
           @simd for i in 1:N
               s += SLEEFPirates.sin(Float64(i))
           end
           s
       end
f_simd (generic function with 1 method)

julia> function f_simd_fast(N)
           s = 0.0
           @simd for i in 1:N
               s += SLEEFPirates.sin_fast(Float64(i))
           end
           s
       end
f_simd_fast (generic function with 1 method)

julia> @time f_simd(100_000_000)
  0.266950 seconds
1.7136493465702665

julia> @time f_simd_fast(100_000_000)
  0.145078 seconds
1.713649346570416

julia> @time f_avx(100_000_000)
  0.082160 seconds
1.7136493465698832

Now all of these are significantly faster than the Fortran code. One thing that helps them is the inlining: there are a lot of constants used for evaluating sin (polynomial kernels). When inlined, these constant loads can be hoisted out of the loop.
But the gfortran version also didn’t show the expected speedup when moving to vectors of length 8.

As an aside, the ifort I downloaded with oneAPI was throwing errors (unable to find libm and friends); otherwise I’d have tried that too. The Intel compilers should be better than gcc about SIMD-ing special functions.

Among the things that LoopVectorization does is dispatch supported special functions to SIMD-friendly versions (spread across VectorizationBase.jl and SLEEFPirates.jl).
One thing you need is a vector math library to call. GCC uses glibc’s, Flang uses libpgmath (I haven’t actually double-checked that it contains SIMD implementations), and LLVM otherwise doesn’t use one unless you specify -mveclibabi= and link the appropriate library. Of course, they also require appropriate flags.
Note that doing this won’t work for Julia, because Julia implements sin, exp, etc. in Julia rather than by calling @llvm.sin & friends. However, this also makes inlining easier, which can really help some benchmarks.
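
If you want to poke at those SIMD-friendly versions directly, they operate on VectorizationBase’s Vec type. A quick sketch, assuming both packages are installed (the printed form, and the last digit or two, may differ on your setup):

julia> using VectorizationBase, SLEEFPirates

julia> v = Vec(1.0, 2.0, 3.0, 4.0)
Vec{4, Float64}<1.0, 2.0, 3.0, 4.0>

julia> SLEEFPirates.sin(v)  # one packed evaluation of sin across all four lanes
Vec{4, Float64}<0.8414709848078965, 0.9092974268256817, 0.1411200080598672, -0.7568024953079282>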
