Julia gets mentioned in an article about FORTRAN

I know that history, but at least here I know it’s the case because we see simd ivdep and such missing from Julia broadcast loops all of the time. We ended up creating an alternative broadcast that forces it to be true:

https://github.com/SciML/DiffEqBase.jl/blob/master/src/diffeqfastbc.jl#L55-L57

This was required to get DiffEq to match what @Elrod was seeing was optimal. In Julia v1.6 there were some regressions in this too, so @YingboMa and @Elrod created another deep workaround:

So at least the issues I’ve been chasing mostly seem to come from an inability to prove these properties. @dextorious filed a few issues against Julia Base which highlighted a few other cases as well. I think the main issue is that restrict wasn’t strictly necessary in C because C compilers do at least okay at proving no-aliasing, while Julia is a little less reliable at proving it, which makes it a bigger deal. (And in DiffEq, the majority of external codes we benchmark against are Fortran or Sundials, so that sets a very high bar.)
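For reference, a minimal sketch of the kind of @simd ivdep annotation mentioned above. This is just an illustrative kernel, not the DiffEqBase broadcast implementation linked earlier:

function axpy!(y, a, x)
    # ivdep promises LLVM there are no memory dependencies between
    # iterations (i.e. y and x do not alias), which is exactly the
    # property the compiler often fails to prove on its own
    @inbounds @simd ivdep for i in eachindex(y, x)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end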

5 Likes

Let’s first do just single core performance. Here is how I would write it in Fortran:

program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r

call cpu_time(t1)
r = f(100000000)
call cpu_time(t2)

print *, "Time", t2-t1
print *, r

contains

    real(dp) function f(N) result(r)
    integer, intent(in) :: N
    integer :: i
    r = 0
    do i = 1, N
        r = r + sin(real(i,dp))
    end do
    end function

end program

Compile and run (gfortran 9.3.0):

$ gfortran -Ofast avx.f90
$ ./a.out
 Time   1.4622860000000000     
   1.7136493465705178     

Then compare to pure Julia (1.6.1) first:

function f(N)
    s = 0.0
    for i in 1:N
        s += sin(i)
    end
    s
end

@time r = f(100000000)
println(r)

Compile and run:

$ julia f.jl
  2.784782 seconds (1 allocation: 16 bytes)
1.7136493465703402

So the Fortran code executes 1.9x faster than Julia. I checked the assembly, and neither Julia nor gfortran generates AVX instructions for some reason (both use the xmm* registers), so the comparison should be fair. Why can’t Julia generate AVX instructions by default? I don’t know why gfortran does not either.

Also note that the speed of compilation+run for N=10 for the Fortran version is about 0.162s:

$ time (gfortran -Ofast avx.f90 && ./a.out)
 Time   9.9999999999969905E-007
   1.4111883712180107     
( gfortran -Ofast avx.f90 && ./a.out; )  0.08s user 0.04s system 73% cpu 0.162 total

While for Julia it is 0.484s:

$ time julia f.jl
  0.000004 seconds (1 allocation: 16 bytes)
1.4111883712180104
julia f.jl  1.03s user 0.20s system 253% cpu 0.484 total

So Julia is about 3x slower for compile+run. I assume most of that is the slow startup time or something similar. But this is the other aspect of tooling and user experience.

Now let’s use the @avxt macro.

using LoopVectorization

function f_avx(N)
    s = 0.0
    @avxt for i in 1:N
        s += sin(i)
    end
    s
end

@time r = f_avx(100000000)
println(r)

Compile and run:

$ julia avx.jl
  0.185562 seconds (1 allocation: 16 bytes)
1.713649346570267

Things got 15x faster than the pure Julia version and about 7.9x faster than the Fortran version.

@Elrod do you know if there is a reason why both Julia and Fortran couldn’t generate the fast AVX version by default? As a user that is what I would want.

My Julia version:

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
10 Likes

It presumably is the same problem as the one with getting exp to auto-vectorize that was discussed above.

Note that you can use @avx if you want to compare with LoopVectorization.jl’s single threaded performance (rather than the multi-threaded @avxt)

With @avx instead of @avxt I get on my machine:

$ julia avx.jl
  0.206548 seconds (1 allocation: 16 bytes)
1.713649346570267

Just to complement: @avxt is multi-threaded, and you must then run Julia with julia -t4 avx.jl to use 4 threads:

% julia avx.jl 
  0.247981 seconds (1 allocation: 16 bytes)
1.713649346570267

% julia -t4 avx.jl 
  0.062149 seconds (1 allocation: 16 bytes)
1.713649346570008

11 Likes

Ah cool! Very nice. The parallel version can then be compared against do concurrent or OpenMP in Fortran.

4 Likes

Note that xmm instructions where p is the second-to-last letter of the mnemonic (e.g. addpd, mulps) are SIMD, although just 128 bits wide. Godbolt is great for looking at this.
Using your flags:

.L2:
        movdqa  xmm1, xmm2
        paddd   xmm2, XMMWORD PTR .LC4[rip]
        movaps  XMMWORD PTR [rsp+48], xmm3
        add     ebx, 1
        movaps  XMMWORD PTR [rsp+16], xmm1
        cvtdq2pd        xmm0, xmm1
        movaps  XMMWORD PTR [rsp+32], xmm2
        call    _ZGVbN2v_sin
        movdqa  xmm1, XMMWORD PTR [rsp+16]
        movaps  XMMWORD PTR [rsp], xmm0
        pshufd  xmm0, xmm1, 238
        cvtdq2pd        xmm0, xmm0
        call    _ZGVbN2v_sin
        addpd   xmm0, XMMWORD PTR [rsp]
        movapd  xmm3, XMMWORD PTR [rsp+48]
        cmp     ebx, 24999999
        movdqa  xmm2, XMMWORD PTR [rsp+32]
        addpd   xmm3, xmm0
        jne     .L2

These are “pd”, or “packed double”, operations. The call is to _ZGVbN2v_sin, the mangled glibc name for the 2-wide vector sin. So the 1.9x speedup you saw is about what we should expect.

For wider SIMD, you’re missing the appropriate -march flag. My CPU is a cascadelake, so I added -march=cascadelake:

.L2:
        vmovdqa64       zmm1, zmm2
        vpaddd  zmm2, zmm2, ZMMWORD PTR .LC4[rip]
        vmovapd ZMMWORD PTR [rbp-880], zmm3
        vmovdqa64       ZMMWORD PTR [rbp-816], zmm2
        vmovdqa64       ZMMWORD PTR [rbp-752], zmm1
        vcvtdq2pd       zmm0, ymm1
        call    _ZGVeN8v_sin
        vmovdqa64       zmm1, ZMMWORD PTR [rbp-752]
        vmovapd ZMMWORD PTR [rbp-688], zmm0
        vextracti32x8   ymm1, zmm1, 0x1
        vcvtdq2pd       zmm0, ymm1
        call    _ZGVeN8v_sin
        vaddpd  zmm0, zmm0, ZMMWORD PTR [rbp-688]
        vmovapd zmm3, ZMMWORD PTR [rbp-880]
        inc     ebx
        cmp     ebx, 6249999
        vaddpd  zmm3, zmm3, zmm0
        vmovdqa64       zmm2, ZMMWORD PTR [rbp-816]
        jne     .L2

Now we get zmm (512-bit) registers and a call to _ZGVeN8v_sin. I also added gfortran 11 output to the Godbolt link, as that is the version I am on.

gfortran --version
GNU Fortran (Clear Linux OS for Intel Architecture) 11.1.1 20210505 releases/gcc-11.1.0-64-ge9a8d6852c
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

For some reason, performance is bad:

> gfortran -Ofast -march=native -mprefer-vector-width=512 avxsin.f90 -S -o avxsin.s
> gfortran -Ofast -march=native -mprefer-vector-width=512 avxsin.f90
> ./a.out
 Time   1.1339029999999999
   1.7136493465704397

I confirmed I also had the zmm instructions, but it’s only about 2.4x faster, not 7x+ like I’d expect.

julia> function f(N)
           s = 0.0
           for i in 1:N
               s += sin(i)
           end
           s
       end
f (generic function with 1 method)

julia> @time r = f(100000000)
  2.744642 seconds (1 allocation: 16 bytes)
1.7136493465703402

Changing up the march flag:

> gfortran -Ofast avxsin.f90
> ./a.out
 Time   1.7568840000000001
   1.7136493465696847
> gfortran -Ofast -march=skylake avxsin.f90
> ./a.out
 Time   1.2096160000000000
   1.7136493465701701
> gfortran -Ofast -march=skylake-avx512 avxsin.f90
> ./a.out
 Time   1.1435370000000000
   1.7136493465701701
> gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 avxsin.f90
> ./a.out
 Time   1.1281299999999999
   1.7136493465704397

Yes, Julia still has high latency. Starting Julia by itself already takes over a hundred milliseconds:

> time julia -O3 -C"native,-prefer-256-bit" -q --startup=no -E "1+1"
2

________________________________________________________
Executed in  114.51 millis    fish           external
   usr time   77.36 millis  653.00 micros   76.71 millis
   sys time  125.27 millis    0.00 micros  125.27 millis

So this does heavily shape people’s workflows, i.e. primarily working out of long-lived Julia sessions, e.g. REPLs or notebooks. Popular tooling like Revise.jl reflects this.

Note that Julia starts with 1 thread by default, so your @avxt version was probably also single threaded.
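As a quick reminder (standard Julia, nothing specific to this thread): you can check the thread count from the REPL, and start Julia with julia -t4 or JULIA_NUM_THREADS=4 so that @avxt actually has threads to use.

julia> Threads.nthreads()
1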

The sin implementation doesn’t use a table, but there are a few issues.

Julia’s special functions are normally not inlined. Inlining happens in Julia’s compiler before handing things off to LLVM, and LLVM normally makes the vectorization decisions (e.g., when using @simd); when faced with a non-intrinsic call, and forbidden from inlining things itself, there’s nothing LLVM can do.

Another issue is that the implementations often aren’t SIMD friendly. Branches should be minimal. SLEEFPirates.jl provides branchless special functions that are also marked @inline so that @simd will vectorize them:

julia> function f_avx(N)
           s = 0.0
           @avx for i in 1:N
               s += sin(i)
           end
           s
       end
f_avx (generic function with 1 method)

julia> function f_simd(N)
           s = 0.0
           @simd for i in 1:N
               s += SLEEFPirates.sin(Float64(i))
           end
           s
       end
f_simd (generic function with 1 method)

julia> function f_simd_fast(N)
           s = 0.0
           @simd for i in 1:N
               s += SLEEFPirates.sin_fast(Float64(i))
           end
           s
       end
f_simd_fast (generic function with 1 method)

julia> @time f_simd(100_000_000)
  0.266950 seconds
1.7136493465702665

julia> @time f_simd_fast(100_000_000)
  0.145078 seconds
1.713649346570416

julia> @time f_avx(100_000_000)
  0.082160 seconds
1.7136493465698832

Now all of these are significantly faster than the Fortran code. One thing that helps them is the inlining: there are a lot of constants used for evaluating sin (polynomial kernels). When inlined, these constant loads can be hoisted out of the loop.
But the gfortran version also didn’t show the expected speedup when moving to vectors of length 8.
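Going back to the constant-hoisting point, a toy illustration (the function names and coefficients here are made up; this is not the real sin kernel):

# An @inline polynomial kernel: once inlined and vectorized, LLVM can hoist
# the coefficient loads out of the summation loop instead of reloading them.
@inline poly_kernel(x) = evalpoly(x, (1.0, -0.166666, 0.008333, -0.000198))

function sum_poly(N)
    s = 0.0
    @simd for i in 1:N
        s += poly_kernel(Float64(i))
    end
    s
end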

As an aside, my ifort that I downloaded with oneapi was throwing errors (unable to find libm and friends), otherwise I’d have tried that too. The Intel compilers should be better about SIMD-ing special functions than gcc.

Among the things that LoopVectorization does is dispatch supported special functions on SIMD-friendly versions (spread across VectorizationBase.jl and SLEEFPirates.jl).
One thing needed is a vector math library to call. GCC uses glibc’s, Flang uses libpgmath (I haven’t actually double checked that it contains SIMD implementations), and LLVM otherwise doesn’t use one unless you specify -mveclibabi= and link the appropriate library. Of course, they also require appropriate flags.
Note that doing this won’t work for Julia, because Julia implements sin, exp, etc. in Julia rather than by calling @llvm.sin and friends. However, this also makes inlining easier, which can really help some benchmarks.

21 Likes

Addendum to my above comment: errors relative to a BigFloat reference:

julia> correct = @time f(big(100_000_000))
394.399953 seconds (2.63 G allocations: 69.062 GiB, 4.56% gc time)
1.713649346570575845989725326488332301095244205920242958366963619602202176739779

julia> Float64(f_simd(100_000_000) - correct)
-3.0938307061767365e-13

julia> Float64(f_simd_fast(100_000_000) - correct)
-1.5994705150312756e-13

julia> Float64(f_avx(100_000_000) - correct)
-6.926320587182777e-13

julia> Float64(f(100_000_000) - correct)
-2.3566426178256326e-13

LoopVectorization’s threads should be low overhead, with the overhead being well under 1 microsecond if used within 1 millisecond of the last code threaded by LoopVectorization/Octavian/CheapThreads.

julia> function f_avx(N)
           s = 0.0
           @avx for i in 1:N
               s += sin(i)
           end
           s
       end
f_avx (generic function with 1 method)

julia> function f_avxt(N)
           s = 0.0
           @avxt for i in 1:N
               s += sin(i)
           end
           s
       end
f_avxt (generic function with 1 method)

julia> @btime f_avx(1_000)
  830.351 ns (0 allocations: 0 bytes)
0.813969634073166

julia> @btime f_avxt(1_000)
  554.457 ns (0 allocations: 0 bytes)
0.8139696340731665
3 Likes

Note that those are compiler extensions (though Intel’s compiler shares them with LLVM; in fact, the modern Intel compiler is LLVM-based).
OpenMP 4 added pragmas dedicated to SIMD which are supported by any compiler with OpenMP support (all of them but MSVC at the moment, though MSVC seems to be embracing LLVM’s OpenMP, which is contributed by Intel). So it might be nice to embrace that syntax.

Thanks for the detailed analysis. I believe -march=native should select the processor, and I did use it at first, but then I switched to just -Ofast without realizing that it does not imply -march=native. The timings didn’t change much either way, as you also noticed.

I also asked here:

where we tried different Fortran compilers, and gfortran does not always perform the best. It also depends on the sin function implementation, etc. But it seems that one can get within a factor of 2x of the @avx Julia implementation with a better compiler (which is much better than 40x).

Historically, the Intel compiler typically produced the fastest Fortran code, so I believe that is what most people use for production. GFortran is nice because it is open source, good enough, and has been steadily improving.

One of the original missions of Fortran was to extract a high percentage of performance from the underlying hardware, and I think historically that has been the case, but lately the Fortran compilers might be a bit behind on this mission. One of the motivations for me to start LFortran was to also be able to look into these performance issues down the road. As a user I would expect that by default the compiler should simply use the widest SIMD instructions and properly unroll loops; as you mentioned, part of the work is to have properly vectorized versions of elementary functions, etc.

11 Likes

Have you tried implementing Chris Elrod there?

Seriously now, when testing, note that those tests include compilation time (which is small for this simple function), since compilation occurs on the first call to every method. You will get a better measure of performance with the @btime macro from BenchmarkTools:

using BenchmarkTools
@btime f(1000000)

or

@benchmark f(1000000)

for more details.

2 Likes

I’ve heard that the build times are horrendous; it took a few decades before ours was usable, but I guess the results speak for themselves.

31 Likes

Pointer aliasing is a valid point, but other languages can also get around it. For example, Rust ensures every &mut is non-aliasing, since at any time there can be at most one mutable borrower.
Even if we don’t have restrict or &mut, we can prove the desired property with pointer analysis, especially interprocedural analysis. Of course, it takes a long time to get precise enough, but it is totally doable. I guess Julia can’t spend too much time on these analyses since compilation time is already a huge problem…

That is certainly a valid use case, but “alive” can mean a lot of different things for code.

The ideal situation is when necessary changes can be implemented quickly: multiple people understand what the code does and how to modify it, there is a mature and comprehensive suite of unit tests, and the code itself is well-organized and documented.

While some Fortran code works like that, the majority I have seen in legacy applications is written in Fortran 77 (at best, F9x), with functions spanning hundreds of lines, forming a dense, nonlinear spaghetti of GOTOs, documenting the calling conventions but not the internals.

And the person who wrote it has left the company/research unit two decades ago, is long retired, and either dead or prefers angling on a quiet lakeshore to writing Fortran 77.

This is not to pick on F77. Programming culture has changed a lot since that code was written, and dealing with legacy code was less of an issue because languages didn’t last that long. In fact, Fortran is one of the oldest programming languages still in use, and for good reason.

The best you can do in those cases is have a foreign function interface that makes it easy to wrap code like this, and then, if it makes sense economically, port the unit tests and then gradually the whole codebase.

1 Like

Thanks. I ran @time several times and took the best timing. I think that would ensure the compilation time is not included. Or am I wrong?

That’s fine. That’s what @btime does for you.

3 Likes

Referencing an earlier post about using Julia in restricted environments:

I encountered something similar when I tried to write a custom AWS Lambda runtime for Julia. I don’t recall if I used ApplicationBuilder or PackageCompiler, so I can’t be as specific as I would like to be, but a simple hello world was too big to fit into the available storage for a custom runtime.

1 Like

Huh, I must be missing something here. It seems you are calling sin() 100,000,000 times, which suggests to me that the cost of adding the various results is a small percentage of the computation.
The computation of sin(1), sin(2), etc. can be done using the formula for sin(x+y); all you need is to compute sin(1) and cos(1) and do some arithmetic. sin(2) is 2*cos(1)*sin(1), for instance.
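A minimal Julia sketch of that idea (f_recurrence is a hypothetical name, and the recurrence accumulates rounding error, so it is a speed trick rather than an accuracy improvement):

function f_recurrence(N)
    s1, c1 = sincos(1.0)   # the only sin/cos evaluations needed
    si, ci = s1, c1        # running sin(i), cos(i)
    total = si
    for i in 2:N
        # sin(i) = sin(i-1)cos(1) + cos(i-1)sin(1), and analogously for cos(i)
        si, ci = si * c1 + ci * s1, ci * c1 - si * s1
        total += si
    end
    total
end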

While adding 100,000,000 numbers will give you some kind of timing data, the sum may be substantially inaccurate because of cancellation, roundoff, etc. (look up Kahan summation).
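For completeness, a small sketch of the Kahan (compensated) summation mentioned above; kahan_sum is a hypothetical helper, not something used in the benchmarks in this thread:

function kahan_sum(xs)
    s = 0.0   # running sum
    c = 0.0   # compensation for lost low-order bits
    for x in xs
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    end
    s
end

# e.g. kahan_sum(sin(i) for i in 1:100_000_000)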

There are plenty of scientific computing benchmarks in the open literature (in FORTRAN, Matlab, C, …) that might be coded in Julia [if they are not already available in Julia] that would make a more useful comparison. Maybe see Lies, damned lies, and benchmarks: What makes a good performance metric – GfxSpeak
The Julia benchmarks I was able to easily spot (via Google) were kind of micro-optimization claims, but maybe there’s more?

1 Like

I was the one who originally posted this test, and the purpose was only to compare the possible overhead of threading with that of OpenMP. The reason for using sin there is that the compiler then cannot trick us by determining the result without computing the sum.

1 Like

@dcgrigsby, please look at this thread. The limits on Lambda have changed. The report here is that the Julia image size is around 500 MB.

1 Like