Julia gets mentioned in an article about FORTRAN

The last time I looked at it was

but it would be awesome to have more up to date benchmarks.

1 Like

Here is one example (not sure how representative it is):

Fortran:

Code
PROGRAM Parallel_Hello_World
USE OMP_LIB
REAL*8 :: partial_Sum, total_Sum
!$OMP PARALLEL PRIVATE(partial_Sum) SHARED(total_Sum)
    partial_Sum = 0.d0
    total_Sum = 0.d0

    !$OMP DO
    DO i=1,1000000000
        partial_Sum = partial_Sum + dsin(dble(i))
    END DO
    !$OMP END DO

    !$OMP CRITICAL
        total_Sum = total_Sum + partial_Sum
    !$OMP END CRITICAL

!$OMP END PARALLEL
PRINT *, "Total Sum: ", total_Sum
END

Result:

leandro@pitico:~/Drive/Work/JuliaPlay% gfortran -O3 omp.f95 -o omp
leandro@pitico:~/Drive/Work/JuliaPlay% time ./omp
 Total Sum:   0.42129448674541342

real    0m54,495s
user    0m54,476s
sys     0m0,004s
leandro@pitico:~/Drive/Work/JuliaPlay% gfortran -O3 -fopenmp omp.f95 -o omp
leandro@pitico:~/Drive/Work/JuliaPlay% time ./omp
 Total Sum:   0.42129448674645509

real    0m12,538s
user    1m29,020s
sys     0m0,016s

Julia using FLoops:

Code
using FLoops
function f()
    @floop for i in 1:1000000000
        @reduce(total_Sum += sin(i))
    end
    total_Sum
end
println("Total sum: ",f())

Result:

leandro@pitico:~/Drive/Work/JuliaPlay% time julia floop.jl
Total sum: 0.4212944867466973

real    0m40,729s
user    0m41,039s
sys     0m0,608s

leandro@pitico:~/Drive/Work/JuliaPlay% time julia -t8 floop.jl
Total sum: 0.4212944867465145

real    0m14,743s
user    1m4,074s
sys     0m0,751s

Julia using @threads:

using Base.Threads
function f()
    total_Sum = zeros(nthreads())
    @threads for i in 1:1000000000
        total_Sum[threadid()] += sin(i)
    end
    sum(total_Sum)
end
println("Total sum: ",f())
leandro@pitico:~/Drive/Work/JuliaPlay% time julia -t8 threads.jl 
Total sum: 0.4212944867465146

real    0m10,514s
user    1m11,253s
sys     0m0,544s

Note that Julia is faster than Fortran with a single thread (no idea why).

I split this in a new thread, I guess I don’t really understand what is going on when using atomic operations, and now removed the initial test, where I used atomic_add! here because it didn’t make sense.

3 Likes

I read those posts and it is still not clear to me why someone who is not working with a ton of Fortran legacy code already would start with the language now, even if you manage all the ambitious goals of LFortran.

Don’t get me wrong, I think that Fortran was a terrific language for its time, but most of its key ingredients have been adapted by other languages since. Eg all the highlights you list were already in Julia from day 0, except for generalized indexing which was added later, but required no fundamental redesign of anything deep.

I would tend agree with Thomas Clune quoted in the article and @ChrisRackauckas above: aggressively devirtualized multiple dispatch is the game changer for Julia. Having a REPL, macros, and other nice things are just there because it would have been weird to design a language without them after the 1990s, just like seatbelts and later ABS became mandatory in cars — it is to obvious a benefit to miss.

9 Likes

Obviously, I cannot speak for Certik, but I thought I would mention that there’s a huge pressure on the employees of the national labs to keep Fortran codes alive. The national labs have huge investments in massive codes in Fortran, and whatever keeps them going is well rewarded.

12 Likes

Lack of aliasing makes Fortran a little hard to write, but it makes it easier for the compiler to optimize a few things. When aliasing is allowed, you need a bunch of analyses to figure out whether two bindings could ever be pointing to the same array (which may eliminate some things like simd ivdep), while you can those for free if the language does not allow that to ever occur. Moving to a dynamic language like Julia makes that even harder than proving it in a language like C, and so the common “I got within 2x of C for a direct translation but getting to 1-1 is hard” usually ends up boiling down to whether aliasing or escaping can be proved.

6 Likes

Except one.

Being able to produce small (or not that small) stand-alone executables. IMO this is a major feature that non-interactive language users have troubles to comprehend.

I wish I could upvote this twice. In my opinion this is a major disadvantage of Julia compared to other compiled languages.

For example, consider the Pixhawk hardware which is one of the most ubiquitous platforms to build autonomous planes, cars, boats, etc. on. It has at most 2 MB of flash memory. By some accounts a simple “Hello World” binary made using PackageCompiler.jl can be in excess of 100 MB. Perhaps this size can be reduced, but I’ve never seen any documentation or methodology on how to do so.

I think Julia is one of the most elegant and powerful languages available today. What’s been done with Julia in packages such as the SciML framework is astounding. But unless Julia gains the ability to compile compact standalone binaries, there will always be the need in domains such as robotics and autonomy to also use a second compiled language for final implementation.

10 Likes

Maybe there is where a LLVM Fortran implementation and Julia can finally converge? If one could write a fortran function like (to be simplistic)

function f(x)
  implicit none
  double precision :: y, x
  do i = 1, 10
    y = 2*x + y
  end
  return y
end

and call it directly from Julia, what would be the difference to a Julia code? The fact that inside that function aliasing is not allowed and some very minor syntactic differences. If there was a macro to avoid aliasing in Julia and make “implicit one”, such that one had to explicit define the scope of the variable, like:

@no_aliasing f(x)
  local y::Float64, x::Float64
  for i in 1:10
    y = 2*x + y
  end
  return y
end

That would be just like Fortran, and that could be a final optimization for some very performance-sensitive code. Of course such a function can be statically compiled, as everything is known. If LLVM can understand the Fortran function, it certainly could interpret the above “Julia” function in the same way.

Probably there is not much push to have such a feature because there are other strategies which experienced Julia programmers find to achieve that last 1:1 to C performance at the end, without needing static types (not that I am a parameter for anything, but even I managed to do it the programs I write for my research).

1 Like

There’s an issue open for this IIRC.

2 Likes

Probably this one: feature request: local static variables · Issue #15056 · JuliaLang/julia · GitHub

(lots of related requests, but it seems that at the end the specific problems were always solved otherwise)

I used to hear this all the time as a reason that Fortran was purportedly better than C for high-performance code. Then in 1999, C finally got the restrict keyword, which erased this supposedly huge advantage … but hardly anyone seems to use it because the benefits turn out to be marginal at best in most real-world cases, and they still haven’t bothered to officially adopt it for C++ (though many compilers support it as an extension).

This is not the “aliasing” that Chris was referring to. He was referring to aliasing where two pointers (e.g. two arrays) refer to the same memory under the hood, which is disallowed by Fortran and in principle this can allow additional compiler optimizations in some cases. (In Julia parlance, it’s about multiple references to the same instance of a mutable type.)

(In contrast, alias analysis is irrelevant for variables of immutable types, as in the code you posted above working with Float64 variables.)

9 Likes

Yes, my example was too simplistic, I was trying to say that a label is always bound to the same variable (mutable or immutable), given the declarations. I understand (after learning here) that for immutables this distinction does not even make sense.

(probably what you said is even more precise, there is some abstraction to get the difference between the pointer and the label, but from a Fortran user perspective I think they are pretty much the same).

How does LFortran compare with Cling, the interactive C++ interpreter? Personally, I tried the latter in Jupyter for 5 minutes before giving up, because it doesn’t support structured bindings introduced by C++17 (similar to multple assignments in Python and Julia), which is apparently an old unresolved bug in Cling. Maybe C++ features are growing too fast for them to keep up.

LoopVectorization.jl is hard to beat for examples like that:

julia> using FLoops, Base.Threads, LoopVectorization, BenchmarkTools

julia> function f_threads(N)
         total_Sum = zeros(nthreads())
         @threads for i in 1:N
           total_Sum[threadid()] += sin(i)
         end
         sum(total_Sum)
       end
f_threads (generic function with 1 method)

julia> function f_floop(N)
         @floop for i in 1:N
           @reduce(total_Sum += sin(i))
         end
         total_Sum
       end
f_floop (generic function with 1 method)

julia> function f_avx(N)
         s = 0.0
         @avxt for i in 1:N
           s += sin(i)
         end
         s
       end
f_avx (generic function with 1 method)

julia> const LIBOMPHELLO = "/home/chriselrod/Documents/progwork/fortran/libparallelhelloworld.so";

julia> f_omp(N::Int) = @ccall LIBOMPHELLO.sin_range_sum(N::Ref{Clong})::Float64
f_omp (generic function with 1 method)

julia> @time f_omp(100)
  0.000891 seconds
-0.12717101366041927

julia> @time f_threads(100)
  0.099024 seconds (359.03 k allocations: 20.958 MiB, 14.34% compilation time)
-0.12717101366042005

julia> @time f_floop(100)
┌ Warning: use values(kwargs) and keys(kwargs) instead of kwargs.data and kwargs.itr
│   caller = #AdHocRF#105 at combinators.jl:249 [inlined]
└ @ Core ~/.julia/packages/Transducers/oLdU1/src/combinators.jl:249
┌ Warning: use values(kwargs) and keys(kwargs) instead of kwargs.data and kwargs.itr
│   caller = #_#114 at executors.jl:46 [inlined]
└ @ Core ~/.julia/packages/Transducers/oLdU1/src/executors.jl:46
  0.562706 seconds (2.29 M allocations: 128.427 MiB, 5.08% gc time, 3.53% compilation time)
-0.12717101366042072

julia> @time f_avx(100)
  0.000020 seconds
-0.12717101366042005

julia> sleep(1);

julia> N = 1000000000
1000000000

julia> @time f_omp(N)
  2.205502 seconds
0.42129448674486225

julia> @btime f_omp($N)
  2.181 s (0 allocations: 0 bytes)
0.4212944867448605

julia> @time f_threads(N)
  5.049253 seconds (190 allocations: 16.703 KiB)
0.42129448674422765

julia> @time f_floop(N)
  1.272356 seconds (693 allocations: 55.703 KiB)
0.42129448674746617

julia> @btime f_floop($N)
  1.272 s (711 allocations: 57.05 KiB)
0.42129448674746617

julia> @time f_avx(N)
  0.077365 seconds (103 allocations: 6.938 KiB, 3.85% compilation time)
0.4212944867467526

julia> @btime f_avx($N)
  45.923 ms (0 allocations: 0 bytes)
0.4212944867467526

julia> versioninfo()
Julia Version 1.7.0-DEV.1052
Commit cc59947c03* (2021-05-02 02:36 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

Thats 46 milliseconds for the @avxt version, vs >2 seconds for the OMP Fortran, for more than a 40x improvement.

Fortran code:

Summary
module Parallel_Hello_World

  use ISO_C_BINDING
  USE OMP_LIB
  
  implicit none

  contains

    real(C_double) function sin_range_sum(N) BIND(C, name="sin_range_sum")
      integer(C_long), intent(in) :: N
      real(C_double)  :: s
      integer(C_long) :: i
      s = 0d0
!$omp parallel do reduction(+: s)
      do i = 1,N
         s = s + dsin(dble(i))
      end do
      sin_range_sum = s
    end function sin_range_sum

end module Parallel_Hello_World

Like in the above demo, many of LoopVectorization’s benchmarks are examples of small benchmarks where “pure Julia” code has a marked advantage.
When it comes to extracting all the performance you can get, Julia’s metaprogramming facilities and multipledispatch are a great tool, allowing for simple looking code that expresses the intent/behavior, while allowing arbitrary transformations under the hood to implement that behavior as quickly as possible.

Two options from Base Julia:
@simd ivdep, and Base.Experimental.@aliasscope with Const on Arrays

help?> Base.Experimental.@aliasscope
  @aliasscope expr


  Allows the compiler to assume that all Consts are not being modified through stores within this scope, even if the compiler can't prove this to be the case.

  │ Warning
  │
  │  Experimental API. Subject to change without deprecation.

Note that LLVM’s aliasing assumptions require that elements do not alias one another. This is violated by BitArrays, because many bits belong to the same chunk (chunks are UInt64, or 64 bits, under the hood).
Within the package ecosystem, @avx also assumes no aliasing.

To expand on this, every now and then there’s a benchmark where it helps. For example:

julia> function simd_map!(f, y, x)
           @inbounds @simd for i ∈ eachindex(y,x)
               y[i] = f(x[i])
           end
       end
simd_map! (generic function with 1 method)

julia> function simd_ivdep_map!(f, y, x)
           @inbounds @simd ivdep for i ∈ eachindex(y,x)
               y[i] = f(x[i])
           end
       end
simd_ivdep_map! (generic function with 1 method)

julia> using VectorizationBase

julia> x = randn(256); y = similar(x);

julia> @benchmark simd_map!(VectorizationBase.vexp, $y, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     653.994 ns (0.00% GC)
  median time:      691.679 ns (0.00% GC)
  mean time:        692.923 ns (0.00% GC)
  maximum time:     920.909 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     165

julia> @benchmark simd_ivdep_map!(VectorizationBase.vexp, $y, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     130.065 ns (0.00% GC)
  median time:      132.843 ns (0.00% GC)
  mean time:        132.991 ns (0.00% GC)
  maximum time:     175.632 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     891

julia> y ≈ exp.(x)
true

VectorizationBase.vexp, like Base.exp, uses a lookup table. Base.exp however is not marked @inline, so the compiler can’t optimize it here, preventing us from being able to use it as an example.

Because exp is looking up some values from a table, the compiler needs to know that y – the vector our loops are storing into – doesn’t alias that table. If they did alias, each assignment to y could change future loads, meaning that the iterations would have to be done in order, i.e. parallel execution with SIMD wouldn’t yield the correct answer. (Of course, you should never be writing into exp's table, that’d corrupt exp!)

Because the accesses to the table are random, LLVM doesn’t know how to prove they don’t alias with simple pointer comparisons. If accesses were regular, you could just check that the pointers are far enough apart, e.g. at least the register size * unroll factor apart.

Compilers are really good about doing this these days, and a simple check like that is so cheap, that normally it’s hard to notice the difference. It’s just in the tricker cases, like with tables, that it can make a big difference.

19 Likes

Looks interesting, but it is still C++ :frowning:

1 Like

The Base.exp case is especially annoying because it uses a Tuple for the table, so the compiler really should be able to figure out that there isn’t any possibility of aliasing.

The main difference from Cling is that Cling is built on top of Clang while LFortran is a compiler that works both as an ahead of time compiler to binaries, as well as interactively and it has been designed from scratch to be that way. The interactive part is just a relatively small add-on. All features should work both ways.

2 Likes

I know. :frowning:
It’d be good to make an internal exp_inline version that’s marked @inline, and then have exp(x::Union{Float32,Float64}) = exp_inline(x).
That’d make it easier to test/track.

Or at least switch BetterExp to a tuple, so we can use that.
VectorizationBase needs pointers, so it can’t use a tuple itself.

Once we have an inlined tuple version to look at, we should check the TBAA metadata and then file an LLVM bug report. Perhaps that can be fixed, or they can point us to something that’s missing.

3 Likes

Awesome! (as usual…) Note you took a pretty large base case, and I would assume the speedup has more to do with single-core performance than multithreading performance. What I tried to measure some time ago was the threading overhead, i.e. how large must N be for threading to be helpful for a very simple function, and there I found openmp had a large advantage. But that was a while ago, and before loopvectorization and floops…

1 Like

I know that history, but at least here I know it’s the case because we see simd ivdep and such missing from Julia broadcast loops all of the time. We ended up creating an alternative broadcast that forces it to be true:

This was required to get DiffEq to match what @Elrod was seeing was optimal. In Julia v1.6 there were some regressions in this too, so @YingboMa and @Elrod created another deep workaround:

So at least the issues I’ve been chasing all mostly seem to come from inability to prove these properties. @dextorious had a few issues into Julia Base which highlighted a few other cases as well. I think the main issue is, restrict wasn’t quite necessary because C does at least okay at proving it, while Julia is a little less reliable at proving it which makes it a bigger deal. (And in DiffEq, the majority of external codes we benchmark against are Fortran or Sundials, so that gives us a very high bar)

5 Likes

Let’s first do just single core performance. Here is how I would write it in Fortran:

program avx
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: t1, t2, r

call cpu_time(t1)
r = f(100000000)
call cpu_time(t2)

print *, "Time", t2-t1
print *, r

contains

    real(dp) function f(N) result(r)
    integer, intent(in) :: N
    integer :: i
    r = 0
    do i = 1, N
        r = r + sin(real(i,dp))
    end do
    end function

end program

Compile and run (gfortran 9.3.0):

$ gfortran -Ofast avx.f90
$ ./a.out
 Time   1.4622860000000000     
   1.7136493465705178     

Then compare to pure Julia (1.6.1) first:

function f(N)
    s = 0.0
    for i in 1:N
        s += sin(i)
    end
    s
end

@time r = f(100000000)
println(r)

Compile and run:

$ julia f.jl
  2.784782 seconds (1 allocation: 16 bytes)
1.7136493465703402

So the Fortran code executes 1.9x faster than Julia. I checked the assembly and neither Julia nor gfortran generates AVX instructions for some reason (both are using the xmm* registers). So the comparison should be fair. Why cannot Julia generate AVX instructions by default? I don’t know why gfortran does not.

Also note that the speed of compilation+run for N=10 for the Fortran version is about 0.162s:

$ time (gfortran -Ofast avx.f90 && ./a.out)
 Time   9.9999999999969905E-007
   1.4111883712180107     
( gfortran -Ofast avx.f90 && ./a.out; )  0.08s user 0.04s system 73% cpu 0.162 total

While for Julia it is 0.484s:

$ time julia f.jl
  0.000004 seconds (1 allocation: 16 bytes)
1.4111883712180104
julia f.jl  1.03s user 0.20s system 253% cpu 0.484 total

So Julia is 3x slower to compile. I assume it is the slow startup time or something. But this is the other aspect of tooling and user experience.

Now let’s use the @avxt macro.

using LoopVectorization

function f_avx(N)
    s = 0.0
    @avxt for i in 1:N
        s += sin(i)
    end
    s
end

@time r = f_avx(100000000)
println(r)

Compile and run:

$ julia avx.jl
  0.185562 seconds (1 allocation: 16 bytes)
1.713649346570267

Things got 15x faster than the pure Julia version and about 7.9x faster than the Fortran version.

@Elrod do you know if there is a reason why both Julia and Fortran couldn’t generate the fast AVX version by default? As a user that is what I would want.

My Julia version:

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
10 Likes