Julia gets mentioned in an article about FORTRAN

I have the impression that even the developers share this view, but I don’t completely get what is missing from what is implemented already. From a syntatic point of view, using @threads or even packages as FLoops is much easier and cleaner than using OpenMP.

@threads has a known overhead if applied to parallelize fast computations (is that what you and others mean by it is not quite there yet?). FLoops (as an example) solves this for at least quite a few problems I have faced.

My experience, as a user of those tools, is that using threding in Julia is already at least much easier than using OpenMP. But it is likely that my usage is very narrow and I don’t capture what is missing.


but it’s (as of now) slower

Can you provide an example? It’s interesting to have some numbers to understand the size of the problem.

1 Like

I didn’t mean to imply that there’s something wrong with Fortran or OpenMP. I’ve used OpenMP with Fortran, a few times, on very trivially parallelizable code and it was pleasant to work for that case. I was just curious if it’s a part of language such as coarrays and whether it has ability to schedule tasks depth first. Thank you for explanation.

1 Like

Can you give some examples about it? I don’t think Fortran can perform some secret compiler optimizations other languages can’t have. ML and LLVM community also puts really a lot of efforts into specialized compiler optimization, such as vectorization and polyhedral model, which is hard to beat.

1 Like

The last time I looked at it was

but it would be awesome to have more up to date benchmarks.

1 Like

Here is one example (not sure how representative it is):


PROGRAM Parallel_Hello_World
REAL*8 :: partial_Sum, total_Sum
!$OMP PARALLEL PRIVATE(partial_Sum) SHARED(total_Sum)
    partial_Sum = 0.d0
    total_Sum = 0.d0

    !$OMP DO
    DO i=1,1000000000
        partial_Sum = partial_Sum + dsin(dble(i))
    END DO
    !$OMP END DO

        total_Sum = total_Sum + partial_Sum

PRINT *, "Total Sum: ", total_Sum


leandro@pitico:~/Drive/Work/JuliaPlay% gfortran -O3 omp.f95 -o omp
leandro@pitico:~/Drive/Work/JuliaPlay% time ./omp
 Total Sum:   0.42129448674541342

real    0m54,495s
user    0m54,476s
sys     0m0,004s
leandro@pitico:~/Drive/Work/JuliaPlay% gfortran -O3 -fopenmp omp.f95 -o omp
leandro@pitico:~/Drive/Work/JuliaPlay% time ./omp
 Total Sum:   0.42129448674645509

real    0m12,538s
user    1m29,020s
sys     0m0,016s

Julia using FLoops:

using FLoops
function f()
    @floop for i in 1:1000000000
        @reduce(total_Sum += sin(i))
println("Total sum: ",f())


leandro@pitico:~/Drive/Work/JuliaPlay% time julia floop.jl
Total sum: 0.4212944867466973

real    0m40,729s
user    0m41,039s
sys     0m0,608s

leandro@pitico:~/Drive/Work/JuliaPlay% time julia -t8 floop.jl
Total sum: 0.4212944867465145

real    0m14,743s
user    1m4,074s
sys     0m0,751s

Julia using @threads:

using Base.Threads
function f()
    total_Sum = zeros(nthreads())
    @threads for i in 1:1000000000
        total_Sum[threadid()] += sin(i)
println("Total sum: ",f())
leandro@pitico:~/Drive/Work/JuliaPlay% time julia -t8 threads.jl 
Total sum: 0.4212944867465146

real    0m10,514s
user    1m11,253s
sys     0m0,544s

Note that Julia is faster than Fortran with a single thread (no idea why).

I split this in a new thread, I guess I don’t really understand what is going on when using atomic operations, and now removed the initial test, where I used atomic_add! here because it didn’t make sense.


I read those posts and it is still not clear to me why someone who is not working with a ton of Fortran legacy code already would start with the language now, even if you manage all the ambitious goals of LFortran.

Don’t get me wrong, I think that Fortran was a terrific language for its time, but most of its key ingredients have been adapted by other languages since. Eg all the highlights you list were already in Julia from day 0, except for generalized indexing which was added later, but required no fundamental redesign of anything deep.

I would tend agree with Thomas Clune quoted in the article and @ChrisRackauckas above: aggressively devirtualized multiple dispatch is the game changer for Julia. Having a REPL, macros, and other nice things are just there because it would have been weird to design a language without them after the 1990s, just like seatbelts and later ABS became mandatory in cars — it is to obvious a benefit to miss.


Obviously, I cannot speak for Certik, but I thought I would mention that there’s a huge pressure on the employees of the national labs to keep Fortran codes alive. The national labs have huge investments in massive codes in Fortran, and whatever keeps them going is well rewarded.


Lack of aliasing makes Fortran a little hard to write, but it makes it easier for the compiler to optimize a few things. When aliasing is allowed, you need a bunch of analyses to figure out whether two bindings could ever be pointing to the same array (which may eliminate some things like simd ivdep), while you can those for free if the language does not allow that to ever occur. Moving to a dynamic language like Julia makes that even harder than proving it in a language like C, and so the common “I got within 2x of C for a direct translation but getting to 1-1 is hard” usually ends up boiling down to whether aliasing or escaping can be proved.


Except one.

Being able to produce small (or not that small) stand-alone executables. IMO this is a major feature that non-interactive language users have troubles to comprehend.

I wish I could upvote this twice. In my opinion this is a major disadvantage of Julia compared to other compiled languages.

For example, consider the Pixhawk hardware which is one of the most ubiquitous platforms to build autonomous planes, cars, boats, etc. on. It has at most 2 MB of flash memory. By some accounts a simple “Hello World” binary made using PackageCompiler.jl can be in excess of 100 MB. Perhaps this size can be reduced, but I’ve never seen any documentation or methodology on how to do so.

I think Julia is one of the most elegant and powerful languages available today. What’s been done with Julia in packages such as the SciML framework is astounding. But unless Julia gains the ability to compile compact standalone binaries, there will always be the need in domains such as robotics and autonomy to also use a second compiled language for final implementation.


Maybe there is where a LLVM Fortran implementation and Julia can finally converge? If one could write a fortran function like (to be simplistic)

function f(x)
  implicit none
  double precision :: y, x
  do i = 1, 10
    y = 2*x + y
  return y

and call it directly from Julia, what would be the difference to a Julia code? The fact that inside that function aliasing is not allowed and some very minor syntactic differences. If there was a macro to avoid aliasing in Julia and make “implicit one”, such that one had to explicit define the scope of the variable, like:

@no_aliasing f(x)
  local y::Float64, x::Float64
  for i in 1:10
    y = 2*x + y
  return y

That would be just like Fortran, and that could be a final optimization for some very performance-sensitive code. Of course such a function can be statically compiled, as everything is known. If LLVM can understand the Fortran function, it certainly could interpret the above “Julia” function in the same way.

Probably there is not much push to have such a feature because there are other strategies which experienced Julia programmers find to achieve that last 1:1 to C performance at the end, without needing static types (not that I am a parameter for anything, but even I managed to do it the programs I write for my research).

1 Like

There’s an issue open for this IIRC.


Probably this one: feature request: local static variables · Issue #15056 · JuliaLang/julia · GitHub

(lots of related requests, but it seems that at the end the specific problems were always solved otherwise)

I used to hear this all the time as a reason that Fortran was purportedly better than C for high-performance code. Then in 1999, C finally got the restrict keyword, which erased this supposedly huge advantage … but hardly anyone seems to use it because the benefits turn out to be marginal at best in most real-world cases, and they still haven’t bothered to officially adopt it for C++ (though many compilers support it as an extension).

This is not the “aliasing” that Chris was referring to. He was referring to aliasing where two pointers (e.g. two arrays) refer to the same memory under the hood, which is disallowed by Fortran and in principle this can allow additional compiler optimizations in some cases. (In Julia parlance, it’s about multiple references to the same instance of a mutable type.)

(In contrast, alias analysis is irrelevant for variables of immutable types, as in the code you posted above working with Float64 variables.)


Yes, my example was too simplistic, I was trying to say that a label is always bound to the same variable (mutable or immutable), given the declarations. I understand (after learning here) that for immutables this distinction does not even make sense.

(probably what you said is even more precise, there is some abstraction to get the difference between the pointer and the label, but from a Fortran user perspective I think they are pretty much the same).

How does LFortran compare with Cling, the interactive C++ interpreter? Personally, I tried the latter in Jupyter for 5 minutes before giving up, because it doesn’t support structured bindings introduced by C++17 (similar to multple assignments in Python and Julia), which is apparently an old unresolved bug in Cling. Maybe C++ features are growing too fast for them to keep up.

LoopVectorization.jl is hard to beat for examples like that:

julia> using FLoops, Base.Threads, LoopVectorization, BenchmarkTools

julia> function f_threads(N)
         total_Sum = zeros(nthreads())
         @threads for i in 1:N
           total_Sum[threadid()] += sin(i)
f_threads (generic function with 1 method)

julia> function f_floop(N)
         @floop for i in 1:N
           @reduce(total_Sum += sin(i))
f_floop (generic function with 1 method)

julia> function f_avx(N)
         s = 0.0
         @avxt for i in 1:N
           s += sin(i)
f_avx (generic function with 1 method)

julia> const LIBOMPHELLO = "/home/chriselrod/Documents/progwork/fortran/libparallelhelloworld.so";

julia> f_omp(N::Int) = @ccall LIBOMPHELLO.sin_range_sum(N::Ref{Clong})::Float64
f_omp (generic function with 1 method)

julia> @time f_omp(100)
  0.000891 seconds

julia> @time f_threads(100)
  0.099024 seconds (359.03 k allocations: 20.958 MiB, 14.34% compilation time)

julia> @time f_floop(100)
┌ Warning: use values(kwargs) and keys(kwargs) instead of kwargs.data and kwargs.itr
│   caller = #AdHocRF#105 at combinators.jl:249 [inlined]
└ @ Core ~/.julia/packages/Transducers/oLdU1/src/combinators.jl:249
┌ Warning: use values(kwargs) and keys(kwargs) instead of kwargs.data and kwargs.itr
│   caller = #_#114 at executors.jl:46 [inlined]
└ @ Core ~/.julia/packages/Transducers/oLdU1/src/executors.jl:46
  0.562706 seconds (2.29 M allocations: 128.427 MiB, 5.08% gc time, 3.53% compilation time)

julia> @time f_avx(100)
  0.000020 seconds

julia> sleep(1);

julia> N = 1000000000

julia> @time f_omp(N)
  2.205502 seconds

julia> @btime f_omp($N)
  2.181 s (0 allocations: 0 bytes)

julia> @time f_threads(N)
  5.049253 seconds (190 allocations: 16.703 KiB)

julia> @time f_floop(N)
  1.272356 seconds (693 allocations: 55.703 KiB)

julia> @btime f_floop($N)
  1.272 s (711 allocations: 57.05 KiB)

julia> @time f_avx(N)
  0.077365 seconds (103 allocations: 6.938 KiB, 3.85% compilation time)

julia> @btime f_avx($N)
  45.923 ms (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.7.0-DEV.1052
Commit cc59947c03* (2021-05-02 02:36 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)

Thats 46 milliseconds for the @avxt version, vs >2 seconds for the OMP Fortran, for more than a 40x improvement.

Fortran code:

module Parallel_Hello_World

  implicit none


    real(C_double) function sin_range_sum(N) BIND(C, name="sin_range_sum")
      integer(C_long), intent(in) :: N
      real(C_double)  :: s
      integer(C_long) :: i
      s = 0d0
!$omp parallel do reduction(+: s)
      do i = 1,N
         s = s + dsin(dble(i))
      end do
      sin_range_sum = s
    end function sin_range_sum

end module Parallel_Hello_World

Like in the above demo, many of LoopVectorization’s benchmarks are examples of small benchmarks where “pure Julia” code has a marked advantage.
When it comes to extracting all the performance you can get, Julia’s metaprogramming facilities and multipledispatch are a great tool, allowing for simple looking code that expresses the intent/behavior, while allowing arbitrary transformations under the hood to implement that behavior as quickly as possible.

Two options from Base Julia:
@simd ivdep, and Base.Experimental.@aliasscope with Const on Arrays

help?> Base.Experimental.@aliasscope
  @aliasscope expr

  Allows the compiler to assume that all Consts are not being modified through stores within this scope, even if the compiler can't prove this to be the case.

  │ Warning
  │  Experimental API. Subject to change without deprecation.

Note that LLVM’s aliasing assumptions require that elements do not alias one another. This is violated by BitArrays, because many bits belong to the same chunk (chunks are UInt64, or 64 bits, under the hood).
Within the package ecosystem, @avx also assumes no aliasing.

To expand on this, every now and then there’s a benchmark where it helps. For example:

julia> function simd_map!(f, y, x)
           @inbounds @simd for i ∈ eachindex(y,x)
               y[i] = f(x[i])
simd_map! (generic function with 1 method)

julia> function simd_ivdep_map!(f, y, x)
           @inbounds @simd ivdep for i ∈ eachindex(y,x)
               y[i] = f(x[i])
simd_ivdep_map! (generic function with 1 method)

julia> using VectorizationBase

julia> x = randn(256); y = similar(x);

julia> @benchmark simd_map!(VectorizationBase.vexp, $y, $x)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     653.994 ns (0.00% GC)
  median time:      691.679 ns (0.00% GC)
  mean time:        692.923 ns (0.00% GC)
  maximum time:     920.909 ns (0.00% GC)
  samples:          10000
  evals/sample:     165

julia> @benchmark simd_ivdep_map!(VectorizationBase.vexp, $y, $x)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     130.065 ns (0.00% GC)
  median time:      132.843 ns (0.00% GC)
  mean time:        132.991 ns (0.00% GC)
  maximum time:     175.632 ns (0.00% GC)
  samples:          10000
  evals/sample:     891

julia> y ≈ exp.(x)

VectorizationBase.vexp, like Base.exp, uses a lookup table. Base.exp however is not marked @inline, so the compiler can’t optimize it here, preventing us from being able to use it as an example.

Because exp is looking up some values from a table, the compiler needs to know that y – the vector our loops are storing into – doesn’t alias that table. If they did alias, each assignment to y could change future loads, meaning that the iterations would have to be done in order, i.e. parallel execution with SIMD wouldn’t yield the correct answer. (Of course, you should never be writing into exp's table, that’d corrupt exp!)

Because the accesses to the table are random, LLVM doesn’t know how to prove they don’t alias with simple pointer comparisons. If accesses were regular, you could just check that the pointers are far enough apart, e.g. at least the register size * unroll factor apart.

Compilers are really good about doing this these days, and a simple check like that is so cheap, that normally it’s hard to notice the difference. It’s just in the tricker cases, like with tables, that it can make a big difference.


Looks interesting, but it is still C++ :frowning:

1 Like

The Base.exp case is especially annoying because it uses a Tuple for the table, so the compiler really should be able to figure out that there isn’t any possibility of aliasing.

The main difference from Cling is that Cling is built on top of Clang while LFortran is a compiler that works both as an ahead of time compiler to binaries, as well as interactively and it has been designed from scratch to be that way. The interactive part is just a relatively small add-on. All features should work both ways.