I didn’t mean to imply that there’s something wrong with Fortran or OpenMP. I’ve used OpenMP with Fortran a few times, on very trivially parallelizable code, and it was pleasant to work with in that case. I was just curious whether it is part of the language, like coarrays, and whether it has the ability to schedule tasks depth-first. Thank you for the explanation.
Can you give some examples of this? I don’t think Fortran can perform secret compiler optimizations that other languages can’t have. The ML and LLVM communities also put a lot of effort into specialized compiler optimizations, such as vectorization and the polyhedral model, which are hard to beat.
The last time I looked at it was
but it would be awesome to have more up-to-date benchmarks.
Here is one example (not sure how representative it is):
Fortran:
PROGRAM Parallel_Hello_World
  USE OMP_LIB
  REAL*8 :: partial_Sum, total_Sum
  !$OMP PARALLEL PRIVATE(partial_Sum) SHARED(total_Sum)
  partial_Sum = 0.d0
  total_Sum = 0.d0
  !$OMP DO
  DO i=1,1000000000
    partial_Sum = partial_Sum + dsin(dble(i))
  END DO
  !$OMP END DO
  !$OMP CRITICAL
  total_Sum = total_Sum + partial_Sum
  !$OMP END CRITICAL
  !$OMP END PARALLEL
  PRINT *, "Total Sum: ", total_Sum
END
Result:
leandro@pitico:~/Drive/Work/JuliaPlay% gfortran -O3 omp.f95 -o omp
leandro@pitico:~/Drive/Work/JuliaPlay% time ./omp
Total Sum: 0.42129448674541342
real 0m54,495s
user 0m54,476s
sys 0m0,004s
leandro@pitico:~/Drive/Work/JuliaPlay% gfortran -O3 -fopenmp omp.f95 -o omp
leandro@pitico:~/Drive/Work/JuliaPlay% time ./omp
Total Sum: 0.42129448674645509
real 0m12,538s
user 1m29,020s
sys 0m0,016s
Julia using FLoops:
using FLoops
function f()
    @floop for i in 1:1000000000
        @reduce(total_Sum += sin(i))
    end
    total_Sum
end
println("Total sum: ",f())
Result:
leandro@pitico:~/Drive/Work/JuliaPlay% time julia floop.jl
Total sum: 0.4212944867466973
real 0m40,729s
user 0m41,039s
sys 0m0,608s
leandro@pitico:~/Drive/Work/JuliaPlay% time julia -t8 floop.jl
Total sum: 0.4212944867465145
real 0m14,743s
user 1m4,074s
sys 0m0,751s
Julia using @threads:
using Base.Threads
function f()
    total_Sum = zeros(nthreads())
    @threads for i in 1:1000000000
        total_Sum[threadid()] += sin(i)
    end
    sum(total_Sum)
end
println("Total sum: ",f())
leandro@pitico:~/Drive/Work/JuliaPlay% time julia -t8 threads.jl
Total sum: 0.4212944867465146
real 0m10,514s
user 1m11,253s
sys 0m0,544s
Note that Julia is faster than Fortran with a single thread (no idea why).
I split this into a new thread. I guess I don’t really understand what is going on when using atomic operations, so I have now removed the initial test here, where I used atomic_add!, because it didn’t make sense.
I read those posts, and it is still not clear to me why someone who is not already working with a ton of legacy Fortran code would start with the language now, even if you achieve all of LFortran’s ambitious goals.
Don’t get me wrong, I think that Fortran was a terrific language for its time, but most of its key ingredients have since been adopted by other languages. E.g. all the highlights you list were in Julia from day 0, except for generalized indexing, which was added later but required no fundamental redesign of anything deep.
I would tend to agree with Thomas Clune, quoted in the article, and @ChrisRackauckas above: aggressively devirtualized multiple dispatch is the game changer for Julia. Having a REPL, macros, and other nice things are just there because it would have been weird to design a language without them after the 1990s, just like seatbelts and later ABS became mandatory in cars: it is too obvious a benefit to miss.
Obviously, I cannot speak for Certik, but I thought I would mention that there is huge pressure on the employees of the national labs to keep Fortran codes alive. The national labs have huge investments in massive Fortran codes, and whatever keeps them going is well rewarded.
Lack of aliasing makes Fortran a little harder to write, but it makes it easier for the compiler to optimize a few things. When aliasing is allowed, you need a bunch of analyses to figure out whether two bindings could ever point to the same array (and when that cannot be proved, optimizations like simd ivdep are off the table), while you get those for free if the language does not allow aliasing to ever occur. Proving this in a dynamic language like Julia is even harder than in a language like C, and so the common “I got within 2x of C for a direct translation but getting to 1-1 is hard” usually ends up boiling down to whether aliasing or escaping can be proved.
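Here is a minimal sketch (mine, not from the posts above) of the kind of loop a compiler has to be careful with: if the output and input bindings may refer to the same array, the iterations are no longer independent and cannot be vectorized or reordered.

# If `y` and `x` might be the same array, the store to y[i-1] on the previous
# iteration changes the value read as x[i-1] here, so the compiler must assume
# a serial dependence unless it can prove the two bindings never alias.
function shift_add!(y, x)
    @inbounds for i in firstindex(x)+1:lastindex(x)
        y[i] = x[i] + x[i-1]
    end
    return y
end

a = collect(1.0:8.0)
shift_add!(zero(a), a)   # separate output buffer: [0, 3, 5, 7, 9, 11, 13, 15]
shift_add!(a, a)         # aliased (y === x): a running recurrence, different result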
Except one.
Being able to produce small (or not that small) stand-alone executables. IMO this is a major feature that non-interactive language users have trouble comprehending.
I wish I could upvote this twice. In my opinion this is a major disadvantage of Julia compared to other compiled languages.
For example, consider the Pixhawk hardware which is one of the most ubiquitous platforms to build autonomous planes, cars, boats, etc. on. It has at most 2 MB of flash memory. By some accounts a simple “Hello World” binary made using PackageCompiler.jl can be in excess of 100 MB. Perhaps this size can be reduced, but I’ve never seen any documentation or methodology on how to do so.
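For reference, the usual starting point is something like the sketch below (“MyApp” is a placeholder project; the exact options and resulting size depend on the PackageCompiler.jl version):

using PackageCompiler

# Build a standalone app from a project directory. filter_stdlibs trims unused
# standard libraries from the bundled sysimage, which helps somewhat, but the
# output still ships the Julia runtime, so it stays far above a few MB.
create_app("MyApp", "MyAppCompiled"; filter_stdlibs=true, incremental=false)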
I think Julia is one of the most elegant and powerful languages available today. What’s been done with Julia in packages such as the SciML framework is astounding. But unless Julia gains the ability to compile compact standalone binaries, there will always be the need in domains such as robotics and autonomy to also use a second compiled language for final implementation.
Maybe this is where an LLVM Fortran implementation and Julia can finally converge? If one could write a Fortran function like this (to be simplistic):
function f(x) result(y)
    implicit none
    double precision, intent(in) :: x
    double precision :: y
    integer :: i
    y = 0.d0
    do i = 1, 10
        y = 2*x + y
    end do
end function f
and call it directly from Julia, what would the difference be from Julia code? The fact that inside that function aliasing is not allowed, plus some very minor syntactic differences. If there were a macro to forbid aliasing in Julia and enforce “implicit none”, such that one had to explicitly declare the type and scope of each variable, like:
@no_aliasing function f(x)
    local y::Float64, x::Float64
    y = 0.0
    for i in 1:10
        y = 2*x + y
    end
    return y
end
That would be just like Fortran, and that could be a final optimization for some very performance-sensitive code. Of course such a function can be statically compiled, as everything is known. If LLVM can understand the Fortran function, it certainly could interpret the above “Julia” function in the same way.
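For what it’s worth, calling a Fortran function like that from Julia already works today. A minimal sketch, with hypothetical file and library names, assuming gfortran’s default conventions of a trailing underscore on the symbol and pass-by-reference arguments:

# Assuming the Fortran function above is saved in f.f90 and compiled with
#   gfortran -O3 -shared -fPIC f.f90 -o libf.so
# gfortran exposes it as the symbol "f_" and expects its argument by reference:
y = ccall((:f_, "./libf.so"), Float64, (Ref{Float64},), 3.0)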
Probably there is not much push to have such a feature because there are other strategies that experienced Julia programmers use to achieve that last bit of 1:1-with-C performance, without needing static types (not that I am a reference for anything, but even I managed to do it in the programs I write for my research).
There’s an issue open for this IIRC.
Probably this one: https://github.com/JuliaLang/julia/issues/15056
(lots of related requests, but it seems that at the end the specific problems were always solved otherwise)
I used to hear this all the time as a reason that Fortran was purportedly better than C for high-performance code. Then in 1999, C finally got the restrict keyword, which erased this supposedly huge advantage … but hardly anyone seems to use it, because the benefits turn out to be marginal at best in most real-world cases, and they still haven’t bothered to officially adopt it for C++ (though many compilers support it as an extension).
This is not the “aliasing” that Chris was referring to. He was referring to aliasing where two pointers (e.g. two arrays) refer to the same memory under the hood, which is disallowed by Fortran and in principle this can allow additional compiler optimizations in some cases. (In Julia parlance, it’s about multiple references to the same instance of a mutable type.)
(In contrast, alias analysis is irrelevant for variables of immutable types, as in the code you posted above working with Float64 variables.)
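A quick illustration of that distinction (my example, not from the posts):

a = [1.0, 2.0]
b = a          # two bindings to the same mutable Array: they alias
b[1] = 99.0
a[1]           # 99.0 -- a store through b is visible through a

x = 1.0
y = x          # immutable Float64: y is just a value, there is nothing to alias
y += 1.0
x              # still 1.0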
Yes, my example was too simplistic. I was trying to say that a label is always bound to the same variable (mutable or immutable), given the declarations. I understand (after learning here) that for immutables this distinction does not even make sense.
(Probably what you said is even more precise; there is some abstraction separating the pointer from the label, but from a Fortran user’s perspective I think they are pretty much the same.)
How does LFortran compare with Cling, the interactive C++ interpreter? Personally, I tried the latter in Jupyter for 5 minutes before giving up, because it doesn’t support the structured bindings introduced by C++17 (similar to multiple assignment in Python and Julia), which is apparently an old unresolved bug in Cling. Maybe C++ features are growing too fast for them to keep up.
LoopVectorization.jl is hard to beat for examples like that:
julia> using FLoops, Base.Threads, LoopVectorization, BenchmarkTools
julia> function f_threads(N)
           total_Sum = zeros(nthreads())
           @threads for i in 1:N
               total_Sum[threadid()] += sin(i)
           end
           sum(total_Sum)
       end
f_threads (generic function with 1 method)
julia> function f_floop(N)
           @floop for i in 1:N
               @reduce(total_Sum += sin(i))
           end
           total_Sum
       end
f_floop (generic function with 1 method)
julia> function f_avx(N)
           s = 0.0
           @avxt for i in 1:N
               s += sin(i)
           end
           s
       end
f_avx (generic function with 1 method)
julia> const LIBOMPHELLO = "/home/chriselrod/Documents/progwork/fortran/libparallelhelloworld.so";
julia> f_omp(N::Int) = @ccall LIBOMPHELLO.sin_range_sum(N::Ref{Clong})::Float64
f_omp (generic function with 1 method)
julia> @time f_omp(100)
0.000891 seconds
-0.12717101366041927
julia> @time f_threads(100)
0.099024 seconds (359.03 k allocations: 20.958 MiB, 14.34% compilation time)
-0.12717101366042005
julia> @time f_floop(100)
┌ Warning: use values(kwargs) and keys(kwargs) instead of kwargs.data and kwargs.itr
│ caller = #AdHocRF#105 at combinators.jl:249 [inlined]
└ @ Core ~/.julia/packages/Transducers/oLdU1/src/combinators.jl:249
┌ Warning: use values(kwargs) and keys(kwargs) instead of kwargs.data and kwargs.itr
│ caller = #_#114 at executors.jl:46 [inlined]
└ @ Core ~/.julia/packages/Transducers/oLdU1/src/executors.jl:46
0.562706 seconds (2.29 M allocations: 128.427 MiB, 5.08% gc time, 3.53% compilation time)
-0.12717101366042072
julia> @time f_avx(100)
0.000020 seconds
-0.12717101366042005
julia> sleep(1);
julia> N = 1000000000
1000000000
julia> @time f_omp(N)
2.205502 seconds
0.42129448674486225
julia> @btime f_omp($N)
2.181 s (0 allocations: 0 bytes)
0.4212944867448605
julia> @time f_threads(N)
5.049253 seconds (190 allocations: 16.703 KiB)
0.42129448674422765
julia> @time f_floop(N)
1.272356 seconds (693 allocations: 55.703 KiB)
0.42129448674746617
julia> @btime f_floop($N)
1.272 s (711 allocations: 57.05 KiB)
0.42129448674746617
julia> @time f_avx(N)
0.077365 seconds (103 allocations: 6.938 KiB, 3.85% compilation time)
0.4212944867467526
julia> @btime f_avx($N)
45.923 ms (0 allocations: 0 bytes)
0.4212944867467526
julia> versioninfo()
Julia Version 1.7.0-DEV.1052
Commit cc59947c03* (2021-05-02 02:36 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
That’s 46 milliseconds for the @avxt version, vs >2 seconds for the OMP Fortran: more than a 40x improvement.
Fortran code:
module Parallel_Hello_World
  use ISO_C_BINDING
  USE OMP_LIB
  implicit none
contains
  real(C_double) function sin_range_sum(N) BIND(C, name="sin_range_sum")
    integer(C_long), intent(in) :: N
    real(C_double) :: s
    integer(C_long) :: i
    s = 0d0
    !$omp parallel do reduction(+: s)
    do i = 1,N
      s = s + dsin(dble(i))
    end do
    sin_range_sum = s
  end function sin_range_sum
end module Parallel_Hello_World
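(The post doesn’t show how the shared library was built; presumably something along the lines of gfortran -O3 -fopenmp -shared -fPIC parallel_hello_world.f90 -o libparallelhelloworld.so, with the file name being my guess.)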
As in the demo above, many of LoopVectorization’s benchmarks are small examples where “pure Julia” code has a marked advantage.
When it comes to extracting all the performance you can get, Julia’s metaprogramming facilities and multiple dispatch are great tools, allowing for simple-looking code that expresses the intent/behavior, while allowing arbitrary transformations under the hood to implement that behavior as quickly as possible.
Two options from Base Julia:
@simd ivdep, and Base.Experimental.@aliasscope with Const on Arrays
help?> Base.Experimental.@aliasscope
@aliasscope expr
Allows the compiler to assume that all Consts are not being modified through stores within this scope, even if the compiler can't prove this to be the case.
│ Warning
│
│ Experimental API. Subject to change without deprecation.
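A minimal sketch of how the two are typically combined (the function below is my own illustration, not from the docs):

using Base.Experimental: @aliasscope, Const

# Wrap the read-only array in Const and assert no aliasing within the scope,
# so LLVM is free to vectorize the loads. This is only valid if dest and src
# really never overlap.
function scale!(dest::Vector{Float64}, src::Vector{Float64}, a::Float64)
    @aliasscope @inbounds @simd ivdep for i in eachindex(dest, src)
        dest[i] = a * Const(src)[i]
    end
    return dest
end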
Note that LLVM’s aliasing assumptions require that elements do not alias one another. This is violated by BitArrays, because many bits belong to the same chunk (chunks are UInt64, i.e. 64 bits, under the hood).
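For instance (peeking at an internal field purely for illustration):

julia> b = trues(128);

julia> length(b.chunks)   # 128 bits are stored in just two UInt64 chunks
2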
Within the package ecosystem, @avx also assumes no aliasing.
To expand on this, every now and then there’s a benchmark where it helps. For example:
julia> function simd_map!(f, y, x)
           @inbounds @simd for i ∈ eachindex(y,x)
               y[i] = f(x[i])
           end
       end
simd_map! (generic function with 1 method)
julia> function simd_ivdep_map!(f, y, x)
           @inbounds @simd ivdep for i ∈ eachindex(y,x)
               y[i] = f(x[i])
           end
       end
simd_ivdep_map! (generic function with 1 method)
julia> using VectorizationBase
julia> x = randn(256); y = similar(x);
julia> @benchmark simd_map!(VectorizationBase.vexp, $y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 653.994 ns (0.00% GC)
median time: 691.679 ns (0.00% GC)
mean time: 692.923 ns (0.00% GC)
maximum time: 920.909 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 165
julia> @benchmark simd_ivdep_map!(VectorizationBase.vexp, $y, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 130.065 ns (0.00% GC)
median time: 132.843 ns (0.00% GC)
mean time: 132.991 ns (0.00% GC)
maximum time: 175.632 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 891
julia> y ≈ exp.(x)
true
VectorizationBase.vexp, like Base.exp, uses a lookup table. Base.exp, however, is not marked @inline, so the compiler can’t optimize it here, preventing us from being able to use it as an example.
Because exp is looking up some values from a table, the compiler needs to know that y – the vector our loop is storing into – doesn’t alias that table. If they did alias, each assignment to y could change future loads, meaning that the iterations would have to be done in order, i.e. parallel execution with SIMD wouldn’t yield the correct answer. (Of course, you should never be writing into exp’s table; that’d corrupt exp!)
Because the accesses to the table are random, LLVM doesn’t know how to prove they don’t alias with simple pointer comparisons. If accesses were regular, you could just check that the pointers are far enough apart, e.g. at least the register size * unroll factor apart.
Compilers are really good at doing this these days, and a simple check like that is so cheap that normally it’s hard to notice the difference. It’s just in the trickier cases, like with tables, that it can make a big difference.
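Written out by hand, the kind of runtime guard a compiler emits for the regular case is roughly this cheap (my sketch, not actual compiler output):

# Vectorize only if the two arrays are at least one vectorized step apart;
# otherwise fall back to the scalar loop.
function may_overlap(y::Vector{Float64}, x::Vector{Float64}, bytes_per_step::Int)
    py, px = UInt(pointer(y)), UInt(pointer(x))
    dist = py > px ? py - px : px - py   # absolute pointer distance in bytes
    return dist < bytes_per_step
end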
Looks interesting, but it is still C++
The Base.exp case is especially annoying because it uses a Tuple for the table, so the compiler really should be able to figure out that there isn’t any possibility of aliasing.
The main difference from Cling is that Cling is built on top of Clang, while LFortran is a compiler that works both as an ahead-of-time compiler to binaries and interactively, and it has been designed from scratch to be that way. The interactive part is just a relatively small add-on. All features should work both ways.
I know.
It’d be good to make an internal exp_inline version that’s marked @inline, and then have exp(x::Union{Float32,Float64}) = exp_inline(x).
That’d make it easier to test/track.
Or at least switch BetterExp to a tuple, so we can use that.
VectorizationBase needs pointers, so it can’t use a tuple itself.
Once we have an inlined tuple version to look at, we should check the TBAA metadata and then file an LLVM bug report. Perhaps that can be fixed, or they can point us to something that’s missing.
Awesome! (as usual…) Note that you took a pretty large base case, and I would assume the speedup has more to do with single-core performance than with multithreading performance. What I tried to measure some time ago was the threading overhead, i.e. how large N must be for threading to be helpful for a very simple function, and there I found OpenMP had a large advantage. But that was a while ago, and before LoopVectorization and FLoops…