Julia gets mentioned in an article about FORTRAN

John! That is great! Really great to see that this has opened up as an option.

Yes, this is just a "proxy benchmark", as @lmiq said. I actually like this particular "proxy": I often find myself writing a loop that calls a subroutine (some kind of numerical function) on each element of an array, either producing a new array or accumulating a sum (as in the case above). Being able to optimize this kind of benchmark to obtain optimal performance is therefore essential. Obviously, this is not the only thing we want optimized, but it's a start.
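The pattern described above can be sketched in a few lines of Julia (a minimal illustration of the benchmark shape, not code from the thread):

```julia
# The "proxy benchmark" pattern: apply a numerical kernel (here sin)
# to each integer in a range and accumulate the result into a sum.
function kernel_sum(N)
    s = 0.0
    for i in 1:N
        s += sin(i)
    end
    return s
end

# The same computation as a one-liner reduction:
kernel_sum(1000) ≈ sum(sin, 1:1000)
```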

If any of you are interested, we have started a "benchmark" repository at:

GitHub - fortran-lang/benchmarks: Fortran benchmarks

But we didn't have time to actually add any yet. The benchmark above would be a great candidate. You can browse the issues there, such as this one, for some background discussion on how to approach it:

Benchmark criteria · Issue #2 · fortran-lang/benchmarks · GitHub

Julia has its own benchmarks here: GitHub - JuliaLang/Microbenchmarks: Microbenchmarks comparing the Julia Programming language with other languages. Last time, when I suggested using -ffast-math to speed up one Fortran benchmark there by a factor of 2, it was not approved by the Julia community. My own conclusion is that our communities will not agree on how to run the benchmarks, what compiler options to use, or how to interpret the results. However, we might agree on some subset of the available benchmarks and on how to improve them, so we could collaborate on that.

5 Likes

In my relatively short experience here, it is virtually impossible to "agree" on these benchmarks. Both languages can be written to be equally performant, and "fairness" is in the eye of the beholder. I had previously written Julia code faster than my own Fortran code, and it was shown to me here (by C. Elrod, actually) how I should be writing the Fortran code to make it equally performant. So I became a better Fortran user…

In the end it comes down to what is more natural to write in one language or the other, and how likely someone is to find the optimal code. But that is not a quantitative variable.

4 Likes

I suppose that if you are comparing
a loop that sequentially computes a modest function call (sine of an integer), and
computing the same functions in some maximally parallel fashion,

and tying this together by summing them …

you could write it as map-reduce. Conceptually, in Common Lisp: (reduce #'+ (mapcar #'sin '(1 2 3 …)))

but you wouldn't actually program it that way, since it starts by constructing the whole list …

so comparing map-reduce "in parallel" somehow vs. map-reduce "using one processor" is the benchmark. Is there a Julia MapReduceInParallel?

Maybe GitHub - tkf/ThreadsX.jl: Parallelized Base functions qualifies?
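As a rough illustration of what a parallel map-reduce does under the hood, here is a hand-rolled version using only Base threading. This is my sketch, not code from ThreadsX.jl, and it assumes the reduction operator is associative; ThreadsX.mapreduce handles the details far more carefully:

```julia
using Base.Threads

# Split the range into one chunk per thread, reduce each chunk on its
# own task, then combine the partial results.
function pmapreduce(f, op, r)
    nchunks = max(nthreads(), 1)
    chunks  = collect(Iterators.partition(r, cld(length(r), nchunks)))
    tasks   = [Threads.@spawn(mapreduce(f, op, c)) for c in chunks]
    return mapreduce(fetch, op, tasks)
end

pmapreduce(sin, +, 1:1_000_000)  # agrees with mapreduce(sin, +, 1:1_000_000) up to rounding
```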

1 Like

In Julia there is no reason not to write the loop explicitly.

@rfateman here are some options for writing one-liners, compared to the above. Hard to beat the loop with @avx:

julia -t4 sum.jl
loop  293.541 ms (0 allocations: 0 bytes)
1.9558914085412433
avx  22.379 ms (0 allocations: 0 bytes)
1.955891408541291
avxt  8.393 ms (0 allocations: 0 bytes)
1.955891408541158
simd  292.663 ms (0 allocations: 0 bytes)
1.9558914085412433
sumiter  294.163 ms (0 allocations: 0 bytes)
1.9558914085412433
mapreduce  298.790 ms (0 allocations: 0 bytes)
1.9558914085409373
threadsx.mapreduce  109.673 ms (255 allocations: 15.66 KiB)
1.9558914085412005

Code
using BenchmarkTools, Test
using LoopVectorization   # provides @avx (SIMD) and @avxt (threaded SIMD)
using ThreadsX

# Plain sequential loop
function f(N)
    s = 0.0
    for i in 1:N
        s += sin(i)
    end
    s
end

# SIMD-vectorized with LoopVectorization's @avx
function f_avx(N)
    s = 0.0
    @avx for i in 1:N
        s += sin(i)
    end
    s
end

# Threaded + SIMD-vectorized with @avxt
function f_avxt(N)
    s = 0.0
    @avxt for i in 1:N
        s += sin(i)
    end
    s
end

# Base's @simd hint
function f_simd(N)
    s = 0.0
    @simd for i in 1:N
        s += sin(i)
    end
    s
end

# One-liner variants
f_sumiter(N) = sum(sin(i) for i in 1:N)

f_mapreduce(N) = mapreduce(sin, +, 1:N)

f_threadx_mapreduce(N) = ThreadsX.mapreduce(sin, +, 1:N)

N = 10_000_000

# All variants should agree up to floating-point rounding
@test f(N) ≈ f_avx(N) ≈ f_avxt(N) ≈ f_simd(N) ≈ f_sumiter(N) ≈ f_mapreduce(N) ≈ f_threadx_mapreduce(N)

print("loop"); println(@btime f($N))
print("avx"); println(@btime f_avx($N))
print("avxt"); println(@btime f_avxt($N))
print("simd"); println(@btime f_simd($N))
print("sumiter"); println(@btime f_sumiter($N))
print("mapreduce"); println(@btime f_mapreduce($N))
print("threadsx.mapreduce"); println(@btime f_threadx_mapreduce($N))

2 Likes

I agree that benchmarks lead to disputes, but they can also be educational. That is, you work on a benchmark predetermined to show that your latest project is better than the competition! Or you use it to improve your project! To do this you write what you think is equivalent code in someone else's system. Typically, an expert in that other system can find a better way to write the benchmark, and so the race is on. But you have learned a bit about the other system.
With this in mind, here's another version of the benchmark, written in Lisp…

 (loop for i from 1 to 100 sum (sin i))

;;; now you might say, huh, Lisp has keywords, a do loop, ???
Well, loop is defined as a macro. There's another "do" version that has no keywords but more parentheses, if you prefer…

Now the rest of the benchmark requires some study of how to compile this. And, if I understand what you are really trying to compare: can you indicate a parallel version that will compile differently? Say
(loop for i from 1 to 10 sumparallel (sin i))

The writers of a Common Lisp compiler have at their disposal whatever tools are available to any other compiler writers. Some Common Lisp compilers target C, some target the JVM, some have their own byte-code, some generate x86 assembler, ARM, M1, etc. I suppose one could target Julia. Common Lisp can also run MPI, for example
https://common-lisp.net/project/cl-mpi/

I point this out not because I ran it and it is faster than FORTRAN or Julia.

I only say this to point out that the expression is about as clear as can be, even though it is in Lisp. Could it be made clearer? Um, maybe
(loop for i from 1.0d0 to 100.0d0 sum (sin i)) would ease the burden on the compiler's type-inference task, but it might generate the same code. Certainly the result type of (sin i) is a float.
Also, at least some Common Lisps (SBCL, a popular one) do just-in-time compilation.
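For comparison, the same one-liner reads about as directly in Julia (my sketch, not code from the thread):

```julia
# Julia analogues of (loop for i from 1 to 100 sum (sin i)):
s1 = sum(sin(i) for i in 1:100)  # generator form
s2 = sum(sin, 1:100)             # function-argument form
s1 ≈ s2
```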

Here's another upvote, and let that include not only executables but libraries, too.

My 2 cents: at work I am currently familiarising myself with a Fortran PDE-solving code (about 10 years old). As there are not many proficient developers around, we are discussing rewriting/replacing the codebase. This code (among others) is used in dll form by a C# GUI tool. Therefore, afaict, Julia is out of the race right out the door: there is no way to compile a dll (and also no proficient devs yet).

It looks like the problem could otherwise be a good fit for Julia, but right now the plan looks like a mixture of using Python to wrap/write tests/probe (to ease interaction; many know Python) and doing as many improvements/modernizations as feasible (so I'm picking up Fortran now, and that makes sense for me although it's "old").
Personally, this is fine for me: my toolkit will include Python, Modelica, Fortran (maybe a little C#) and should cover most problems I'll encounter. One day Julia will hopefully be at a point where it can replace one or more of these for me, or at least join the club beyond hobby/spare time, but right now that's not in the cards.

4 Likes

I think Julia would make a fine outer layer (wrapper) for such a code as you described.
So perhaps another viewpoint might be helpful: instead of thinking of Julia as something
that needs to be wrapped (as a dynamic lib, perhaps), think of Julia as the glue.

The cool alternative would be to replace everything with Julia. Not that I claim that to be realistic for this application specifically at this point, but what is there now (and I have some cases like this) involves: compiling the Fortran code for different platforms, shipping or making the user install Python, and having the GUI run only on Windows (C#).

Alternatively, one could simply download a very simple installer that ships Julia and the application, or, if the user already has Julia, just add the package.

I guess (not sure) that the GUI is where things are less mature, am I right?

(Also, if the application relies heavily on the user actually using Python, that is another problem.)

Petr: Sure! One draw of Julia is that it has the potential to be both the (nice) glue and the (fast) gears.
In this case, any glue we need will be Python, as we have >5 proficient people, and 0 for Julia (I only know a little).

Leandro: indeed, that would solve the N-language problem right there. :smile:
I guess what I tried to do here is illustrate how the lack of dll capability can be an adoption blocker in some cases.
(For completeness: maybe I did not express this well, but Python is only for developer use; the users use the solver dll via a C# GUI that is maintained elsewhere. Replacing the small dll with a couple hundred MB of interpreter and code is not realistic, I think.)

1 Like

The manual describes how to link an application against libjulia.so on Linux; does that not work on Windows for dlls?

A ā€œcouple of hundred MBā€ dll is not dramatic compared to the size of the C# runtimeā€¦ If you are on a desktop machine, does the dll size really matter?

3 Likes

Thanks for the pointer, I'll have to take a look at that.

Will it not be sitting in RAM? Will it not be contributing to RAM fragmentation?
Of course it matters.

If it's one blob, it won't cause any fragmentation. Also, if large portions of it aren't needed, they will likely be moved to swap by the OS. The binary size isn't a good thing (hence StaticCompiler.jl), but IMO it's a pretty minor concern.

I didn't mean to say that the blob itself would fragment, but that the dll will have to play around with all the others loaded in memory. Now, I don't know which parts of a dll get loaded into RAM, but I know from my system (32 GB RAM) that Task Manager tells me 60% of that memory is free, while another tool tells me that the largest chunk is only 2 GB. That means I currently cannot load an image larger than that size.

...
 C:\SVN\mirone30\bin\win32\m_dispatcher.dll                  7afe0000    00006000    0000a000
 C:\SVN\mirone30\bin\win32\libmwbuiltins.dll                 7aff0000    00027000    000f9000*
 C:\SVN\mirone30\bin\win32\libmx.dll                         7b110000    00028000    00008000
 C:\WINDOWS\SYSTEM32\MFC42.DLL                               7b140000    00124000    04d7c000*
 <anonymous>                                                 7ffe0000    00001000    00007000
 <anonymous>                                                 7ffe8000    00001000    7fd57000*
 <anonymous>                                                 ffd40000    00060000    00000000
 <anonymous>                                                 ffda0000    00100000    00000000
 <anonymous>                                                 ffea0000    00011000    0000f000
 <anonymous>                                                 ffec0000    00002000    0000e000
 <anonymous>                                                 ffed0000    00021000    0000f000
 <anonymous>                                                 fff00000    00002000    0000e000
 <anonymous>                                                 fff10000    00001000    0000f000
 <anonymous>                                                 fff20000    00023000    800bd000
                                                                        ==========  ==========
 Totals                                                                  0a99d000    75663000

 Largest available memory block is 2144694272 bytes (2045.34 MB) located at address 7ffe9000

At the risk of careening off topic, I think that you're confusing some things about physical and virtual memory. I'm not on Windows, but I have a vague idea how it works. Outside of some specialized uses, physical memory does not get fragmented (or, alternatively, it is completely fragmented); it is assigned on a page-by-page basis to virtual memory in the running processes. A page on Windows is 4 KB. Furthermore, the code and data of programs and dlls are only assigned to physical memory on demand, as they are needed, even if a large chunk of virtual memory is reserved for them when they are loaded.

I don't know what the list you've posted is; it seems to show mappings that have already been made and freed. But the available address space on 64-bit machines is enormous, 2^48 bytes. I don't think you need to worry about loading images >2 GB into either physical or virtual memory.

This is just to say that a dll that takes up 200 MB of disk space or maps 2 GB of memory is just not a big deal.

2 Likes

Sure, I may be confusing things; I don't know much about the details of how it works. The numbers I showed are from a MEX file provided by a MathWorks employee (no source code) that shows the RAM fragmentation, and my experience using it is that I could never load data into RAM larger than the largest chunk reported by that tool. Matlab now has a builtin function that gives this information, but I forgot its name.

Julia: let's keep the advantages of Python: being interactive, easy to learn, "duck typing", etc. Now let's see how we can fix the main flaw: speed. To do that, one has to introduce types, use LLVM from the beginning, figure out the semantics in a way that can be compiled to be fast, etc. Modify the language a bit. Since we are introducing a new language, let's also do some other cool stuff, such as multiple dispatch.

Julia: let us make a programming language for mathematicians and scientists! Let's avoid the licensing problems of Matlab, the archetypal scientific programming package that is not quite a language. Keep the interactive aspect, and keep it as easy to learn as the user desires. Let us make it fast by making it compiled, and use LLVM for portability. Let us not make it cryptic and over-vectorized in syntax, except when it remains readable to do so. Greenspun's tenth rule says all good languages become a subset of Common Lisp, so let's actually build it from Lisp, but hide that from the user (but not really). Lastly, let us build a story around how multiple dispatch is the secret to success, but it's really libraries and package management that will create a movement.

7 Likes

\textrm{Julia}\approx (((\textrm{Matlab} \cup \textrm{Python}) \setminus \textrm{design\_mistakes}) \cup \textrm{multiple\_dispatch}) \triangleright(\textrm{Lisp} \setminus \textrm{parentheses})

6 Likes