Thanks. I ran @time several times and took the best timing. I think that would ensure the compilation time is not included. Or am I wrong?
That’s fine. That’s what @btime does for you.
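A minimal sketch of the "run it several times and keep the best" approach, using only Base. The names `work` and `best_time` are made up for this example; @btime from BenchmarkTools does considerably more (many samples, tuned evaluation counts) but likewise reports the minimum time.

```julia
# A small function to time; `work` is a hypothetical stand-in.
work(n) = sum(sin(i) for i in 1:n)

# Run the function several times and keep the fastest run, so that the
# first call (which includes JIT compilation) does not set the result.
function best_time(f, arg; repeats = 5)
    best = Inf
    for _ in 1:repeats
        t = @elapsed f(arg)
        best = min(best, t)
    end
    return best
end

first_call = @elapsed work(10^6)   # includes compilation of `work`
println("first call: ", first_call, " s")
println("best of 5:  ", best_time(work, 10^6), " s")
```

The first call is typically noticeably slower than the best of the later ones, which is the compilation overhead being discussed.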
Referencing an earlier post about using Julia in restricted environments:
I encountered something similar when I tried to write a custom AWS Lambda runtime for Julia. I don’t recall if I used ApplicationBuilder or PackageCompiler, so I can’t be as specific as I would like to be, but a simple hello world was too big to fit into the available storage for a custom runtime.
Huh, I must be missing something here. It seems you are calling sin() 100,000,000 times, which suggests to me that the cost of adding the various results is a small percentage of the computation.
The computation of sin(1), sin(2), etc. can be done using the addition formula for sin(x+y); all you need is to compute sin(1) and cos(1) once and then do some arithmetic. For instance, sin(2) is 2*cos(1)*sin(1).
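A minimal sketch of that idea, assuming the standard angle-addition identities sin(k+1) = sin(k)cos(1) + cos(k)sin(1) and cos(k+1) = cos(k)cos(1) - sin(k)sin(1). The function name `sum_sin_recurrence` is made up for illustration. Roundoff accumulates along the recurrence, so the result should be expected to match the direct sum only approximately.

```julia
# Sum sin(1) + sin(2) + ... + sin(N) with only one sin and one cos call.
function sum_sin_recurrence(N)
    s1, c1 = sin(1.0), cos(1.0)   # the only trigonometric evaluations
    sk, ck = s1, c1               # sin(k) and cos(k), starting at k = 1
    total = 0.0
    for _ in 1:N
        total += sk
        # Advance from k to k + 1 using the angle-addition formulas.
        sk, ck = sk * c1 + ck * s1, ck * c1 - sk * s1
    end
    return total
end

direct = sum(sin(i) for i in 1:1000)
println(sum_sin_recurrence(1000), " vs ", direct)
```

This is also why such a loop only works as a timing proxy: it measures the cost of calling sin, not the cheapest way to obtain the sum.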
While adding 100,000,000 numbers will give you some kind of timing data, the sum may be substantially inaccurate because of cancellation, roundoff, etc. (look up Kahan summation).
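For readers who have not met it, Kahan (compensated) summation can be sketched as follows. The function name `kahan_sum` is made up for this example, and this is the textbook form of the algorithm, not something taken from the benchmark code in this thread.

```julia
# Compensated (Kahan) summation of f(i) for i in 1:N.
function kahan_sum(f, N)
    s = 0.0   # running sum
    c = 0.0   # running compensation for lost low-order bits
    for i in 1:N
        y = f(i) - c       # apply the correction carried over from the last step
        t = s + y          # low-order bits of y may be lost here
        c = (t - s) - y    # recover what was lost (algebraically zero)
        s = t
    end
    return s
end

println(kahan_sum(sin, 10^6))
println(sum(sin, 1:10^6))   # Base's sum, for comparison
```

Note that the compensation step relies on floating-point operations not being reassociated, so it should not be combined with fast-math style optimizations.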
There are plenty of scientific computing benchmarks in the open literature (in FORTRAN, Matlab, C, …) that might be coded in Julia [if they are not already available in Julia] and that would make a more useful comparison. Maybe see “Lies, damned lies, and benchmarks: What makes a good performance metric” (GraphicSpeak).
The Julia benchmarks I was able to easily spot (via Google) were kind of micro-optimization claims, but maybe there’s more?
I was the one who originally posted this test, and its only purpose was to compare the possible overhead of threading with that of OpenMP. The reason for using sin there is that the compiler then cannot trick us by determining the result without computing the sum.
@dcgrigsby please look at this thread. The limits on Lambda have changed. The report here is that the Julia image size is around 500 MB.
John! That is great! Really great to see that this has opened up as an option.
Yes, this is just a “proxy benchmark”, as @leandromartinez98 said. I actually like this particular “proxy”, as I often find myself writing a loop that calls a subroutine (some kind of numerical function), either to apply it to each element of an array (with an array as the result) or to apply it to each element and take a sum (as in the above case). So being able to optimize these kinds of benchmarks to obtain optimal performance is essential. Obviously, this is not the only thing we want to get optimized, but it’s a start.
If any of you are interested, we have started a “benchmark” repository at:
But we didn’t have time to actually add any benchmarks. The above benchmark would be a great candidate. You can browse through the issues there, such as this one, for some background discussion on how to approach this:
Julia has its own benchmarks here: GitHub - JuliaLang/Microbenchmarks: Micro benchmark comparison of Julia against other languages. Last time I suggested using -ffast-math to actually speed up one Fortran benchmark there by a factor of 2, it was not approved by the Julia community. My own conclusion from that is that our communities will not agree on how to run the benchmarks, what compiler options to use, and how to interpret the results. However, I think we might agree on some subset of the available benchmarks and how to improve them. So we could collaborate on that.
In my relatively short experience here it is virtually impossible to “agree” on these benchmarks. Both languages can write code equally performant, and “fairness” is in the eyes of the beholder. I had before written Julia code faster than my own Fortran code and it was shown to me here (actually C. Elrod did) how I should be executing the Fortran code to be equally performant. So I became a better Fortran user…
At the end it comes down to what is more natural to write in one or other language and how likely someone will find the optimal code. But that is not a quantitative variable.
I suppose that if you are comparing
a loop of sequentially computing a modest function call (sine of an integer) and
trying to compute the same functions in some maximally parallel fashion,
and tying this together by summing them …
you could write it as map-reduce. Conceptually, in Common Lisp: (reduce #'+ (mapcar #'sin '(1 2 3 …)))
but you wouldn’t actually program it that way since it starts by construction of the list …
so comparing map-reduce “in parallel” somehow vs map-reduce “using one processor” is the benchmark. Is there a Julia MapReduceInParallel ??
Maybe GitHub - tkf/ThreadsX.jl: Parallelized Base functions qualifies?
In Julia there is no reason not to write the loop explicitly.
@rfateman here are some options to write one-liners, compared to the above. Hard to beat the loop with avx:
julia -t4 sum.jl
loop                293.541 ms (0 allocations: 0 bytes)          1.9558914085412433
avx                  22.379 ms (0 allocations: 0 bytes)          1.955891408541291
avxt                  8.393 ms (0 allocations: 0 bytes)          1.955891408541158
simd                292.663 ms (0 allocations: 0 bytes)          1.9558914085412433
sumiter             294.163 ms (0 allocations: 0 bytes)          1.9558914085412433
mapreduce           298.790 ms (0 allocations: 0 bytes)          1.9558914085409373
threadsx.mapreduce  109.673 ms (255 allocations: 15.66 KiB)      1.9558914085412005
using BenchmarkTools, Test
using LoopVectorization
using ThreadsX

function f(N)
    s = 0.
    for i in 1:N
        s += sin(i)
    end
    s
end

function f_avx(N)
    s = 0.
    @avx for i in 1:N
        s += sin(i)
    end
    s
end

function f_avxt(N)
    s = 0.
    @avxt for i in 1:N
        s += sin(i)
    end
    s
end

function f_simd(N)
    s = 0.
    @simd for i in 1:N
        s += sin(i)
    end
    s
end

f_sumiter(N) = sum(sin(i) for i in 1:N)

f_mapreduce(N) = mapreduce(sin, +, 1:N)

f_threadx_mapreduce(N) = ThreadsX.mapreduce(sin, +, 1:N)

N = 10000000

@test f(N) ≈ f_avx(N) ≈ f_avxt(N) ≈ f_simd(N) ≈ f_sumiter(N) ≈ f_mapreduce(N) ≈ f_threadx_mapreduce(N)

print("loop");println(@btime f($N))
print("avx");println(@btime f_avx($N))
print("avxt");println(@btime f_avxt($N))
print("simd");println(@btime f_simd($N))
print("sumiter");println(@btime f_sumiter($N))
print("mapreduce");println(@btime f_mapreduce($N))
print("threadsx.mapreduce");println(@btime f_threadx_mapreduce($N))
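For comparison with the ThreadsX.mapreduce one-liner, a hand-rolled parallel map-reduce using only Base.Threads might look like the sketch below. The name `chunked_mapreduce` and the `ntasks` keyword are made up for this example, it assumes N ≥ 1 and an associative `op`, and it has not been benchmarked against the variants above.

```julia
using Base.Threads: @spawn, nthreads

# Split 1:N into contiguous chunks, reduce each chunk in its own task,
# then combine the partial results. Assumes N >= 1.
function chunked_mapreduce(f, op, N; ntasks = nthreads())
    chunksize = cld(N, ntasks)
    tasks = map(1:chunksize:N) do lo
        hi = min(lo + chunksize - 1, N)
        @spawn mapreduce(f, op, lo:hi)
    end
    return reduce(op, fetch.(tasks))
end

println(chunked_mapreduce(sin, +, 10_000_000))
```

Started with `julia -t4`, this spawns four tasks by default. Because the partial sums are combined in a different order than in the serial loop, the last digits of the result can differ, as they do in the timings above.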
I agree that benchmarks lead to disputes, but can also be educational. That is, you work on a benchmark with a pre-determination to find one that shows your latest project is better than the competition! Or you use it to improve your project! To do this you write what you think are equivalent codes using someone else’s system. Typically, an expert in that other system can find a better way to write the benchmark and so the race is on. But you learned a bit about the other system.
With this in mind, here’s another version of the benchmark written in lisp…
(loop for i from 1 to 100 sum (sin i))
;;; now you might say, huh, Lisp has keywords, a do loop, ???
well, loop is defined as a macro. There’s another “do” version that has no keywords, but more parentheses if you prefer…
Now the rest of the benchmark requires a study as to how to compile this, and, if I understand what you are really trying to compare, can you indicate a parallel version that will compile differently, say
(loop for i from 1 to 10 sumparallel (sin i))
The coders of the compiler for common lisp have at their disposal whatever tools are available to any other compiler writers. Some common lisp compilers target C, some target JVM, some have their own byte-code, some generate i86 assembler, ARM, M1, etc, I suppose one could target Julia. Common lisp can also run MPI, for example
I point this out not because I ran it and it is faster than FORTRAN or Julia.
I only say this to point out that the expression is about as clear as can be, even though it is in lisp. Could it be made clearer? um, maybe
(loop for i from 1.0d0 to 100.0d0 sum (sin i)) would ease the burden on the compiler type-inference task, but it might generate the same code. Certainly the result type of (sin i) is a float.
Also, at least some Common Lisps (SBCL, a popular one) do just-in-time compilation.
Here’s another upvote, and let that include not only executables but libraries, too.
My 2 cents: At work I am currently familiarising myself with a Fortran PDE-solving code (about 10 years old). As there are not many proficient developers around, we are discussing rewriting/replacing the codebase. This code (among others) is being used in dll form by a C# GUI tool. Therefore, afaict Julia is out of the race right out the door - no way to compile a dll (and also no proficient devs yet).
It looks like the problem could otherwise be a good fit for Julia, but right now it looks like a mixture of using python to wrap/write tests/probe (to ease interaction, many know python) and doing as many improvements/ modernizations as feasible (so I’m picking up Fortran now, and that makes sense for me although it’s “old”).
Personally, this is fine for me - my toolkit will include Python, Modelica, Fortran (maybe a little C#) and should cover most problems I’ll encounter. One day Julia will hopefully be at a point where it can replace one or more of these for me, or at least join the club beyond hobby/spare time, but right now that’s not in the cards.
I think Julia would make a fine outer layer (wrapper) of such a code as you described.
So perhaps another viewpoint might be helpful: instead of thinking of Julia as something that needs to be wrapped (as a dynamic lib, perhaps), think of Julia as the glue?
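As a rough illustration of the glue idea: Julia can call into compiled C or Fortran code directly with ccall, without separate wrapper code. The sketch below uses the C standard library's strlen purely as a stand-in, and `c_strlen` is a made-up name; calling the actual Fortran solver would need its real library path and symbol names, which are not given in this thread.

```julia
# Call a function from an already-loaded C library without any wrapper code.
# `strlen` from the C standard library stands in for a real solver routine.
c_strlen(s::String) = Int(ccall(:strlen, Csize_t, (Cstring,), s))

println(c_strlen("hello"))
```

For a routine in a separate shared library, the first argument becomes a `(:symbol, "library")` tuple.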
The cool alternative would be to replace everything with Julia. Not that I claim that to be realistic for his application specifically at this point, but what is there now (and I have a case like this) involves compiling the Fortran code for different platforms, shipping Python or making the user install it, and having the GUI run only on Windows (C#).
Alternatively one could simply download a very simple installer that ships Julia and the application, or if the user already has Julia, just an
I guess (not sure) that the GUI is where things are less mature, am I right?
(Also, if the application relies heavily on the user actually using Python, that is another problem.)
Petr: Sure! One draw of Julia is that it has the potential to be both the (nice) glue and the (fast) gears.
In this case, any glue we need will be Python, as we have >5 proficient people, and 0 for Julia (I only know a little).
Leandro: indeed, that would solve the N-language problem, right there.
I guess what I tried to achieve here is illustrate how the lack of dll capability can be an adoption blocker in some cases.
(For completeness: maybe I did not express this well, but Python is only for developer use; the users use the solver dll via a C# GUI that is maintained elsewhere. Replacing the small dll with a couple hundred MB of interpreter and code is not realistic, I think.)
The manual describes how to link an application against libjulia.so on Linux; does that not work on Windows for dlls?
A “couple of hundred MB” dll is not dramatic compared to the size of the C# runtime… If you are on a desktop machine, does the dll size really matter?
Thanks for the pointer, I’ll have to take a look at that.
Will it not be sitting in RAM? Will it not be contributing to RAM fragmentation?
Ofc it matters.