I believe that, in theory, a high-level language compiler could generate faster code for an algorithm than a C/C++ compiler if it has more information about the data objects and their usage than the C/C++ compiler is allowed to assume. I don’t know whether Julia has that information or not. I suspect in certain cases the answer is yes, but probably more often the answer is no.
This question also gets into the “fast enough” question. As programmers (in general) we are not going for the fastest execution of code, we are going for fast enough. If we wanted the absolute fastest we’d probably be writing in assembler, or compiling down to assembler and then hand-tuning that code.
As an example of that, I was looking at the BLAKE3 reference implementation in Rust. They have a native Rust implementation, then a faster implementation that uses SIMD instructions in a C library, then the fastest implementation that uses SIMD instructions in an assembler file. They obviously felt that the C compiler’s output was NOT fast enough.
So the “correct” questions for the general programming population are: does the language let me write code quickly, is that code easy to maintain, and does it run fast enough?
Probably not the answers you were looking for, but I don’t think Julia will ever run “real” code faster than C++ code compiled with the Intel compiler.
I wonder what Chris Elrod would have to say about this. He seems to be writing stuff in Julia that is beating the performance of everything else. A good understanding of the hardware and of how the code interfaces with the compiler appears to be enough for (someone like him) to write code that performs at the limits of the processor’s capacity. I read somewhere that part of that performance comes from code generated at runtime for the specific types and sizes of the variables involved, something more natural to do in a language like Julia.
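To make that size-specialization idea concrete, here is a minimal sketch (my own illustration, not Chris Elrod’s actual code): because the tuple length `N` is part of the type, Julia compiles a separate, fully unrolled method instance for each size it encounters.

```julia
# Sketch: the generated body is built from the type information (N),
# so each tuple length gets its own unrolled machine code.
@generated function unrolled_dot(a::NTuple{N,Float64}, b::NTuple{N,Float64}) where {N}
    ex = :(0.0)
    for i in 1:N
        ex = :($ex + a[$i] * b[$i])
    end
    return ex
end

unrolled_dot((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))  # 32.0
```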
In my (very limited) understanding, assembler code is only efficient if written explicitly for the concrete hardware / microarchitecture you are running it on. You have to hard-code e.g. SIMD usage, so if your program should run on both AVX2 and AVX-512 machines, you cannot make use of the newer AVX-512 features, resulting in suboptimal code on the latter. Of course you could program both versions and choose the optimal one at run time depending on your microarchitecture, but this is a lot of work.
A compiler, in contrast, may optimize C or Julia code for your specific microarchitecture, which may be more efficient in some cases.
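A rough Julia sketch of what that run-time selection looks like (the kernels are placeholders, and matching on `Sys.CPU_NAME` is only a crude heuristic; real code would query CPU features properly, e.g. via CpuId.jl):

```julia
# Placeholder kernels standing in for hand-tuned AVX2 / AVX-512 paths.
scale2_avx2!(y, x)   = (y .= 2 .* x)
scale2_avx512!(y, x) = (y .= 2 .* x)

function scale2!(y, x)
    # Sys.CPU_NAME is the LLVM name of the host CPU, e.g. "skylake-avx512".
    if occursin("avx512", Sys.CPU_NAME)
        scale2_avx512!(y, x)
    else
        scale2_avx2!(y, x)
    end
end
```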
@paulmelis this is a great topic for discussion. I work in HPC and I agree that there is ‘folk wisdom’ that the Intel compilers buy you performance. That was certainly the case in the past, and probably is true today.
I would love to see that Julia comparison between latest generation CPUs also.
Moving the discussion along a little: in the past on this forum we have seen benchmarking reports with Julia code which in the end turned out to be dominated by BLAS library performance. It depends on the code, of course! I think Intel put a lot of effort into those optimised math libraries, and that is what wins them performance.
Simply putting LoopVectorization.@tturbo on three for loops manages to do well until we run out of L2 cache. Results will differ by CPU of course. That particular CPU has much more L2 cache than most (18 cores, 1 MiB/core).
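For reference, the “three for loops” here are essentially the matrix-multiply kernel from the LoopVectorization documentation, with the threaded `@tturbo` variant of the macro:

```julia
using LoopVectorization

function matmul_tturbo!(C, A, B)
    # Plain triple loop; @tturbo vectorizes and threads it.
    @tturbo for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end
```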
MKL starts to run away from the competition around 3000x3000 on that computer.
On AMD systems, the small size performance of Julia-matrix multiply is unmatched:
The DifferentialEquations ecosystem doesn’t really do any hardware-specific optimizations itself yet, but it (and ModelingToolkit in particular) is another great example of how code generation can be leveraged for better problem solving in Julia.
A long-term goal of mine is to work on an SLP vectorizer it can use, as well as an SPMD compiler like ISPC. DifferentialEquations would be the target for both of these, but they should be usable by interested Julia projects more generally.
But for now, I still have a lot of loop work ahead of me (in particular, modeling much more complicated loops so they can be optimized and still produce correct results).
For general compiler optimizations, I don’t think a high-level language differs much from low-level languages, because a lot of these optimizations happen at a rather low level. They must be done at a low level because these kinds of program transformations rely on specific program properties, such as purity and invariance. These properties are much easier to prove at a low level, because a high-level IR is more complicated than a low-level one. A compiler designer working on a high-level IR has to cover more control-flow structures and operators. Also, to use this kind of high-level information, he must be careful to preserve it during code transformation and lowering, which is really hard.
What you said here reminds me that a few months ago I read an article about Rust on Hacker News claiming that Rust can be faster than C++ because ownership can improve the results of alias analysis and lead to better code generation. Then someone who works on these kinds of compiler optimizations disagreed and showed that C++ can actually do that even without an ownership system…
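In Julia the closest day-to-day analogue is handing the compiler that aliasing guarantee yourself, e.g. with the (unchecked) `ivdep` assertion; a sketch:

```julia
function axpy_ivdep!(y, x, a)
    # ivdep promises the iterations are independent (y and x don't alias),
    # which frees LLVM to vectorize without conservative aliasing checks.
    @inbounds @simd ivdep for i in eachindex(y, x)
        y[i] += a * x[i]
    end
    return y
end
```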
icc is not officially supported for building Julia base. It won’t buy you much anyways, since it only affects the Julia runtime, which typically doesn’t handle anything performance sensitive.
Given the various complaints about LLVM becoming noticeably slower and slower, it seems any speedup in that area would be appreciated. So perhaps building Julia and LLVM with Intel ICC could actually have some (but probably small) impact on overall Julia performance.
I was thinking more of smaller and fewer binaries than of a difference in performance. I remember a time when Octave was built both with Visual Studio and cross-compiled. At that time the VS binary was 20 MB and the other one 200 MB. Of course I’m not expecting a 10x difference, but even if it were only half the size that would already be a big win. Not so much for now, but for when one is able to create stand-alone programs without having to carry files hundreds of megabytes in size even for small compiled programs.
I’m just ccalling dgemm through MKL_jll.
PRs to improve it or alternatives are welcome.
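For context, “ccalling dgemm through MKL_jll” can look roughly like the sketch below; this is my own illustration of the pattern rather than the exact benchmark code, and it assumes MKL’s default LP64 interface (32-bit BLAS integers) and column-major Float64 matrices:

```julia
using MKL_jll  # provides libmkl_rt

# Hedged sketch: C = A*B via MKL's Fortran dgemm.
function mkl_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    M, K = size(A)
    N = size(B, 2)
    ccall((:dgemm_, libmkl_rt), Cvoid,
          (Ref{UInt8}, Ref{UInt8},                  # transa, transb
           Ref{Int32}, Ref{Int32}, Ref{Int32},      # m, n, k
           Ref{Float64}, Ptr{Float64}, Ref{Int32},  # alpha, A, lda
           Ptr{Float64}, Ref{Int32},                # B, ldb
           Ref{Float64}, Ptr{Float64}, Ref{Int32}), # beta, C, ldc
          UInt8('N'), UInt8('N'), M, N, K,
          1.0, A, M, B, K, 0.0, C, M)
    return C
end
```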
Note at those small sizes (<100) MKL is much faster than OpenBLAS and BLIS on the Intel CPU (and performs about the same as those two on AMD).
I don’t need to run it myself.
I just think that if one compares against another system to show one has a better solution, one must bring the other system to its optimal state.
Just as you should do in papers: when comparing to others, we either take the optimal parameters or ask the developers to bring their best.
The overhead of MKL is big. I am sure you use @inbounds and remove other checks in the Julia code.
We’ll be more than happy to have a neutral package to beat MKL but we need to beat MKL’s best.
It’s time to have the best BLAS performance in open source form.
Octavian isn’t specializing on matrix size, so if you want to use MKL’s JIT, then we should use statically sized StrideArrays.StrideArray matrices for Octavian.
Do you know how to optimize the MKL run in detail, and what MKL_jll is not doing? That would be super helpful not only for the people developing the alternatives, but also for everyone else using it.
The direct call was much faster (90 ns vs 179 ns), but still much slower than Octavian’s matmul! (30 ns).
Things other than the direct call, such as the size-specializing JIT or prepacking, would only be fair if we do the same in Julia.
It should be hard to beat Julia in the JIT department:
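For completeness, a head-to-head along those lines would look something like the sketch below (not the exact benchmark that produced the numbers above):

```julia
using BenchmarkTools, LinearAlgebra, Octavian

A, B = rand(8, 8), rand(8, 8)
C = similar(A)

@btime Octavian.matmul!($C, $A, $B)  # pure-Julia kernel
@btime mul!($C, $A, $B)              # whichever BLAS backend is loaded (e.g. MKL)
```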
@Mason, on the contrary. I think @Elrod’s work is amazing. I will be the first to use this instead of MKL. I don’t like MKL not being open and discriminating against AMD (though it seems they are working on that, but slowly).
My only argument is if we want to show we beat some other software we need to beat it at its prime.
What are the reactions at this forum when someone shows Julia is beaten by others and the Julia code isn’t optimized? We say it is not fair.
If we want to beat MKL we need to beat it at its best. It seems @Elrod is doing that. It just needs to be shown.