In a discussion today with some colleagues the subject of the Intel C/C++ compiler came up. The conventional wisdom in HPC still seems to be that Intel C/C++ produces native code with higher performance than, say, GCC or Clang, for the same set of input C/C++ sources (and associated set of MPI libraries, etc.). At least this is to be expected on Intel CPUs, but possibly also on AMD. So you essentially get performance for free by buying a license to the Intel compiler and rebuilding all your applications. I'm having a hard time finding a systematic comparison that shows this, although some anecdotal evidence exists suggesting a 10-20% performance increase isn't uncommon.
Regardless of whether this is all (still) true, I was wondering:
As C and C++ aren't the easiest languages for a compiler to optimize, it probably isn't surprising that a compiler that works harder at optimizing them produces better code, especially combined with the in-depth knowledge Intel engineers have of their own CPUs. But does Julia as a language have a head start on C/C++ in terms of lower optimization complexity, and can it therefore match the performance of code built with Intel C/C++ for the same implementation in Julia? I know this is comparing apples to oranges, but I hope the gist of my question is clear, especially since Julia positions itself as an alternative to C/C++ (or even Fortran, for which Intel also has a compiler). If it is clear that "Intel C/C++ performance" can be attained with Julia for free, then this could be an incentive for HPC centers and the like to invest more in promoting Julia to users, compared to buying compiler licenses.
Are there noticeable performance differences for Julia between "equivalent" Intel and AMD CPUs? Again a somewhat ill-defined question, but given CPUs with the same specs in terms of FLOPS, memory bandwidth, caches, etc., would one expect a significant difference in performance? I guess this depends mostly on the quality of the code generated by LLVM. But then again, the newer versions of the Intel compiler also seem to be based on LLVM, although Intel's engineers have probably added a lot of custom optimizations.
I believe that in theory a high-level language compiler could generate code for an algorithm that is faster than what the C/C++ compiler produces, if it has more information about the data objects and their usage than the C/C++ compiler is allowed to assume. I don't know if Julia has that information or not. I suspect in certain cases the answer is yes, but probably more often the answer is no.
This question also gets into the "fast enough" question. As programmers (in general) we are not going for the fastest execution of code, we are going for fast enough. If we wanted the absolute fastest we'd probably be writing in assembler, or compiling down to assembly and then hand-tuning that code.
As an example of that, I was looking at the BLAKE3 reference implementation in Rust. They have a native Rust implementation, then a faster implementation that uses SIMD instructions in a C library, then the fastest implementation that uses SIMD instructions in an assembler file. They obviously felt that the C compiler was NOT fast enough.
So the "correct" question for the general programming population is: does the language let me write code fast, is that code easy to maintain, and does it run fast enough?
Probably not the answers you were looking for, but I don't think Julia will ever run "real" code faster than C++ code built with the Intel compiler.
I wonder what Chris Elrod would have to say about this. He seems to be writing stuff in Julia that is beating the performance of everything else. A good understanding of the hardware and of how the code interfaces with the compiler appears to be enough for someone like him to write code that performs at the limits of the processor's capacity. I read somewhere that part of that performance comes from generating code at runtime for the types and sizes of the variables at hand, something more naturally done in a language like Julia.
In my (very limited) understanding, assembler code is only efficient if written explicitly for the concrete hardware/microarchitecture you are running it on. You have to hard-code e.g. SIMD usage, so if your program has to run on both AVX2 and AVX-512 machines, you cannot make use of the newer AVX-512 features, resulting in suboptimal code on the latter. Of course you could write both versions and choose the optimal one at run time depending on your microarchitecture, but this is a lot of work.
A compiler, in contrast, may optimize C or Julia code for your specific microarchitecture, which may be more efficient in some cases.
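This is, for what it's worth, something Julia largely sidesteps: since Julia compiles via LLVM at run time, it targets the host CPU by default, so the same source picks up whatever SIMD width the machine offers. A minimal sketch (the function and names are mine, not from this thread):

```julia
# LLVM vectorizes this loop for whatever CPU Julia is running on,
# so the same source can use SSE, AVX2, or AVX-512 as available.
function axpy!(y, a, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] += a * x[i]
    end
    return y
end

x = rand(1024); y = rand(1024)
axpy!(y, 2.0, x)

# Inspect which SIMD width was actually chosen on this machine:
# @code_native debuginfo=:none axpy!(y, 2.0, x)
```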
@paulmelis this is a great topic for discussion. I work in HPC and I agree that there is "folk wisdom" that the Intel compilers buy you performance. That was certainly the case in the past, and probably is true today.
I would love to see that Julia comparison between latest generation CPUs also.
Moving the discussion along a little: in the past on this forum we have seen benchmarking reports with Julia code which in the end turned out to be dominated by BLAS library performance. It depends on the code, of course! I think Intel put a lot of effort into those optimized math libraries, and that is what wins them performance.
Simply putting LoopVectorization.@tturbo on three for loops manages to do well until we run out of L2 cache. Results will differ by CPU of course. That particular CPU has much more L2 cache than most (18 cores, 1 MiB/core).
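For reference, the "three for loops" version looks roughly like this; a sketch modeled on the LoopVectorization documentation, not necessarily the exact benchmark code:

```julia
using LoopVectorization

# @tturbo (the threaded variant of @turbo) vectorizes and unrolls the
# naive triple-loop matrix multiply C = A * B.
function mygemm!(C, A, B)
    @tturbo for n in axes(C, 2), m in axes(C, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end
```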
MKL starts to run away from the competition around 3000x3000 on that computer.
On AMD systems, the small size performance of Julia-matrix multiply is unmatched:
The DifferentialEquations ecosystem doesn't really do any hardware-specific optimizations itself yet, but it (and ModelingToolkit in particular) is another great example of how code generation can be leveraged for better problem solving in Julia.
A long-term goal of mine is to work on an SLP vectorizer it can use, as well as an SPMD compiler like ISPC. DifferentialEquations would be the target for both of these, but they should be usable by interested Julia projects more generally.
But for now, I still have a lot of loop work ahead of me (in particular, modeling much more complicated loops so they can be optimized and still produce correct results).
For general compiler optimizations, I don't think high-level languages differ much from low-level languages, because a lot of these optimizations happen at a rather low level. They must be done at a low level because these kinds of program transformations rely on specific program properties, such as purity and invariants. These properties are much easier to prove at a low level, because a high-level IR is more complicated than a low-level one. A compiler designer working on a high-level IR has to cover more control-flow structures and operators. Also, to use this kind of high-level information, they must be careful to preserve it during code transformation and lowering, which is really hard.
What you said here reminds me that a few months ago I read an article about Rust on Hacker News claiming that Rust can be faster than C++ because ownership can improve the results of alias analysis and lead to better code generation. Then someone who works on these types of compiler optimizations disagreed and showed that C++ can actually do that even without an ownership system…
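For the curious, Julia's closest analogue to that aliasing discussion is the `ivdep` hint, which, like `restrict` in C, asserts that there are no loop-carried dependencies. A toy sketch (names mine):

```julia
# Without the annotation, the compiler may have to assume y and x can
# overlap and emit runtime overlap checks or more conservative code.
# `ivdep` promises there are no loop-carried dependencies (analogous to
# C's `restrict`); using it on overlapping arrays gives wrong answers.
function scale!(y, x, a)
    @inbounds @simd ivdep for i in eachindex(y, x)
        y[i] = a * x[i]
    end
    return y
end
```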
icc is not officially supported for building Julia base. It won't buy you much anyway, since it only affects the Julia runtime, which typically doesn't handle anything performance sensitive.
Given the recurring complaints about LLVM becoming noticeably slower and slower, it seems any speedup in the LLVM area would be appreciated. So perhaps building Julia and LLVM with Intel ICC could actually have some (but probably small) impact on overall Julia performance.
I was thinking more of smaller and fewer binaries than of a difference in performance. I remember a time when Octave was built both with Visual Studio and cross-compiled. At that time the VS binary was 20 MB and the other 200 MB. Of course I'm not expecting a 10x size difference, but even if it were only half the size that would already be a big win. Not so much for now, but for when one is able to create stand-alone programs without having to carry hundreds of megabytes of large files even for small compiled programs.
Iβm just ccalling dgemm through MKL_jll.
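For context, the direct call is along these lines; a hedged sketch that assumes MKL_jll's default LP64 interface (32-bit BLAS integers), which may differ from the actual benchmark setup:

```julia
using MKL_jll  # provides libmkl_rt

# Minimal direct dgemm call: C = A * B for Float64 matrices.
# Assumes the LP64 interface; an ILP64 build would need Int64 instead.
function mkl_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    m, k = size(A)
    n = size(B, 2)
    ccall((:dgemm_, MKL_jll.libmkl_rt), Cvoid,
          (Ref{UInt8}, Ref{UInt8},                  # transA, transB
           Ref{Int32}, Ref{Int32}, Ref{Int32},      # m, n, k
           Ref{Float64}, Ptr{Float64}, Ref{Int32},  # alpha, A, lda
           Ptr{Float64}, Ref{Int32},                # B, ldb
           Ref{Float64}, Ptr{Float64}, Ref{Int32}), # beta, C, ldc
          UInt8('N'), UInt8('N'), m, n, k, 1.0, A, m, B, k, 0.0, C, m)
    return C
end
```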
PRs to improve it or alternatives are welcome.
Note at those small sizes (<100) MKL is much faster than OpenBLAS and BLIS on the Intel CPU (and performs about the same as those two on AMD).
I don't need to run it myself.
I just think that if one compares against another system to show one has a better solution, one must bring the other system to its optimal state.
Just as you should do in papers: when comparing to others, we either take the optimal parameters or ask the developers to bring their best.
The overhead of MKL is big. I am sure you use @inbounds and remove other checks in the Julia code.
We'll be more than happy to have a neutral package beat MKL, but we need to beat MKL's best.
It's time to have the best BLAS performance in open source form.
Octavian isn't specializing on matrix size, so if you want to compare against MKL's JIT, then we should use statically sized StrideArrays.StrideArray for Octavian.
Do you know how to optimize the MKL run in detail, and what MKL_jll is not doing? That would be super helpful, not only for the people developing the alternatives, but for everyone else using it.
The direct call was much faster (90 ns vs 179 ns), but still much slower than Octavian's matmul! (30 ns).
Things other than the direct call, such as the size-specializing JIT or prepacking, would only be fair if we do the same in Julia.
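For anyone wanting to reproduce that kind of comparison, the measurement is along these lines (sizes and setup are my guesses, not the exact benchmark; timings in the 30 ns class additionally relied on statically sized arrays):

```julia
using Octavian, LinearAlgebra, BenchmarkTools

n = 8                # "small size" regime; plain Matrix is somewhat
A = rand(n, n)       # slower than statically sized arrays here
B = rand(n, n)
C = similar(A)

@btime matmul!($C, $A, $B)   # Octavian's pure-Julia gemm
@btime mul!($C, $A, $B)      # whatever BLAS LinearAlgebra is configured with
```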
It should be hard to beat Julia in the JIT department:
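For example, with statically sized arrays the compiler specializes the entire multiply on the size at compile time, which is roughly what MKL's JIT does but built into the language (illustration mine, assuming StaticArrays is installed):

```julia
using StaticArrays, BenchmarkTools

# The 4×4 multiply is compiled once for the SMatrix{4,4,Float64} type:
# fully unrolled, with no size checks or dispatch at run time.
A = @SMatrix rand(4, 4)
B = @SMatrix rand(4, 4)
@btime $A * $B
```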
@Mason, on the contrary: I think @Elrod's work is amazing. I will be the first to use this instead of MKL. I don't like MKL not being open and discriminating against AMD (though it seems they are working on that, but slowly).
My only argument is if we want to show we beat some other software we need to beat it at its prime.
What are the reactions on this forum when someone shows Julia being beaten by others and the Julia code isn't optimized? We say it is not fair.
If we want to beat MKL we need to beat it on its best. It seems @Elrod is doing that. It just needs to be shown.