Intel C/C++ compiler performance versus Julia

In a discussion today with some colleagues, the subject of the Intel C/C++ compiler came up. The conventional wisdom in HPC still seems to be that Intel C/C++ produces native code with higher performance than, say, GCC or Clang, for the same set of input C/C++ sources (and associated set of MPI libraries, etc.). At least this is to be expected on Intel CPUs, but possibly also on AMD. So you essentially get performance for free by buying a license for the Intel compiler and rebuilding all your applications. I’m having a hard time finding a systematic comparison that shows this, although some anecdotal evidence exists suggesting that a 10-20% performance increase isn’t uncommon.

Regardless of whether this is all (still) true, I was wondering:

  • As C and C++ aren’t the easiest languages for a compiler to optimize, it probably isn’t surprising that a compiler that works harder at optimizing them produces better code, especially when combined with the in-depth knowledge Intel engineers have of their own CPUs. But does Julia as a language have a head start on C/C++ in terms of lower optimization complexity, and can it therefore match the performance of code built with Intel C/C++ for the same implementation in Julia? I know this is comparing apples to oranges, but I hope the gist of my question is clear, especially since Julia positions itself as an alternative to C/C++ (or even Fortran, for which Intel also has a compiler). If it is clear that β€œIntel C/C++ performance” can be attained with Julia for free, then this can be an incentive for HPC centers and the like to invest more in promoting Julia to users, compared to buying compiler licenses.

  • Are there noticeable performance differences for Julia between β€œequivalent” Intel and AMD CPUs? Again a somewhat ill-defined question, but given CPUs with the same specs in terms of FLOPS, memory bandwidth, caches, etc., would one expect a significant difference in performance? I guess this depends mostly on the quality of the code generated by LLVM. But then again, the newer versions of the Intel compiler also seem to be based on LLVM, although the Intel engineers have probably added a lot of custom optimizations.

6 Likes

I believe that in theory a high-level language compiler could generate code for an algorithm that is faster than what a C/C++ compiler produces, if it has more information about the data objects and their usage than the C/C++ compiler is allowed to assume. I don’t know if Julia has that information or not. I suspect in certain cases the answer is yes, but probably more often the answer is no.
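For what it’s worth, one kind of information Julia’s compiler always has is the concrete argument types: every method is compiled into a separate instance specialized on them, so the optimizer sees concrete element types rather than opaque pointers. A tiny sketch (the function name here is just for illustration):

using InteractiveUtils  # provides code_llvm (already loaded in the REPL)

sumsq(v) = sum(x -> x * x, v)

code_llvm(sumsq, Tuple{Vector{Float64}})  # one specialization, knowing the elements are Float64
code_llvm(sumsq, Tuple{Vector{Int32}})    # a separate specialization generated for Int32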

This question also gets into the β€œfast enough” question. As programmers (in general) we are not going for the fastest execution of code; we are going for fast enough. If we wanted the absolute fastest we’d probably be writing in assembler, or compiling down to assembly and then hand-tuning that code.

As an example of that, I was looking at the BLAKE3 reference implementation in Rust. They have a native Rust implementation, then a faster implementation that uses SIMD instructions in a C library, then the fastest implementation that uses SIMD instructions in an assembler file. They obviously felt that the C compiler was NOT fast enough.

So the β€œcorrect” question for the general programming population is: does the language let me write code quickly, is that code easy to maintain, and does it run fast enough?

Probably not the answers you were looking for, but I don’t think Julia will ever run β€œreal” code faster than C++ code built with the Intel compiler.

9 Likes

A side question is whether the Intel compiler could be used to build Julia on Windows.

I wonder what Chris Elrod would have to say about this. He seems to be writing stuff in Julia that is beating the performance of everything else. A good understanding of the hardware and of how the code interfaces with the compiler appears to be enough for (someone like him) to write code that performs at the limits of the processor’s capacity. I read somewhere that part of that performance comes from code generated at runtime for the types and sizes of the variables at hand, something that is more natural to do in a language like Julia.
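For instance, StaticArrays gives a simple illustration of that kind of size-and-type specialization (a minimal example, not anything from Chris’s packages):

using StaticArrays

A = @SMatrix rand(3, 3)   # the 3Γ—3 size is part of the type
B = @SMatrix rand(3, 3)
A * B  # on first call, Julia generates a fully unrolled multiply specialized for 3Γ—3 Float64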

9 Likes

In my (very limited) understanding, assembler code is only efficient if written explicitly for the concrete hardware / microarchitecture you are running it on. You have to hard-code e.g. SIMD usage, so if your program should run on both AVX2 and AVX-512 machines, you cannot make use of the newer AVX-512 features, resulting in suboptimal code on the latter. Of course you could write both versions and choose the optimal one at run time depending on the microarchitecture, but this is a lot of work.
A compiler, in contrast, may optimize C or Julia code for your specific microarchitecture, which may be more efficient in some cases.
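A minimal sketch of that β€œwrite both and pick at run time” idea in Julia might look as follows. The two kernels are just placeholders standing in for genuinely hand-tuned AVX2 / AVX-512 code paths, and the feature check is a crude stand-in (a real implementation would query CPUID, e.g. via a package such as CpuId.jl):

# Crude feature check based on the LLVM CPU name Julia reports.
has_avx512() = occursin("avx512", Sys.CPU_NAME)

function axpy_narrow!(y, a, x)   # placeholder for a hand-tuned AVX2 kernel
    @inbounds @simd for i in eachindex(y, x)
        y[i] += a * x[i]
    end
    return y
end
axpy_wide!(y, a, x) = axpy_narrow!(y, a, x)   # placeholder for an AVX-512 kernel

# Select one implementation once, based on the CPU the program actually runs on.
const axpy_best! = has_avx512() ? axpy_wide! : axpy_narrow!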

5 Likes

@paulmelis this is a great topic for discussion. I work in HPC and I agree that there is β€˜folk wisdom’ that the Intel compilers buy you performance. That was certainly the case in the past, and probably is true today.
I would love to see that Julia comparison between latest generation CPUs also.

Moving the discussion along a little: in the past on this forum we have seen benchmarking reports with Julia code which in the end turned out to be dominated by the BLAS library performance. It depends on the code, of course! I think Intel put a lot of effort into those optimised math libraries, and that is what wins them performance.
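(As an aside, when a benchmark is suspected of being dominated by BLAS, it is worth checking which backend Julia is actually calling; depending on the Julia version, something like the following does it:)

using LinearAlgebra
BLAS.vendor()        # Julia ≀ 1.6: typically :openblas64 for a default install
# BLAS.get_config()  # Julia β‰₯ 1.7: lists the backends loaded via libblastrampoline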

6 Likes

Multithreaded Julia matrix multiply (specifically, Octavian.jl) β€œdominates” MKL below 100Γ—100, and hangs on in performance beyond that.

Simply putting LoopVectorization.@tturbo on three for loops manages to do well until we run out of L2 cache. Results will differ by CPU of course. That particular CPU has much more L2 cache than most (18 cores, 1 MiB/core).
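The three-loop version amounts to roughly the following (a sketch of the pattern, not Octavian’s internal kernel):

using LoopVectorization

function matmul_tturbo!(C, A, B)
    @tturbo for n in axes(B, 2), m in axes(A, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end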

MKL starts to run away from the competition around 3000x3000 on that computer.
On AMD systems, the small-size performance of the Julia matrix multiply is unmatched:


This was on a Ryzen 4900HS.

The DifferentialEquations ecosystem doesn’t really do any hardware-specific optimizations itself yet, but it (and ModelingToolkit in particular) is another great example of how code generation can be leveraged for better problem solving in Julia.
A long-term goal of mine is to work on an SLP vectorizer it can use, as well as an SPMD compiler like ISPC. DifferentialEquations would be the target for both of these, but they should be usable by interested Julia projects more generally.

But for now, I still have a lot of loop work ahead of me (in particular, modeling much more complicated loops so they can be optimized and still produce correct results).

29 Likes

For general compiler optimizations, I don’t think a high-level language differs much from low-level languages, because a lot of these optimizations happen at a rather low level. They must be done at a low level because these kinds of program transformations rely on specific program properties, such as purity and invariants. These properties are much easier to prove at a low level, because high-level IR is more complicated than low-level IR: a compiler designer working on a high-level IR has to cover more control-flow structures and operators. Also, to use this kind of high-level information, he must be careful to preserve it during code transformation and lowering, which is really hard.
What you said here reminds me that a few months ago I read an article about Rust on Hacker News claiming that Rust can be faster than C++ because ownership can improve the results of alias analysis and lead to better code generation. Then someone who works on these kinds of compiler optimizations disagreed and showed that C++ can actually do that even without an ownership system…

2 Likes

icc is not officially supported for building Julia base. It won’t buy you much anyway, since it only affects the Julia runtime, which typically doesn’t handle anything performance-sensitive.

1 Like

Do note that there is a fair amount of Intel activity in contributing to LLVM. Here is Intel’s staging area for such contributions:

Given the various complaints about LLVM becoming noticeably slower and slower, it seems any speedup in the LLVM area would be appreciated. So perhaps building Julia and LLVM with Intel ICC could actually have some (but probably small) impact on overall Julia performance.

1 Like

I was thinking more of smaller and fewer binaries than of a difference in performance. I remember a time when Octave was built both with Visual Studio and cross-compiled. At that time the VS binary was 20 MB and the other 200 MB. Of course I’m not expecting a 10x size difference, but even if it were only half the size that would already be a big win. Not so much for now, but for when one is able to create stand-alone programs and doesn’t have to carry hundreds of megabytes of files even for small compiled programs.

Are you sure you’re using MKL in its optimized form?
Namely using all its features to handle smaller matrices and reduce the overhead?

I’m just ccalling dgemm through MKL_jll.
PRs to improve it or alternatives are welcome.
Note at those small sizes (<100) MKL is much faster than OpenBLAS and BLIS on the Intel CPU (and performs about the same as those two on AMD).
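For context, a bare-bones version of such a ccall might look like the sketch below. This is a minimal illustration rather than the BLASBenchmarksCPU wrapper, and it assumes the conventional LP64 Fortran symbol dgemm_ (32-bit BLAS integers) in libmkl_rt:

using MKL_jll

# Minimal sketch: C = A*B for dense, column-major Float64 matrices.
function mkl_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    M, N = size(C); K = size(A, 2)
    ccall((:dgemm_, MKL_jll.libmkl_rt), Cvoid,
          (Ref{UInt8}, Ref{UInt8},                   # transa, transb
           Ref{Int32}, Ref{Int32}, Ref{Int32},       # m, n, k
           Ref{Float64}, Ptr{Float64}, Ref{Int32},   # alpha, A, lda
           Ptr{Float64}, Ref{Int32},                 # B, ldb
           Ref{Float64}, Ptr{Float64}, Ref{Int32}),  # beta, C, ldc
          'N', 'N', M, N, K, 1.0, A, M, B, K, 0.0, C, M)
    return C
end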

To run benchmarks yourself:

using BLASBenchmarksCPU
rb = runbench(libs = [:Octavian,:MKL,:blis], sizes = logspace(10, 10_000, 300));
plot(rb, displayplot = false);
3 Likes

I don’t need to run it myself.
I just think that if one compares against another system to show that one has a better solution, one must bring the other system to its optimal state.

Just as you should do in papers: when comparing to others, we either take the optimal parameters or ask the developers to bring their best.

The overhead of MKL is big. I am sure you use @inbounds and remove other checks in the Julia code.
We’d be more than happy to have a neutral package that beats MKL, but we need to beat MKL at its best.
It’s time to have the best BLAS performance in open source form.

2 Likes
julia> using Octavian

julia> Octavian.matmul!(rand(2,3), rand(3,4), rand(3,4))
ERROR: AssertionError: Size mismatch.
Stacktrace:
 [1] matmul_sizes
   @ ~/.julia/dev/Octavian/src/utils.jl:15 [inlined]
 [2] _matmul!
   @ ~/.julia/dev/Octavian/src/matmul.jl:247 [inlined]
 [3] matmul!
   @ ~/.julia/dev/Octavian/src/matmul.jl:236 [inlined]
 [4] matmul!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
   @ Octavian ~/.julia/dev/Octavian/src/matmul.jl:220
 [5] top-level scope
   @ REPL[6]:1

Anyway, a few size checks don’t take 700 nanoseconds. This PR adds gemm_direct support:

julia> using Octavian, BLASBenchmarksCPU, BenchmarkHistograms

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);

julia> @benchmark gemmmkl_direct!($C0,$A,$B)
samples: 10000; evals/sample: 9; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2710.0 - 2830.0]  β–ˆβ–Ž227
 (2830.0 - 2950.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹2099
 (2950.0 - 3070.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 5985
 (3070.0 - 3200.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š1545
 (3200.0 - 3320.0]  β–‹111
 (3320.0 - 3440.0]  ▏6
 (3440.0 - 3560.0]  ▏2
 (3560.0 - 3680.0]  ▏1
 (3680.0 - 3800.0]   0
 (3800.0 - 3920.0]   0
 (3920.0 - 4040.0]  ▏1
 (4040.0 - 4160.0]  ▏1
 (4160.0 - 4280.0]  ▏7
 (4280.0 - 4400.0]  ▏5
 (4400.0 - 8870.0]  ▏10

                  Counts

min: 2.712 ΞΌs (0.00% GC); mean: 3.011 ΞΌs (0.00% GC); median: 3.015 ΞΌs (0.00% GC); max: 8.875 ΞΌs (0.00% GC).

julia> @benchmark gemmmkl!($C0,$A,$B)
samples: 10000; evals/sample: 9; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2660.0 - 2780.0]  β–ˆβ–‹339
 (2780.0 - 2900.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ1171
 (2900.0 - 3020.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 6427
 (3020.0 - 3140.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š1865
 (3140.0 - 3260.0]  β–Š155
 (3260.0 - 3380.0]  ▏16
 (3380.0 - 3500.0]  ▏2
 (3500.0 - 3620.0]   0
 (3620.0 - 3740.0]   0
 (3740.0 - 3860.0]   0
 (3860.0 - 3980.0]   0
 (3980.0 - 4100.0]  ▏1
 (4100.0 - 4220.0]  ▏6
 (4220.0 - 4340.0]  ▏8
 (4340.0 - 9160.0]  ▏10

                  Counts

min: 2.658 ΞΌs (0.00% GC); mean: 2.967 ΞΌs (0.00% GC); median: 2.972 ΞΌs (0.00% GC); max: 9.162 ΞΌs (0.00% GC).

julia> @benchmark matmul!($C0,$A,$B)
samples: 10000; evals/sample: 10; memory estimate: 0 bytes; allocs estimate: 0
ns

 (1955.0 - 1997.0 ]  β–ˆβ–ˆβ–381
 (1997.0 - 2039.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ1915
 (2039.0 - 2080.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 5048
 (2080.0 - 2122.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž2387
 (2122.0 - 2163.0 ]  β–‹87
 (2163.0 - 2205.0 ]  β–Œ79
 (2205.0 - 2247.0 ]  β–Œ76
 (2247.0 - 2288.0 ]  ▏11
 (2288.0 - 2330.0 ]  ▏2
 (2330.0 - 2371.0 ]  ▏1
 (2371.0 - 2413.0 ]  ▏2
 (2413.0 - 2455.0 ]   0
 (2455.0 - 2496.0 ]   0
 (2496.0 - 2538.0 ]  ▏1
 (2538.0 - 10286.0]  ▏10

                  Counts

min: 1.955 ΞΌs (0.00% GC); mean: 2.064 ΞΌs (0.00% GC); median: 2.064 ΞΌs (0.00% GC); max: 10.286 ΞΌs (0.00% GC).

julia> 2e-9*M*K*N ./ (1.955e-6, 2.658e-6, 2.712e-6) # max GFLOPS
(381.8393861892584, 280.8487584650114, 275.25663716814165)

julia> 2e-9*M*K*N ./ (2.064e-6, 2.967e-6, 3.015e-6) # median GFLOPS
(361.67441860465124, 251.59959555106173, 247.59402985074632)

Octavian isn’t specializing on matrix size, so if you want to use MKL’s JIT, then to be fair we should use statically sized arrays (e.g. StrideArrays.StrideArray) for Octavian.

10 Likes

Do you know how to optimize the MKL run in detail, and what MKL_jll is not doing? That would be super helpful not only for the people developing the alternatives, but also for everyone else using it.

2 Likes

MKL_DIRECT_CALL and the JIT are the two most common things Royi brings up, IIRC.
At very small sizes, the direct call helps:

julia> using BLASBenchmarksCPU, Octavian, BenchmarkHistograms

julia> M = K = N = 8; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.504561 seconds (2.53 M allocations: 133.894 MiB, 6.07% gc time, 99.99% compilation time)

julia> @benchmark matmul!($C0,$A,$B) # Octavian.jl
samples: 10000; evals/sample: 995; memory estimate: 0 bytes; allocs estimate: 0
ns

 (29.2 - 30.3]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 9590
 (30.3 - 31.4]  β–‰272
 (31.4 - 32.5]  ▏12
 (32.5 - 33.5]  ▏2
 (33.5 - 34.6]  ▏2
 (34.6 - 35.7]  ▏3
 (35.7 - 36.8]  ▏1
 (36.8 - 37.8]   0
 (37.8 - 38.9]  ▏2
 (38.9 - 40.0]  ▏37
 (40.0 - 41.1]  ▏1
 (41.1 - 42.1]   0
 (42.1 - 43.2]  ▏21
 (43.2 - 44.3]  β–Ž47
 (44.3 - 71.6]  ▏10

                  Counts

min: 29.228 ns (0.00% GC); mean: 29.660 ns (0.00% GC); median: 29.518 ns (0.00% GC); max: 71.636 ns (0.00% GC).

julia> @benchmark gemmmkl!($C0,$A,$B) # MKL dgemm
samples: 10000; evals/sample: 705; memory estimate: 0 bytes; allocs estimate: 0
ns

 (178.8 - 179.6]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 4448
 (179.6 - 180.5]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰4269
 (180.5 - 181.3]  β–ˆβ–ˆβ–ˆβ–ˆβ–‰719
 (181.3 - 182.2]  β–ˆβ–ˆβ–ˆβ–Œ514
 (182.2 - 183.1]  β–Ž26
 (183.1 - 183.9]  ▏1
 (183.9 - 184.8]   0
 (184.8 - 185.6]  ▏1
 (185.6 - 186.5]   0
 (186.5 - 187.4]  ▏1
 (187.4 - 188.2]  ▏2
 (188.2 - 189.1]  ▏1
 (189.1 - 190.0]  ▏6
 (190.0 - 190.8]  ▏2
 (190.8 - 240.1]  ▏10

                  Counts

min: 178.756 ns (0.00% GC); mean: 179.870 ns (0.00% GC); median: 179.661 ns (0.00% GC); max: 240.072 ns (0.00% GC).

julia> @benchmark gemmmkl_direct!($C0,$A,$B) # MKL dgemm_direct
samples: 10000; evals/sample: 954; memory estimate: 0 bytes; allocs estimate: 0
ns

 (90.4  - 91.2 ]  ▏6
 (91.2  - 92.0 ]  β–ˆβ–ˆβ–ˆβ–Œ934
 (92.0  - 92.7 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 8121
 (92.7  - 93.5 ]  β–ˆ263
 (93.5  - 94.2 ]  β–ˆβ–ˆβ–638
 (94.2  - 95.0 ]  ▏21
 (95.0  - 95.7 ]  ▏1
 (95.7  - 96.5 ]   0
 (96.5  - 97.3 ]   0
 (97.3  - 98.0 ]  ▏2
 (98.0  - 98.8 ]   0
 (98.8  - 99.5 ]  ▏3
 (99.5  - 100.3]   0
 (100.3 - 101.1]  ▏1
 (101.1 - 136.1]  ▏10

                  Counts

min: 90.436 ns (0.00% GC); mean: 92.431 ns (0.00% GC); median: 92.367 ns (0.00% GC); max: 136.131 ns (0.00% GC).

The direct call was much faster (90 ns vs 179 ns), but still much slower than Octavian’s matmul! (30 ns).
Things other than the direct call, such as the size-specializing JIT or prepacking, would only be fair if we do the same in Julia.

It should be hard to beat Julia in the JIT department:

julia> using StaticArrays

julia> Am = @MMatrix rand(8,8); Bm = @MMatrix rand(8,8); Cm = similar(Am);

julia> @benchmark matmul!($Cm,$Am,$Bm) # StaticArrays.MMatrix are statically sized
samples: 10000; evals/sample: 999; memory estimate: 0 bytes; allocs estimate: 0
ns

 (12.71 - 12.88]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–9859
 (12.88 - 13.04]   0
 (13.04 - 13.2 ]   0
 (13.2  - 13.37]   0
 (13.37 - 13.53]   0
 (13.53 - 13.69]   0
 (13.69 - 13.86]  β–Ž63
 (13.86 - 14.02]  β–Ž51
 (14.02 - 14.18]  ▏8
 (14.18 - 14.34]  ▏4
 (14.34 - 14.51]  ▏3
 (14.51 - 14.67]   0
 (14.67 - 14.83]   0
 (14.83 - 15.0 ]  ▏2
 (15.0  - 51.84]  ▏10

                  Counts

min: 12.714 ns (0.00% GC); mean: 12.791 ns (0.00% GC); median: 12.770 ns (0.00% GC); max: 51.842 ns (0.00% GC).
14 Likes

Why not? You’re the one who seems to be assuming without checking that the benchmarks were not run fairly.

5 Likes

@Mason, on the contrary. I think @Elrod’s work is amazing. I will be the first to use this instead of MKL. I don’t like MKL not being open and discriminating against AMD (though it seems they are working on that, but slowly).

My only argument is that if we want to show we beat some other software, we need to beat it at its prime.
What are the reactions on this forum when someone shows Julia being beaten by others and the Julia code isn’t optimized? We say it is not fair.

If we want to beat MKL we need to beat it at its best. It seems @Elrod is doing that. It just needs to be shown.

3 Likes