Intel C/C++ compiler performance versus Julia

In a discussion today with some colleagues, the subject of the Intel C/C++ compiler came up. The conventional wisdom in HPC still seems to be that Intel C/C++ produces native code with higher performance than, say, GCC or Clang, for the same set of input C/C++ sources (and associated set of MPI libraries, etc.). At least this is to be expected on Intel CPUs, but possibly also on AMD. So you essentially get performance for free by buying a license for the Intel compiler and rebuilding all your applications. I’m having a hard time finding a systematic comparison that shows this, although some anecdotal evidence exists suggesting that a 10-20% performance increase isn’t uncommon.

Regardless of whether this is all (still) true, I was wondering:

  • As C and C++ aren’t the easiest languages for a compiler to optimize, it probably isn’t surprising that a compiler that works harder at optimizing them produces better code, especially when combined with the in-depth knowledge Intel engineers have of their own CPUs. But does Julia as a language have a head start on C/C++ in terms of lower optimization complexity, and can it therefore match the performance of code built with Intel C/C++ for the same implementation in Julia? I know this is comparing apples to oranges, but I hope the gist of my question is clear, especially since Julia positions itself as an alternative to C/C++ (or even Fortran, for which Intel also has a compiler). If it is clear that β€œIntel C/C++ performance” can be attained with Julia for free, then this can be an incentive for HPC centers and the like to invest more in promoting Julia to users, compared to buying compiler licenses.

  • Are there noticeable performance differences for Julia between β€œequivalent” Intel and AMD CPUs? Again a somewhat ill-defined question, but given CPUs with the same specs in terms of FLOPS, memory bandwidth, caches, etc., would one expect a significant difference in performance? I guess this depends mostly on the quality of the code generated by LLVM. But then again, the newer versions of the Intel compiler also seem to be based on LLVM, although the Intel engineers have probably added a lot of custom optimizations.

6 Likes

I believe that in theory a high-level language compiler could generate code for an algorithm that is faster than what a C/C++ compiler produces, if it has more information about the data objects and their usage than the C/C++ compiler is allowed to assume. I don’t know if Julia has that information or not. I suspect in certain cases the answer is yes, but probably more often the answer is no.
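For what it’s worth, one kind of information Julia’s compiler always has is the concrete argument types: every method is compiled into a separate instance specialized on them, so the optimizer sees concrete element types rather than opaque pointers. A tiny sketch (the function name here is just for illustration):

using InteractiveUtils  # provides code_llvm (already loaded in the REPL)

sumsq(v) = sum(x -> x * x, v)

code_llvm(sumsq, Tuple{Vector{Float64}})  # one specialization, knowing the elements are Float64
code_llvm(sumsq, Tuple{Vector{Int32}})    # a separate specialization generated for Int32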

This question also gets into the β€œfast enough” question. As programmers (in general) we are not going for the fastest execution of code; we are going for fast enough. If we wanted the absolute fastest we’d probably be writing in assembler, or compiling down to assembly and then hand-tuning that code.

As an example of that, I was looking at the BLAKE3 reference implementation in Rust. They have a native Rust implementation, then a faster implementation that uses SIMD instructions in a C library, then the fastest implementation that uses SIMD instructions in an assembler file. They obviously felt that the C compiler was NOT fast enough.

So the β€œcorrect” question for the general programming population is: does the language let me write code quickly, is that code easy to maintain, and does it run fast enough?

Probably not the answers you were looking for, but I don’t think Julia will ever run β€œreal” code faster than C++ code built with the Intel compiler.

9 Likes

A side question is whether the Intel compiler could be used to build Julia on Windows.

I wonder what Chris Elrod would have to say about this. He seems to be writing stuff in Julia that is beating the performance of everything else. A good understanding of the hardware and of how the code interfaces with the compiler appears to be enough for (someone like him) to write code that performs at the limits of the processor’s capacity. I read somewhere that part of that performance comes from code generated at runtime for the types and sizes of the variables at hand, something that is more natural to do in a language like Julia.
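For instance, StaticArrays gives a simple illustration of that kind of size-and-type specialization (a minimal example, not anything from Chris’s packages):

using StaticArrays

A = @SMatrix rand(3, 3)   # the 3Γ—3 size is part of the type
B = @SMatrix rand(3, 3)
A * B  # on first call, Julia generates a fully unrolled multiply specialized for 3Γ—3 Float64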

9 Likes

In my (very limited) understanding, assembler code is only efficient if written explicitly for the concrete hardware / microarchitecture you are running it on. You have to hard-code e.g. SIMD usage, so if your program should run on both AVX2 and AVX-512 machines, you cannot make use of the newer AVX-512 features, resulting in suboptimal code on the latter. Of course you could write both versions and choose the optimal one at run time depending on the microarchitecture, but this is a lot of work.
A compiler, in contrast, may optimize C or Julia code for your specific microarchitecture, which may be more efficient in some cases.
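A minimal sketch of that β€œwrite both and pick at run time” idea in Julia might look as follows. The two kernels are just placeholders standing in for genuinely hand-tuned AVX2 / AVX-512 code paths, and the feature check is a crude stand-in (a real implementation would query CPUID, e.g. via a package such as CpuId.jl):

# Crude feature check based on the LLVM CPU name Julia reports.
has_avx512() = occursin("avx512", Sys.CPU_NAME)

function axpy_narrow!(y, a, x)   # placeholder for a hand-tuned AVX2 kernel
    @inbounds @simd for i in eachindex(y, x)
        y[i] += a * x[i]
    end
    return y
end
axpy_wide!(y, a, x) = axpy_narrow!(y, a, x)   # placeholder for an AVX-512 kernel

# Select one implementation once, based on the CPU the program actually runs on.
const axpy_best! = has_avx512() ? axpy_wide! : axpy_narrow!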

5 Likes

@paulmelis this is a great topic for discussion. I work in HPC and I agree that there is β€˜folk wisdom’ that the Intel compilers buy you performance. That was certainly the case in the past, and probably is true today.
I would love to see that Julia comparison between latest generation CPUs also.

Moving the discussion along a little: in the past on this forum we have seen benchmarking reports with Julia code which in the end turned out to be dominated by the BLAS library performance. It depends on the code, of course! I think Intel put a lot of effort into those optimised math libraries, and that is what wins them performance.
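(As an aside, when a benchmark is suspected of being dominated by BLAS, it is worth checking which backend Julia is actually calling; depending on the Julia version, something like the following does it:)

using LinearAlgebra
BLAS.vendor()        # Julia ≀ 1.6: typically :openblas64 for a default install
# BLAS.get_config()  # Julia β‰₯ 1.7: lists the backends loaded via libblastrampoline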

6 Likes

Multithreaded Julia matrix multiply (specifically, Octavian.jl) β€œdominates” MKL below 100Γ—100, and hangs on in performance beyond that.

Simply putting LoopVectorization.@tturbo on three for loops manages to do well until we run out of L2 cache. Results will differ by CPU of course. That particular CPU has much more L2 cache than most (18 cores, 1 MiB/core).
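The three-loop version amounts to roughly the following (a sketch of the pattern, not Octavian’s internal kernel):

using LoopVectorization

function matmul_tturbo!(C, A, B)
    @tturbo for n in axes(B, 2), m in axes(A, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end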

MKL starts to run away from the competition around 3000x3000 on that computer.
On AMD systems, the small-size performance of the Julia matrix multiply is unmatched:


This was on a Ryzen 4900HS.

The DifferentialEquations ecosystem doesn’t really do any hardware-specific optimizations itself yet, but it (and ModelingToolkit in particular) is another great example of how code generation can be leveraged for better problem solving in Julia.
A long-term goal of mine is to work on an SLP vectorizer it can use, as well as an SPMD compiler like ISPC. DifferentialEquations would be the target for both of these, but they should be usable by interested Julia projects more generally.

But for now, I still have a lot of loop work ahead of me (in particular, modeling much more complicated loops so they can be optimized and still produce correct results).

29 Likes

For general compiler optimizations, I don’t think a high-level language differs much from low-level languages, because a lot of these optimizations happen at a rather low level. They must be done at a low level because these kinds of program transformations rely on specific program properties, such as purity and invariants. These properties are much easier to prove at a low level, because high-level IR is more complicated than low-level IR: a compiler designer working on a high-level IR has to cover more control-flow structures and operators. Also, to use this kind of high-level information, he must be careful to preserve it during code transformation and lowering, which is really hard.
What you said here reminds me that a few months ago I read an article about Rust on Hacker News claiming that Rust can be faster than C++ because ownership can improve the results of alias analysis and lead to better code generation. Then someone who works on these kinds of compiler optimizations disagreed and showed that C++ can actually do that even without an ownership system…

2 Likes

icc is not officially supported for building Julia base. It won’t buy you much anyway, since it only affects the Julia runtime, which typically doesn’t handle anything performance-sensitive.

1 Like

Do note that there is a fair amount of Intel activity in contributing to LLVM. Here is Intel’s staging area for such contributions:

Given the various complaints about LLVM becoming noticeably slower and slower, it seems any speedup in the LLVM area would be appreciated. So perhaps building Julia and LLVM with Intel ICC could actually have some (but probably small) impact on overall Julia performance.

1 Like

I was thinking more of smaller and fewer binaries than of a difference in performance. I remember a time when Octave was built both with Visual Studio and cross-compiled. At that time the VS binary was 20 MB and the other 200 MB. Of course I’m not expecting a 10x size difference, but even if it were only half the size that would already be a big win. Not so much for now, but for when one is able to create stand-alone programs and doesn’t have to carry hundreds of megabytes of files even for small compiled programs.

Are you sure you’re using MKL in its optimized form?
Namely using all its features to handle smaller matrices and reduce the overhead?

I’m just ccalling dgemm through MKL_jll.
PRs to improve it or alternatives are welcome.
Note at those small sizes (<100) MKL is much faster than OpenBLAS and BLIS on the Intel CPU (and performs about the same as those two on AMD).
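For context, a bare-bones version of such a ccall might look like the sketch below. This is a minimal illustration rather than the BLASBenchmarksCPU wrapper, and it assumes the conventional LP64 Fortran symbol dgemm_ (32-bit BLAS integers) in libmkl_rt:

using MKL_jll

# Minimal sketch: C = A*B for dense, column-major Float64 matrices.
function mkl_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    M, N = size(C); K = size(A, 2)
    ccall((:dgemm_, MKL_jll.libmkl_rt), Cvoid,
          (Ref{UInt8}, Ref{UInt8},                   # transa, transb
           Ref{Int32}, Ref{Int32}, Ref{Int32},       # m, n, k
           Ref{Float64}, Ptr{Float64}, Ref{Int32},   # alpha, A, lda
           Ptr{Float64}, Ref{Int32},                 # B, ldb
           Ref{Float64}, Ptr{Float64}, Ref{Int32}),  # beta, C, ldc
          'N', 'N', M, N, K, 1.0, A, M, B, K, 0.0, C, M)
    return C
end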

To run benchmarks yourself:

using BLASBenchmarksCPU
rb = runbench(libs = [:Octavian,:MKL,:blis], sizes = logspace(10, 10_000, 300));
plot(rb, displayplot = false);
3 Likes

I don’t need to run it myself.
I just think that if one compares against another system to show that one has a better solution, one must bring the other system to its optimal state.

Just as you should do in papers: when comparing to others, we either take the optimal parameters or ask the developers to bring their best.

The overhead of MKL is big. I am sure you use @inbounds and remove other checks in the Julia code.
We’d be more than happy to have a neutral package that beats MKL, but we need to beat MKL at its best.
It’s time to have the best BLAS performance in open source form.

2 Likes
julia> using Octavian

julia> Octavian.matmul!(rand(2,3), rand(3,4), rand(3,4))
ERROR: AssertionError: Size mismatch.
Stacktrace:
 [1] matmul_sizes
   @ ~/.julia/dev/Octavian/src/utils.jl:15 [inlined]
 [2] _matmul!
   @ ~/.julia/dev/Octavian/src/matmul.jl:247 [inlined]
 [3] matmul!
   @ ~/.julia/dev/Octavian/src/matmul.jl:236 [inlined]
 [4] matmul!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
   @ Octavian ~/.julia/dev/Octavian/src/matmul.jl:220
 [5] top-level scope
   @ REPL[6]:1

Anyway, a few size checks don’t take 700 nanoseconds. This PR adds gemm_direct support:

julia> using Octavian, BLASBenchmarksCPU, BenchmarkHistograms

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);

julia> @benchmark gemmmkl_direct!($C0,$A,$B)
samples: 10000; evals/sample: 9; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2710.0 - 2830.0]  β–ˆβ–Ž227
 (2830.0 - 2950.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹2099
 (2950.0 - 3070.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 5985
 (3070.0 - 3200.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š1545
 (3200.0 - 3320.0]  β–‹111
 (3320.0 - 3440.0]  ▏6
 (3440.0 - 3560.0]  ▏2
 (3560.0 - 3680.0]  ▏1
 (3680.0 - 3800.0]   0
 (3800.0 - 3920.0]   0
 (3920.0 - 4040.0]  ▏1
 (4040.0 - 4160.0]  ▏1
 (4160.0 - 4280.0]  ▏7
 (4280.0 - 4400.0]  ▏5
 (4400.0 - 8870.0]  ▏10

                  Counts

min: 2.712 ΞΌs (0.00% GC); mean: 3.011 ΞΌs (0.00% GC); median: 3.015 ΞΌs (0.00% GC); max: 8.875 ΞΌs (0.00% GC).

julia> @benchmark gemmmkl!($C0,$A,$B)
samples: 10000; evals/sample: 9; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2660.0 - 2780.0]  β–ˆβ–‹339
 (2780.0 - 2900.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ1171
 (2900.0 - 3020.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 6427
 (3020.0 - 3140.0]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š1865
 (3140.0 - 3260.0]  β–Š155
 (3260.0 - 3380.0]  ▏16
 (3380.0 - 3500.0]  ▏2
 (3500.0 - 3620.0]   0
 (3620.0 - 3740.0]   0
 (3740.0 - 3860.0]   0
 (3860.0 - 3980.0]   0
 (3980.0 - 4100.0]  ▏1
 (4100.0 - 4220.0]  ▏6
 (4220.0 - 4340.0]  ▏8
 (4340.0 - 9160.0]  ▏10

                  Counts

min: 2.658 ΞΌs (0.00% GC); mean: 2.967 ΞΌs (0.00% GC); median: 2.972 ΞΌs (0.00% GC); max: 9.162 ΞΌs (0.00% GC).

julia> @benchmark matmul!($C0,$A,$B)
samples: 10000; evals/sample: 10; memory estimate: 0 bytes; allocs estimate: 0
ns

 (1955.0 - 1997.0 ]  β–ˆβ–ˆβ–381
 (1997.0 - 2039.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ1915
 (2039.0 - 2080.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 5048
 (2080.0 - 2122.0 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž2387
 (2122.0 - 2163.0 ]  β–‹87
 (2163.0 - 2205.0 ]  β–Œ79
 (2205.0 - 2247.0 ]  β–Œ76
 (2247.0 - 2288.0 ]  ▏11
 (2288.0 - 2330.0 ]  ▏2
 (2330.0 - 2371.0 ]  ▏1
 (2371.0 - 2413.0 ]  ▏2
 (2413.0 - 2455.0 ]   0
 (2455.0 - 2496.0 ]   0
 (2496.0 - 2538.0 ]  ▏1
 (2538.0 - 10286.0]  ▏10

                  Counts

min: 1.955 ΞΌs (0.00% GC); mean: 2.064 ΞΌs (0.00% GC); median: 2.064 ΞΌs (0.00% GC); max: 10.286 ΞΌs (0.00% GC).

julia> 2e-9*M*K*N ./ (1.955e-6, 2.658e-6, 2.712e-6) # max GFLOPS
(381.8393861892584, 280.8487584650114, 275.25663716814165)

julia> 2e-9*M*K*N ./ (2.064e-6, 2.967e-6, 3.015e-6) # median GFLOPS
(361.67441860465124, 251.59959555106173, 247.59402985074632)

Octavian isn’t specializing on matrix size, so if you want to use MKL’s JIT, then to be fair we should use statically sized arrays (e.g. StrideArrays.StrideArray) for Octavian.

10 Likes

Do you know how to optimize the MKL run in detail, and what MKL_jll is not doing? That would be super helpful not only for the people developing the alternatives, but also for everyone else using it.

2 Likes

MKL_DIRECT_CALL and the JIT are the two most common things Royi brings up, IIRC.
At very small sizes, the direct call helps:

julia> using BLASBenchmarksCPU, Octavian, BenchmarkHistograms

julia> M = K = N = 8; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.504561 seconds (2.53 M allocations: 133.894 MiB, 6.07% gc time, 99.99% compilation time)

julia> @benchmark matmul!($C0,$A,$B) # Octavian.jl
samples: 10000; evals/sample: 995; memory estimate: 0 bytes; allocs estimate: 0
ns

 (29.2 - 30.3]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 9590
 (30.3 - 31.4]  β–‰272
 (31.4 - 32.5]  ▏12
 (32.5 - 33.5]  ▏2
 (33.5 - 34.6]  ▏2
 (34.6 - 35.7]  ▏3
 (35.7 - 36.8]  ▏1
 (36.8 - 37.8]   0
 (37.8 - 38.9]  ▏2
 (38.9 - 40.0]  ▏37
 (40.0 - 41.1]  ▏1
 (41.1 - 42.1]   0
 (42.1 - 43.2]  ▏21
 (43.2 - 44.3]  β–Ž47
 (44.3 - 71.6]  ▏10

                  Counts

min: 29.228 ns (0.00% GC); mean: 29.660 ns (0.00% GC); median: 29.518 ns (0.00% GC); max: 71.636 ns (0.00% GC).

julia> @benchmark gemmmkl!($C0,$A,$B) # MKL dgemm
samples: 10000; evals/sample: 705; memory estimate: 0 bytes; allocs estimate: 0
ns

 (178.8 - 179.6]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 4448
 (179.6 - 180.5]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰4269
 (180.5 - 181.3]  β–ˆβ–ˆβ–ˆβ–ˆβ–‰719
 (181.3 - 182.2]  β–ˆβ–ˆβ–ˆβ–Œ514
 (182.2 - 183.1]  β–Ž26
 (183.1 - 183.9]  ▏1
 (183.9 - 184.8]   0
 (184.8 - 185.6]  ▏1
 (185.6 - 186.5]   0
 (186.5 - 187.4]  ▏1
 (187.4 - 188.2]  ▏2
 (188.2 - 189.1]  ▏1
 (189.1 - 190.0]  ▏6
 (190.0 - 190.8]  ▏2
 (190.8 - 240.1]  ▏10

                  Counts

min: 178.756 ns (0.00% GC); mean: 179.870 ns (0.00% GC); median: 179.661 ns (0.00% GC); max: 240.072 ns (0.00% GC).

julia> @benchmark gemmmkl_direct!($C0,$A,$B) # MKL dgemm_direct
samples: 10000; evals/sample: 954; memory estimate: 0 bytes; allocs estimate: 0
ns

 (90.4  - 91.2 ]  ▏6
 (91.2  - 92.0 ]  β–ˆβ–ˆβ–ˆβ–Œ934
 (92.0  - 92.7 ]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 8121
 (92.7  - 93.5 ]  β–ˆ263
 (93.5  - 94.2 ]  β–ˆβ–ˆβ–638
 (94.2  - 95.0 ]  ▏21
 (95.0  - 95.7 ]  ▏1
 (95.7  - 96.5 ]   0
 (96.5  - 97.3 ]   0
 (97.3  - 98.0 ]  ▏2
 (98.0  - 98.8 ]   0
 (98.8  - 99.5 ]  ▏3
 (99.5  - 100.3]   0
 (100.3 - 101.1]  ▏1
 (101.1 - 136.1]  ▏10

                  Counts

min: 90.436 ns (0.00% GC); mean: 92.431 ns (0.00% GC); median: 92.367 ns (0.00% GC); max: 136.131 ns (0.00% GC).

The direct call was much faster (90 ns vs 179 ns), but still much slower than Octavian’s matmul! (30 ns).
Things other than the direct call, such as the size-specializing JIT or prepacking, would only be fair if we do the same in Julia.

It should be hard to beat Julia in the JIT department:

julia> using StaticArrays

julia> Am = @MMatrix rand(8,8); Bm = @MMatrix rand(8,8); Cm = similar(Am);

julia> @benchmark matmul!($Cm,$Am,$Bm) # StaticArrays.MMatrix are statically sized
samples: 10000; evals/sample: 999; memory estimate: 0 bytes; allocs estimate: 0
ns

 (12.71 - 12.88]  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–9859
 (12.88 - 13.04]   0
 (13.04 - 13.2 ]   0
 (13.2  - 13.37]   0
 (13.37 - 13.53]   0
 (13.53 - 13.69]   0
 (13.69 - 13.86]  β–Ž63
 (13.86 - 14.02]  β–Ž51
 (14.02 - 14.18]  ▏8
 (14.18 - 14.34]  ▏4
 (14.34 - 14.51]  ▏3
 (14.51 - 14.67]   0
 (14.67 - 14.83]   0
 (14.83 - 15.0 ]  ▏2
 (15.0  - 51.84]  ▏10

                  Counts

min: 12.714 ns (0.00% GC); mean: 12.791 ns (0.00% GC); median: 12.770 ns (0.00% GC); max: 51.842 ns (0.00% GC).
14 Likes

Why not? You’re the one who seems to be assuming without checking that the benchmarks were not run fairly.

5 Likes

@Mason, on the contrary. I think @Elrod’s work is amazing. I will be the first to use this instead of MKL. I don’t like MKL not being open and discriminating against AMD (though it seems they are working on that, but slowly).

My only argument is that if we want to show we beat some other software, we need to beat it at its prime.
What are the reactions on this forum when someone shows Julia being beaten by others and the Julia code isn’t optimized? We say it is not fair.

If we want to beat MKL we need to beat it at its best. It seems @Elrod is doing that. It just needs to be shown.

3 Likes