Intel C/C++ compiler performance versus Julia

I believe that, in theory, a high-level language compiler could generate faster code for an algorithm than a C/C++ compiler if it has more information about the data objects and their usage than the C/C++ compiler is allowed to assume. I don’t know whether Julia has that information or not. I suspect that in certain cases the answer is yes, but more often the answer is probably no.
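As a minimal sketch of what that extra information can look like in Julia (my illustration, not a benchmark): the compiler emits a separate specialization of a generic function for each concrete argument type it sees, so the generated code can assume a specific element type and memory layout.

function mysum(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

mysum(rand(1000))                      # specialization for Vector{Float64}
mysum(rand(Int32(1):Int32(9), 1000))   # a separate specialization for Vector{Int32}
# @code_llvm mysum(rand(1000))         # inspect the type-specialized, vectorized IR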

This question also gets into the “fast enough” question. As programmers we are generally not going for the fastest possible execution of code; we are going for fast enough. If we wanted the absolute fastest, we’d probably be writing in assembler, or compiling down to assembler and then hand-tuning that code.

As an example of that, I was looking at the BLAKE3 reference implementation in Rust. They have a native Rust implementation, a faster implementation that uses SIMD instructions via a C library, and the fastest implementation that uses SIMD instructions in an assembler file. They obviously felt that the C compiler was NOT fast enough.

So the “correct” questions for the general programming population are: does the language let me write code quickly, is that code easy to maintain, and does it run fast enough?

Probably not the answers you were looking for, but I don’t think Julia will ever run “real” code faster than C++ compiled with the Intel compiler.

9 Likes

A side question is if the Intel compiler could be used to build Julia on Windows.

I wonder what Chris Elrod would have to say about this. He seems to be writing stuff in Julia that is beating the performance of everything else. A good understanding of the hardware and of how the code interfaces with the compiler appears to be enough for someone like him to write code that performs at the limits of the processor’s capacity. I read somewhere that part of that performance comes from code generated at runtime for the specific types and sizes of the variables involved, something that is more natural to do in a language like Julia.

9 Likes

In my (very limited) understanding, assembler code is only efficient if it is written explicitly for the concrete hardware / microarchitecture you are running it on. You have to hard-code e.g. the SIMD usage, so if your program has to run on both AVX2 and AVX-512 machines, you cannot make use of the newer AVX-512 features, resulting in suboptimal code on the latter. Of course you could write both versions and choose the optimal one at run time depending on the microarchitecture, but that is a lot of work.
A compiler, in contrast, may optimize C or Julia code for your specific microarchitecture, which can be more efficient in some cases.
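As a rough sketch of that run-time selection idea (the kernel names below are placeholders, not from any real package), you could pick an implementation once, based on the CPU name Julia reports:

# Placeholders standing in for two hand-tuned kernels:
sum_avx2(x)   = sum(x)
sum_avx512(x) = sum(x)

# Sys.CPU_NAME is a string such as "skylake-avx512"; this check is only a crude heuristic.
const BEST_SUM = occursin("avx512", Sys.CPU_NAME) ? sum_avx512 : sum_avx2

best_sum(x) = BEST_SUM(x)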

5 Likes

@paulmelis this is a great topic for discussion. I work in HPC and I agree that there is ‘folk wisdom’ that the Intel compilers buy you performance. That was certainly the case in the past, and probably is true today.
I would love to see that Julia comparison on the latest generation of CPUs as well.

Moving the discussion along a little: in the past on this forum we have seen benchmarking reports with Julia code which in the end turned out to be dominated by the performance of the BLAS library. It depends on the code, of course! I think Intel put a lot of effort into those optimised math libraries, and that is what wins them performance.
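As a practical aside, before drawing conclusions from such a benchmark it is worth checking which BLAS backend the Julia session is actually using; on Julia 1.7+ something like the following should work:

using LinearAlgebra
BLAS.get_config()   # lists the loaded BLAS libraries, e.g. libopenblas64_
# using MKL         # the MKL.jl package swaps in Intel MKL as the backend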

6 Likes

Multithreaded Julia matrix multiplication (specifically, Octavian.jl) “dominates” MKL below 100x100, and hangs on in performance beyond that.

Simply putting LoopVectorization.@tturbo on three for loops manages to do well until we run out of L2 cache. Results will differ by CPU of course. That particular CPU has much more L2 cache than most (18 cores, 1 MiB/core).
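For reference, the “three for loops” version looks roughly like this (a sketch along the lines of the LoopVectorization documentation, not necessarily the exact code behind the plot):

using LoopVectorization

function matmul_tturbo!(C, A, B)
    # @tturbo threads and vectorizes the naive triple loop.
    @tturbo for n in axes(C, 2), m in axes(C, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end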

MKL starts to run away from the competition around 3000x3000 on that computer.
On AMD systems, the small-size performance of the Julia matrix multiply is unmatched:


This was on a Ryzen 4900HS.

The DifferentialEquations ecosystem doesn’t really do any hardware-specific optimizations itself yet, but it (and ModelingToolkit in particular) is another great example of how code generation can be leveraged for better problem solving in Julia.
A long-term goal of mine is to work on an SLP vectorizer it can use, as well as a SPMD compiler like ISPC. DifferentialEquations would be the target for both of these, but they should be usable by interested Julia projects more generally.
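For instance (an illustrative sketch, not the project’s own example; exact API details depend on the ModelingToolkit version), ModelingToolkit builds a symbolic model, simplifies it, and then generates the numerical functions that DifferentialEquations solves:

using ModelingToolkit, OrdinaryDiffEq

@variables t x(t)
D = Differential(t)
@named sys = ODESystem([D(x) ~ -x], t)                     # symbolic model
prob = ODEProblem(structural_simplify(sys), [x => 1.0], (0.0, 10.0))
sol = solve(prob, Tsit5())                                 # runs the generated code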

But for now, I still have a lot of loop work ahead of me (in particular, modeling much more complicated loops so they can be optimized and still produce correct results).

28 Likes

For general compiler optimizations, I don’t think a high-level language differs much from low-level languages, because a lot of these optimizations happen at a rather low level. They must be done at a low level because these kinds of program transformations rely on specific program properties, such as purity and invariants. These properties are much easier to prove at a low level, because a high-level IR is more complicated than a low-level one. A compiler designer working on a high-level IR has to cover more control-flow structures and operators. Also, to use this kind of high-level information, they must be careful to preserve it during code transformation and lowering, which is really hard.
What you said here reminds me that a few months ago I read an article about Rust on Hacker News claiming that Rust can be faster than C++ because ownership can improve the results of alias analysis and lead to better code generation. Then someone who works on these kinds of compiler optimizations disagreed and showed that C++ can actually achieve that even without an ownership system…

2 Likes

icc is not officially supported for building Julia base. It won’t buy you much anyway, since it only affects the Julia runtime, which typically doesn’t handle anything performance-sensitive.

1 Like

Do note that there is a fair amount of Intel activity in contributing to LLVM. Here is Intel’s staging area for such contributions:

Given the recurring complaints about LLVM becoming noticeably slower and slower, it seems any speedup in the LLVM area would be appreciated. So perhaps building Julia and LLVM with Intel ICC could actually have some (but probably small) impact on overall Julia performance.

1 Like

I was thinking more of smaller and fewer binaries than of a difference in performance. I remember a time when Octave was built both with Visual Studio and cross-compiled. At that time the VS binary was 20 MB and the other 200 MB. Of course I’m not expecting a 10x difference, but even if it were only half the size that would already be a big win. Not especially for now, but for when one is able to create stand-alone programs and not have to carry hundreds of megabytes of files even for small compiled programs.

Are you sure you’re using MKL in its optimized form?
Namely, using all of its features to handle smaller matrices and reduce the overhead?

1 Like

I’m just ccalling dgemm through MKL_jll.
PRs to improve it or alternatives are welcome.
Note at those small sizes (<100) MKL is much faster than OpenBLAS and BLIS on the Intel CPU (and performs about the same as those two on AMD).

To run benchmarks yourself:

using BLASBenchmarksCPU
rb = runbench(libs = [:Octavian,:MKL,:blis], sizes = logspace(10, 10_000, 300));
plot(rb, displayplot = false);

3 Likes

I don’t need to run it myself.
I just think that if one compares against another system to show one has a better solution, one must bring the other system to its optimal state.

Just as you should do in papers: when comparing to others, we either take the optimal parameters or ask the developers to bring their best.

The overhead of MKL is big. I am sure you use @inbounds and remove other checks in the Julia code.
We’d be more than happy to have a neutral package that beats MKL, but we need to beat MKL at its best.
It’s time to have the best BLAS performance in open-source form.

2 Likes
julia> using Octavian

julia> Octavian.matmul!(rand(2,3), rand(3,4), rand(3,4))
ERROR: AssertionError: Size mismatch.
Stacktrace:
 [1] matmul_sizes
   @ ~/.julia/dev/Octavian/src/utils.jl:15 [inlined]
 [2] _matmul!
   @ ~/.julia/dev/Octavian/src/matmul.jl:247 [inlined]
 [3] matmul!
   @ ~/.julia/dev/Octavian/src/matmul.jl:236 [inlined]
 [4] matmul!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
   @ Octavian ~/.julia/dev/Octavian/src/matmul.jl:220
 [5] top-level scope
   @ REPL[6]:1

Anyway, a few size checks don’t take 700 nanoseconds. This PR adds gemm_direct support:

julia> using Octavian, BLASBenchmarksCPU, BenchmarkHistograms

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);

julia> @benchmark gemmmkl_direct!($C0,$A,$B)
samples: 10000; evals/sample: 9; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2710.0 - 2830.0]  █▎227
 (2830.0 - 2950.0]  ██████████▋2099
 (2950.0 - 3070.0]  ██████████████████████████████ 5985
 (3070.0 - 3200.0]  ███████▊1545
 (3200.0 - 3320.0]  ▋111
 (3320.0 - 3440.0]  ▏6
 (3440.0 - 3560.0]  ▏2
 (3560.0 - 3680.0]  ▏1
 (3680.0 - 3800.0]   0
 (3800.0 - 3920.0]   0
 (3920.0 - 4040.0]  ▏1
 (4040.0 - 4160.0]  ▏1
 (4160.0 - 4280.0]  ▏7
 (4280.0 - 4400.0]  ▏5
 (4400.0 - 8870.0]  ▏10

                  Counts

min: 2.712 μs (0.00% GC); mean: 3.011 μs (0.00% GC); median: 3.015 μs (0.00% GC); max: 8.875 μs (0.00% GC).

julia> @benchmark gemmmkl!($C0,$A,$B)
samples: 10000; evals/sample: 9; memory estimate: 0 bytes; allocs estimate: 0
ns

 (2660.0 - 2780.0]  █▋339
 (2780.0 - 2900.0]  █████▌1171
 (2900.0 - 3020.0]  ██████████████████████████████ 6427
 (3020.0 - 3140.0]  ████████▊1865
 (3140.0 - 3260.0]  ▊155
 (3260.0 - 3380.0]  ▏16
 (3380.0 - 3500.0]  ▏2
 (3500.0 - 3620.0]   0
 (3620.0 - 3740.0]   0
 (3740.0 - 3860.0]   0
 (3860.0 - 3980.0]   0
 (3980.0 - 4100.0]  ▏1
 (4100.0 - 4220.0]  ▏6
 (4220.0 - 4340.0]  ▏8
 (4340.0 - 9160.0]  ▏10

                  Counts

min: 2.658 μs (0.00% GC); mean: 2.967 μs (0.00% GC); median: 2.972 μs (0.00% GC); max: 9.162 μs (0.00% GC).

julia> @benchmark matmul!($C0,$A,$B)
samples: 10000; evals/sample: 10; memory estimate: 0 bytes; allocs estimate: 0
ns

 (1955.0 - 1997.0 ]  ██▍381
 (1997.0 - 2039.0 ]  ███████████▌1915
 (2039.0 - 2080.0 ]  ██████████████████████████████ 5048
 (2080.0 - 2122.0 ]  ██████████████▎2387
 (2122.0 - 2163.0 ]  ▋87
 (2163.0 - 2205.0 ]  ▌79
 (2205.0 - 2247.0 ]  ▌76
 (2247.0 - 2288.0 ]  ▏11
 (2288.0 - 2330.0 ]  ▏2
 (2330.0 - 2371.0 ]  ▏1
 (2371.0 - 2413.0 ]  ▏2
 (2413.0 - 2455.0 ]   0
 (2455.0 - 2496.0 ]   0
 (2496.0 - 2538.0 ]  ▏1
 (2538.0 - 10286.0]  ▏10

                  Counts

min: 1.955 μs (0.00% GC); mean: 2.064 μs (0.00% GC); median: 2.064 μs (0.00% GC); max: 10.286 μs (0.00% GC).

julia> 2e-9*M*K*N ./ (1.955e-6, 2.658e-6, 2.712e-6) # max GFLOPS
(381.8393861892584, 280.8487584650114, 275.25663716814165)

julia> 2e-9*M*K*N ./ (2.064e-6, 2.967e-6, 3.015e-6) # median GFLOPS
(361.67441860465124, 251.59959555106173, 247.59402985074632)

Octavian isn’t specializing on matrix size, so if you want to use MKL’s JIT, then we should use statically sized arrays from StrideArrays.jl for Octavian.

10 Likes

Do you know how to optimize the MKL run in detail, and what MKL_jll is not doing? That would be super helpful, not only for the people developing the alternatives, but for everyone else using it.

2 Likes

MKL_DIRECT_CALL and the JIT are the two most common things Royi brings up, IIRC.
At very small sizes, the direct call helps:

julia> using BLASBenchmarksCPU, Octavian, BenchmarkHistograms

julia> M = K = N = 8; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.504561 seconds (2.53 M allocations: 133.894 MiB, 6.07% gc time, 99.99% compilation time)

julia> @benchmark matmul!($C0,$A,$B) # Octavian.jl
samples: 10000; evals/sample: 995; memory estimate: 0 bytes; allocs estimate: 0
ns

 (29.2 - 30.3]  ██████████████████████████████ 9590
 (30.3 - 31.4]  ▉272
 (31.4 - 32.5]  ▏12
 (32.5 - 33.5]  ▏2
 (33.5 - 34.6]  ▏2
 (34.6 - 35.7]  ▏3
 (35.7 - 36.8]  ▏1
 (36.8 - 37.8]   0
 (37.8 - 38.9]  ▏2
 (38.9 - 40.0]  ▏37
 (40.0 - 41.1]  ▏1
 (41.1 - 42.1]   0
 (42.1 - 43.2]  ▏21
 (43.2 - 44.3]  ▎47
 (44.3 - 71.6]  ▏10

                  Counts

min: 29.228 ns (0.00% GC); mean: 29.660 ns (0.00% GC); median: 29.518 ns (0.00% GC); max: 71.636 ns (0.00% GC).

julia> @benchmark gemmmkl!($C0,$A,$B) # MKL dgemm
samples: 10000; evals/sample: 705; memory estimate: 0 bytes; allocs estimate: 0
ns

 (178.8 - 179.6]  ██████████████████████████████ 4448
 (179.6 - 180.5]  ████████████████████████████▉4269
 (180.5 - 181.3]  ████▉719
 (181.3 - 182.2]  ███▌514
 (182.2 - 183.1]  ▎26
 (183.1 - 183.9]  ▏1
 (183.9 - 184.8]   0
 (184.8 - 185.6]  ▏1
 (185.6 - 186.5]   0
 (186.5 - 187.4]  ▏1
 (187.4 - 188.2]  ▏2
 (188.2 - 189.1]  ▏1
 (189.1 - 190.0]  ▏6
 (190.0 - 190.8]  ▏2
 (190.8 - 240.1]  ▏10

                  Counts

min: 178.756 ns (0.00% GC); mean: 179.870 ns (0.00% GC); median: 179.661 ns (0.00% GC); max: 240.072 ns (0.00% GC).

julia> @benchmark gemmmkl_direct!($C0,$A,$B) # MKL dgemm_direct
samples: 10000; evals/sample: 954; memory estimate: 0 bytes; allocs estimate: 0
ns

 (90.4  - 91.2 ]  ▏6
 (91.2  - 92.0 ]  ███▌934
 (92.0  - 92.7 ]  ██████████████████████████████ 8121
 (92.7  - 93.5 ]  █263
 (93.5  - 94.2 ]  ██▍638
 (94.2  - 95.0 ]  ▏21
 (95.0  - 95.7 ]  ▏1
 (95.7  - 96.5 ]   0
 (96.5  - 97.3 ]   0
 (97.3  - 98.0 ]  ▏2
 (98.0  - 98.8 ]   0
 (98.8  - 99.5 ]  ▏3
 (99.5  - 100.3]   0
 (100.3 - 101.1]  ▏1
 (101.1 - 136.1]  ▏10

                  Counts

min: 90.436 ns (0.00% GC); mean: 92.431 ns (0.00% GC); median: 92.367 ns (0.00% GC); max: 136.131 ns (0.00% GC).

The direct call was much faster (90 ns vs 179 ns), but still much slower than Octavian’s matmul! (30 ns).
Things other than the direct call, such as the size-specializing JIT or prepacking, would only be fair if we do the same in Julia.

It should be hard to beat Julia in the JIT department:

julia> using StaticArrays

julia> Am = @MMatrix rand(8,8); Bm = @MMatrix rand(8,8); Cm = similar(Am);

julia> @benchmark matmul!($Cm,$Am,$Bm) # StaticArrays.MMatrix are statically sized
samples: 10000; evals/sample: 999; memory estimate: 0 bytes; allocs estimate: 0
ns

 (12.71 - 12.88]  ██████████████████████████████▏9859
 (12.88 - 13.04]   0
 (13.04 - 13.2 ]   0
 (13.2  - 13.37]   0
 (13.37 - 13.53]   0
 (13.53 - 13.69]   0
 (13.69 - 13.86]  ▎63
 (13.86 - 14.02]  ▎51
 (14.02 - 14.18]  ▏8
 (14.18 - 14.34]  ▏4
 (14.34 - 14.51]  ▏3
 (14.51 - 14.67]   0
 (14.67 - 14.83]   0
 (14.83 - 15.0 ]  ▏2
 (15.0  - 51.84]  ▏10

                  Counts

min: 12.714 ns (0.00% GC); mean: 12.791 ns (0.00% GC); median: 12.770 ns (0.00% GC); max: 51.842 ns (0.00% GC).

14 Likes

Why not? You’re the one who seems to be assuming without checking that the benchmarks were not run fairly.

5 Likes

@Mason, on the contrary: I think @Elrod’s work is amazing. I will be the first to use this instead of MKL. I don’t like MKL not being open and discriminating against AMD (though it seems they are working on that, slowly).

My only argument is that if we want to show we beat some other software, we need to beat it at its prime.
What are the reactions on this forum when someone shows Julia being beaten by others and the Julia code isn’t optimized? We say it is not fair.

If we want to beat MKL we need to beat it at its best. It seems @Elrod is doing that; it just needs to be shown.

3 Likes

A new (08/09/2021) blog post:
"Intel C/C++ compilers complete adoption of LLVM"

5 Likes