Building julia with `march=native`

stabbles · June 5, 2022, 11:39am

@ImreSamu: regarding Clear Linux and the Phoronix benchmarks, you have to take them with a grain of salt. For example, they have amazing numbers for Zstd, but the reason is that they set the default number of threads to 4, whereas other distros only use 1; the benchmark is the built-in `zstd -b ...`, so it’s apples and oranges (zstd benchmark misleading · Issue #633 · phoronix-test-suite/phoronix-test-suite · GitHub). Another issue is that Phoronix compares Zstd 1.4 with 1.5, and there are huge performance improvements between those releases, so it’s really not about Clear Linux being faster, but about them shipping a more recent version.

But sure, in some cases they get pretty good speedups, though usually only about 5-10%, thanks to PGO or LTO.

Getting PGO right is tricky, but I think this is the best type of optimization for large and branchy code bases like LLVM. I’ve spent some time getting PGO to work in the Spack package manager for software stacks in general, and for Julia + LLVM the results are really good.

Running `JULIA_LLVM_ARGS=-time-passes julia -O3 ./script.jl`, where `script.jl` pretty much just uses LoopVectorization with a `@turbo`’d inner product, compilation time drops by roughly 25% for the most expensive LLVM passes:

[official binaries, generic, GCC, no PGO]:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3508 ( 18.3%)   0.3359 ( 19.3%)   1.6867 ( 18.5%)   1.6809 ( 18.5%)  X86 DAG->DAG Instruction Selection
   0.6616 (  9.0%)   0.3140 ( 18.0%)   0.9756 ( 10.7%)   0.9732 ( 10.7%)  X86 Assembly Printer
   0.3632 (  4.9%)   0.0501 (  2.9%)   0.4132 (  4.5%)   0.4126 (  4.5%)  Greedy Register Allocator
   0.3423 (  4.6%)   0.0511 (  2.9%)   0.3934 (  4.3%)   0.3904 (  4.3%)  Combine redundant instructions

[spack, generic, clang 14, no PGO]:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.2367 ( 18.8%)   0.1931 ( 19.6%)   1.4299 ( 18.9%)   1.4298 ( 18.9%)  X86 DAG->DAG Instruction Selection
   0.5216 (  7.9%)   0.1528 ( 15.5%)   0.6744 (  8.9%)   0.6744 (  8.9%)  X86 Assembly Printer
   0.3438 (  5.2%)   0.0281 (  2.9%)   0.3719 (  4.9%)   0.3718 (  4.9%)  Greedy Register Allocator
   0.3196 (  4.8%)   0.0333 (  3.4%)   0.3529 (  4.7%)   0.3525 (  4.7%)  Combine redundant instructions
   
[spack, generic, clang 14, PGO]:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.9713 ( 17.9%)   0.1708 ( 19.8%)   1.1421 ( 18.2%)   1.1421 ( 18.2%)  X86 DAG->DAG Instruction Selection
   0.5047 (  9.3%)   0.1283 ( 14.9%)   0.6329 ( 10.1%)   0.6329 ( 10.1%)  X86 Assembly Printer
   0.3070 (  5.7%)   0.0325 (  3.8%)   0.3395 (  5.4%)   0.3395 (  5.4%)  Greedy Register Allocator
   0.2403 (  4.4%)   0.0340 (  3.9%)   0.2743 (  4.4%)   0.2741 (  4.4%)  Post RA top-down list latency scheduler
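
To make the comparison easier to read, here is a small Python sketch that computes the relative drop in the User+System column for the three passes that show up in all three tables. The numbers are copied from the output above; the `timings` and `rows` names and the `reduction` helper are only for illustration. Since this is a single run per build, treat the percentages as rough.

```python
# User+System seconds copied from the three -time-passes tables above,
# in the order: (official binaries, spack no PGO, spack PGO).
timings = {
    "X86 DAG->DAG Instruction Selection": (1.6867, 1.4299, 1.1421),
    "X86 Assembly Printer": (0.9756, 0.6744, 0.6329),
    "Greedy Register Allocator": (0.4132, 0.3719, 0.3395),
}


def reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (1.0 - after / before)


rows = {}
for name, (official, spack, pgo) in timings.items():
    # Compare the PGO build against both non-PGO builds.
    rows[name] = (reduction(official, pgo), reduction(spack, pgo))
    print(f"{name:36s} vs official: {rows[name][0]:5.1f}%   "
          f"vs spack no PGO: {rows[name][1]:5.1f}%")
```

The comparison against the spack build without PGO is the one that isolates the effect of PGO, since the official binaries also differ in compiler (GCC vs clang 14).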
