Building Julia with `march=native`

@ImreSamu: regarding Clear Linux and the Phoronix benchmark, you have to take it with a grain of salt. For example, they have amazing numbers for Zstd, but the reason is that they set the default number of threads to 4, where other distros only use 1; the benchmark is the built-in zstd -b ..., so it’s apples and oranges (zstd benchmark misleading · Issue #633 · phoronix-test-suite/phoronix-test-suite · GitHub). Another issue is that Phoronix compares Zstd 1.4 with 1.5, and there are huge performance improvements between those releases, so it’s really not about Clear Linux being faster, but about them shipping a more recent version.

But sure, in some cases they get pretty good speedups, though usually about 5-10%, thanks to PGO or LTO.

Getting PGO right is tricky, but I think it is the best type of optimization for large and branchy code bases like LLVM. I’ve spent some time getting PGO to work in the Spack package manager for software stacks in general, and for Julia + LLVM the results are really good.

With JULIA_LLVM_ARGS=-time-passes julia -O3 ./script.jl, where script.jl does little more than run a LoopVectorization @turbo’d inner product, compilation time for the most expensive LLVM passes drops by about 25% with PGO:

[official binaries, generic, GCC, no PGO]:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3508 ( 18.3%)   0.3359 ( 19.3%)   1.6867 ( 18.5%)   1.6809 ( 18.5%)  X86 DAG->DAG Instruction Selection
   0.6616 (  9.0%)   0.3140 ( 18.0%)   0.9756 ( 10.7%)   0.9732 ( 10.7%)  X86 Assembly Printer
   0.3632 (  4.9%)   0.0501 (  2.9%)   0.4132 (  4.5%)   0.4126 (  4.5%)  Greedy Register Allocator
   0.3423 (  4.6%)   0.0511 (  2.9%)   0.3934 (  4.3%)   0.3904 (  4.3%)  Combine redundant instructions

[spack, generic, clang 14, no PGO]:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.2367 ( 18.8%)   0.1931 ( 19.6%)   1.4299 ( 18.9%)   1.4298 ( 18.9%)  X86 DAG->DAG Instruction Selection
   0.5216 (  7.9%)   0.1528 ( 15.5%)   0.6744 (  8.9%)   0.6744 (  8.9%)  X86 Assembly Printer
   0.3438 (  5.2%)   0.0281 (  2.9%)   0.3719 (  4.9%)   0.3718 (  4.9%)  Greedy Register Allocator
   0.3196 (  4.8%)   0.0333 (  3.4%)   0.3529 (  4.7%)   0.3525 (  4.7%)  Combine redundant instructions
   
[spack, generic, clang 14, PGO]:
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.9713 ( 17.9%)   0.1708 ( 19.8%)   1.1421 ( 18.2%)   1.1421 ( 18.2%)  X86 DAG->DAG Instruction Selection
   0.5047 (  9.3%)   0.1283 ( 14.9%)   0.6329 ( 10.1%)   0.6329 ( 10.1%)  X86 Assembly Printer
   0.3070 (  5.7%)   0.0325 (  3.8%)   0.3395 (  5.4%)   0.3395 (  5.4%)  Greedy Register Allocator
   0.2403 (  4.4%)   0.0340 (  3.9%)   0.2743 (  4.4%)   0.2741 (  4.4%)  Post RA top-down list latency scheduler
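
To make the comparison between the three builds easier to check, here is a minimal Python sketch that computes the per-pass wall-time reduction from the numbers pasted above. The dictionary keys, `reduction_pct`, and `compare` are hypothetical names chosen for illustration; only the timings come from the tables, and "Combine redundant instructions" is skipped for the PGO build because it does not appear in that table's top four.

```python
# Wall-time seconds copied from the three -time-passes tables above.
OFFICIAL = "official binaries, GCC, no PGO"
NO_PGO = "spack, clang 14, no PGO"
PGO = "spack, clang 14, PGO"

WALL = {
    OFFICIAL: {
        "X86 DAG->DAG Instruction Selection": 1.6809,
        "X86 Assembly Printer": 0.9732,
        "Greedy Register Allocator": 0.4126,
        "Combine redundant instructions": 0.3904,
    },
    NO_PGO: {
        "X86 DAG->DAG Instruction Selection": 1.4298,
        "X86 Assembly Printer": 0.6744,
        "Greedy Register Allocator": 0.3718,
        "Combine redundant instructions": 0.3525,
    },
    PGO: {
        "X86 DAG->DAG Instruction Selection": 1.1421,
        "X86 Assembly Printer": 0.6329,
        "Greedy Register Allocator": 0.3395,
    },
}


def reduction_pct(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


def compare(baseline, candidate):
    """Per-pass wall-time reduction for passes listed in both tables."""
    return {
        name: reduction_pct(seconds, candidate[name])
        for name, seconds in baseline.items()
        if name in candidate
    }


if __name__ == "__main__":
    for label in (OFFICIAL, NO_PGO):
        print(f"PGO build vs [{label}]")
        for name, pct in compare(WALL[label], WALL[PGO]).items():
            print(f"  {name:<40} {pct:5.1f}% less wall time")
```

For instruction selection this gives roughly a 32% reduction against the official binaries and roughly 20% against the non-PGO Spack build, so the size of the gain depends on which baseline you pick. These are single runs, so small differences should not be over-interpreted.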