Compiling Julia using LTO+PGO

D4taV1s · May 9, 2020, 9:10am

To what extent would this affect performance? Also, for Julia, which aspects fall outside what the benchmark framework measures, e.g. not only throughput but also latency (time to first paint, which hurts) and RAM/power usage? And how close is this to being set up in the build?

- This may be of more use for release builds.

PGO requires a profile-generation run (how could that be improved?).
LTO has existed for years (fewer bugs now, see the GCC 6 changelog).

Experiments (Linux): much smaller binary sizes (dead code elimination), but not much difference in performance. Can that be improved?

sys.so: 186.6 MB → 137.5 MB
libjulia.so.1.4: 32.3 MB → 5.6 MB
But some are not affected:
libLLVM-8jl.so: 56.9 MB → 56.9 MB
libopenblas64_.0.3.5.so: 30.6 MB → 30.6 MB
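
To put the sizes above in relative terms, here is a small sketch that turns a before/after pair into a percentage reduction. The helper name `shrink_pct` is made up for illustration; the numbers are the ones reported in this post.

```shell
# shrink_pct BEFORE AFTER: size reduction as a percentage of BEFORE.
# (Hypothetical helper name; inputs are sizes in MB.)
shrink_pct() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

echo "sys.so:          $(shrink_pct 186.6 137.5)% smaller"   # 26.3
echo "libjulia.so.1.4: $(shrink_pct 32.3 5.6)% smaller"      # 82.7
echo "libLLVM-8jl.so:  $(shrink_pct 56.9 56.9)% smaller"     # 0.0
```

So roughly a quarter off sys.so and more than four fifths off libjulia, while the LLVM and OpenBLAS libraries are unchanged.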

Is there a better way than adding -fprofile-generate (and then -fprofile-use) together with -O3 -march=native -flto to the CFLAGS, CXXFLAGS and LDFLAGS environment variables, plus using a modern compiler (GCC 8+)?

Profiled (PGO) builds usually rely on a training run that exercises the code with representative coverage (e.g. the Python build has an option for this).
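
The two-pass flow described above can be sketched roughly as follows. This is only an outline under assumptions: the helper name `pgo_flags` and the workload placeholder are invented, and nothing here establishes that Julia's Makefiles forward these variables to every dependency.

```shell
# pgo_flags PHASE: print the flag set for one phase of a two-pass GCC PGO build.
# The base flags are the ones quoted in the post; the helper name is made up.
pgo_flags() {
  pgo_base="-O3 -march=native -flto"
  case "$1" in
    generate) printf '%s\n' "$pgo_base -fprofile-generate" ;;
    use)      printf '%s\n' "$pgo_base -fprofile-use" ;;
    *)        echo "usage: pgo_flags generate|use" >&2; return 1 ;;
  esac
}

# Pass 1: instrumented build, then run a representative workload so that
# GCC writes its .gcda profile files.
#   F="$(pgo_flags generate)"; CFLAGS="$F" CXXFLAGS="$F" LDFLAGS="$F" make
#   ./representative-workload      # placeholder, not a real script
#
# Pass 2: clean, then rebuild while consuming the recorded profiles.
#   F="$(pgo_flags use)"; CFLAGS="$F" CXXFLAGS="$F" LDFLAGS="$F" make
```

The quality of the result depends on how representative the workload in pass 1 is, which is the hard part.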

(I'm a non-expert, without much background in certain kinds of programming; irrelevant?)

various links:

some performance benchmarks:

Why not / hindrances: LTO seems to make debugging harder, and PGO has to compile twice (if the profile-generation run is unrepresentative, there may be no improvement as a result). So it would help to find ways to improve ease of use, and maybe to enable this only for release builds, with more emphasis on programs that use up a lot of “CPU time” (effort).

But for people not building Julia themselves, a major point I hear about is the time-to-first-paint. Would this have an impact, and how much?

I notice the LLVM parts don’t seem to be impacted (same size); I don’t know enough about the build process to change that. Is this of use? https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md

ImreSamu · November 14, 2021, 12:14pm

I hope there will be an official “optimized” Julia binary for the “x86_64 feature levels”

-O3 -march=x86-64-v3 +LTO +PGO +BOLT
-O3 -march=x86-64-v4 +LTO +PGO +BOLT

The x86_64 feature levels have been merged into LLVM/Clang 12:

x86-64-v3: (close to Haswell) AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE
x86-64-v4: AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
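
To see whether a given machine would be able to run a v3 build, the check could look like the sketch below. It tests only the v3 features named above (which sit on top of the lower levels), the function name is made up, and the flag spellings assume Linux /proc/cpuinfo, where LZCNT shows up as "abm".

```shell
# has_x86_64_v3 "FLAGS": succeed if the space-separated flag list contains
# every v3 feature named above; otherwise report the first missing one.
# Spellings follow Linux /proc/cpuinfo (LZCNT is listed there as "abm").
has_x86_64_v3() {
  for need in avx avx2 bmi1 bmi2 f16c fma abm movbe xsave; do
    case " $1 " in
      *" $need "*) ;;                                  # feature present
      *) echo "missing: $need" >&2; return 1 ;;
    esac
  done
}

# Usage on Linux (reads the flags of the first CPU only):
#   has_x86_64_v3 "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)" && echo "v3 OK"
```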

related julia issue:

https://github.com/JuliaLang/julia/issues/42073

related news:

2021nov04: Facebook’s BOLT Nearing Mainline LLVM For Optimizing Binaries

Comment:

The “Pyston” (~ optimized Python) project started using BOLT optimization
- https://github.com/pyston/pyston/pull/67
  - “Tested that this is about 5.5% faster for kinto_bench_unopt”
- And "all optimizations enabled (LTO+PGO)" with the default make

giordano · November 14, 2021, 2:34pm

Why do you think that’s important?

ImreSamu · November 14, 2021, 4:11pm

A benchmark would be better, so please correct me if I am wrong.

My reasons:

- placebo & marketing
  - in the future, the downloadable -O3 -march=x86-64-v3(4) +LTO +PGO +BOLT image will be a sign of dogfooding optimization.
- slimmer binary size expected (vs. the current “X86: multi-microarchitecture system image”)
  - ideal for the optimized Julia Docker images
- currently the X86 “multi-microarchitecture system image” is
  - "generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)"
    - haswell =~ x86-64-v3: OK (but not visible to first-time users)
    - and no optimized x86-64-v4
- “+LTO +PGO +BOLT” → faster compile time expected for the end users
  - The “BOLT”-ed image is harder …

CON:

- more complexity
- needs more build time (not CI/CD friendly)
- needs more testing and developer resources

IMHO:

It would be an interesting GSoC 2022 (or 2023) project to create a “+LTO +PGO +BOLT”-ed Julia image (as a research project, helping energy-efficient green computing).

stabbles · November 24, 2021, 2:44pm

The performance gain in LLVM is likely negligible. I tried compiling all of Julia’s dependencies with -march=znver2 using GCC 10 and then benchmarked precompile times of LLVM.jl a few times.

With generic binaries for Julia:

11.141739 seconds (1.94 M allocations: 133.629 MiB, 0.29% gc time, 6.16% compilation time)
11.106031 seconds (1.94 M allocations: 133.632 MiB, 0.19% gc time, 6.15% compilation time)
11.183070 seconds (1.94 M allocations: 133.614 MiB, 0.55% gc time, 5.84% compilation time)
11.084295 seconds (1.94 M allocations: 133.610 MiB, 0.55% gc time, 6.12% compilation time)

With -march=znver2:

10.917630 seconds (1.94 M allocations: 133.787 MiB, 0.30% gc time, 5.74% compilation time)
10.977101 seconds (1.94 M allocations: 133.803 MiB, 0.53% gc time, 5.79% compilation time)
11.000003 seconds (1.94 M allocations: 133.807 MiB, 0.38% gc time, 5.73% compilation time)
10.920701 seconds (1.94 M allocations: 133.804 MiB, 0.56% gc time, 6.12% compilation time)

If you want to try it yourself: https://github.com/spack/spack/pull/27280#issue-1047361063.

taskset -c 0 ./spack/opt/spack/linux-sles15-zen2/gcc-10.3.0/julia-1.7.0-rc3-47wy4knrqrzqqga56jeau55epdl5mkvz/bin/julia -e 'using Pkg; @time Pkg.precompile()'

Edit: a slightly more interesting benchmark where some code is compiled and run. The following script:

using LoopVectorization

function f!(z, x, y)
  @avx for i = eachindex(z)
    z[i] = x[i] * y[i]
  end
  z
end

f!(rand(10), rand(10), rand(10))

with LoopVectorization 0.12.98 run as follows:

julia --project -e 'using Pkg; Pkg.instantiate(); @time include("script.jl")'

Generic binaries & sysimage:

13.497693 seconds (18.24 M allocations: 984.107 MiB, 2.38% gc time, 91.57% compilation time)
13.464525 seconds (18.24 M allocations: 984.137 MiB, 2.35% gc time, 91.43% compilation time)
13.513310 seconds (18.24 M allocations: 984.137 MiB, 2.58% gc time, 91.50% compilation time)
13.485646 seconds (18.24 M allocations: 984.135 MiB, 2.41% gc time, 91.38% compilation time)

-march=znver2:

12.997209 seconds (18.26 M allocations: 985.012 MiB, 2.45% gc time, 91.29% compilation time)
13.035375 seconds (18.26 M allocations: 985.014 MiB, 2.42% gc time, 91.15% compilation time)
13.014416 seconds (18.26 M allocations: 985.016 MiB, 2.44% gc time, 91.15% compilation time)
13.042701 seconds (18.26 M allocations: 985.014 MiB, 2.46% gc time, 91.10% compilation time)
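
For a rough summary of the timings above, the following sketch averages the four runs of each configuration and prints the relative difference. The helper names `mean` and `pct_faster` are invented for this example; the inputs are the wall-clock seconds copied from the two benchmarks.

```shell
# mean: average of the numeric arguments, printed to 3 decimal places.
mean() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.3f\n", sum / NR }'
}

# pct_faster BASE NEW: how much smaller NEW is than BASE, in percent.
pct_faster() {
  awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

# Pkg.precompile() benchmark
g1=$(mean 11.141739 11.106031 11.183070 11.084295)
z1=$(mean 10.917630 10.977101 11.000003 10.920701)
echo "precompile: ${g1}s -> ${z1}s ($(pct_faster "$g1" "$z1")% faster)"

# LoopVectorization script benchmark
g2=$(mean 13.497693 13.464525 13.513310 13.485646)
z2=$(mean 12.997209 13.035375 13.014416 13.042701)
echo "script:     ${g2}s -> ${z2}s ($(pct_faster "$g2" "$z2")% faster)"
```

That comes out at about 1.6% for precompilation and about 3.5% for the script, from only four runs each.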
