Compiling Julia using LTO+PGO

To what extent would this affect performance for Julia? And what is left out of the usual benchmark framework, i.e. not only throughput but also latency (time to first paint, which hurts), RAM and power usage? And how close is this to being set up in the official build?

- This may be of more use for release builds.

  • PGO requires a profile-generation run (how could that be made easier?)
  • LTO has existed for years, so it should have fewer bugs (see the GCC 6 changelog)

Experiments (on Linux) show much smaller binary sizes (dead-code elimination), but not much difference in performance. Can that be improved?

  • sys.so: 186.6 MB → 137.5 MB
  • libjulia.so.1.4: 32.3 MB → 5.6 MB

but some libraries are not affected:

  • libLLVM-8jl.so: 56.9 MB → 56.9 MB
  • libopenblas64_.0.3.5.so: 30.6 MB → 30.6 MB

Is there a better way than adding -fprofile-generate (and then -fprofile-use) together with -O3 -march=native -flto to the (C,CXX,LD)FLAGS environment variables, plus a modern compiler (GCC 8+)?

Profile-guided (PGO) builds usually rely on a run that exercises the code with representative coverage (CPython's build, for example, has an --enable-optimizations configure option for this). A rough sketch of the two passes is below.
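A minimal sketch of what those two passes could look like when building Julia from source on Linux with GCC 8+, assuming the build picks the flags up from the environment; the profiling workload, the -fprofile-dir placement, and make cleanall between passes are my assumptions, not a tested recipe:

# Pass 1: instrumented build; keep the profile data outside the build tree so it
# survives the clean between the two passes
export PROFDIR="$PWD/pgo-profiles" && mkdir -p "$PROFDIR"
export CFLAGS="-O3 -march=native -flto -fprofile-generate -fprofile-dir=$PROFDIR"
export CXXFLAGS="$CFLAGS"
export LDFLAGS="-flto -fprofile-generate"
make -j"$(nproc)"

# Profiling run: exercise representative code so the profile has useful coverage
./julia -e 'using Pkg; Pkg.precompile()'

# Pass 2: rebuild using the collected profiles; -fprofile-correction tolerates the
# slightly inconsistent counters that multithreaded profiling runs produce
make cleanall
export CFLAGS="-O3 -march=native -flto -fprofile-use -fprofile-correction -fprofile-dir=$PROFDIR"
export CXXFLAGS="$CFLAGS"
export LDFLAGS="-flto"
make -j"$(nproc)"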

I am not an expert and have not done much of this kind of programming, so parts of this may be off the mark.

various links:

some performance benchmarks:

[image: benchmark results]

Why not / hindrances: LTO seems to make debugging harder, and PGO requires compiling twice (and if the profile-generation run is unrepresentative, there may be little improvement as a result). So the open questions are how to improve ease of use, whether to limit this to release builds, and how to put the emphasis on programs that burn a lot of CPU time.

But for people who do not build Julia themselves, a major pain point I hear about is time-to-first-paint; would this have an impact, and how much?

Notice that the LLVM parts don't seem to be impacted (same size); I don't know enough about the build process to change that. Is this of use? https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md (a rough sketch of applying it is below)
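For context, this is roughly how BOLT is applied to a large shared library in that Clang guide; the library path, the profiling workload, and the flag choices here are assumptions carried over from the guide, not a tested recipe for Julia (and BOLT wants the library to have been linked with -Wl,--emit-relocs):

# Record branch samples while running a representative workload (needs Linux perf with LBR support)
perf record -e cycles:u -j any,u -o perf.data -- ./julia -e 'using Pkg; Pkg.precompile()'

# Convert the perf samples into BOLT's profile format for the library of interest
perf2bolt -p perf.data -o libLLVM.fdata ./usr/lib/libLLVM-8jl.so

# Rewrite the library with a profile-driven code layout
llvm-bolt ./usr/lib/libLLVM-8jl.so -o libLLVM-8jl.so.bolt \
  -data=libLLVM.fdata -reorder-blocks=cache+ -reorder-functions=hfsort+ \
  -split-functions=3 -split-all-cold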

1 Like

I hope there will be an official “optimized” Julia binary for the “x86_64 feature levels”

  • -O3 -march=x86-64-v3 +LTO +PGO +BOLT
  • -O3 -march=x86-64-v4 +LTO +PGO +BOLT

The x86-64 feature levels were merged into LLVM/Clang 12 (a quick way to check which level your CPU supports is sketched after the list):

  • x86-64-v3: (close to Haswell) AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE
  • x86-64-v4: AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
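As an aside, on glibc 2.33+ the dynamic loader can report which of these levels the local machine supports (the loader path below is the usual one on x86-64 Linux but may differ per distro); this is just a convenience check, unrelated to Julia's own tooling:

# Prints each x86-64-v* level and whether it is "(supported, searched)" on this CPU
/lib64/ld-linux-x86-64.so.2 --help | grep -E 'x86-64-v[234]'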

related julia issue:

related news:

Comment:

1 Like

Why do you think that’s important?

1 Like

A benchmark would be better, so please correct me if I am wrong.

My reasons:

  • placebo & marketing
    • in the future, a downloadable -O3 -march=x86-64-v3(4) +LTO +PGO +BOLT image would be a sign of dogfooding optimization
  • slimmer binary size expected (vs. the current “X86: multi-microarchitecture system image”)
    • ideal for optimized Julia Docker images
  • the current X86 “multi-microarchitecture system image” is
    • "generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1)"
      • haswell ≈ x86-64-v3, which is OK (but not visible to first-time users)
      • and there is no optimized x86-64-v4 target (a sketch of targeting the feature levels directly follows this list)
  • “+LTO +PGO +BOLT” → faster compile times expected for end users
    • a “BOLT”-ed image is harder …
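To make that concrete, a hypothetical JULIA_CPU_TARGET covering the feature levels might look like the sketch below. The first form approximates the levels with microarchitecture names Julia's build already knows (rough stand-ins, not exact matches); the second assumes the bundled LLVM (12+) and Julia's CPU-name handling accept the x86-64-v* names directly, which may not be the case without patching:

# Approximate the levels with named microarchitectures (nehalem ≈ v2, haswell ≈ v3, skylake-avx512 ≈ v4)
make -j"$(nproc)" JULIA_CPU_TARGET="generic;nehalem,clone_all;haswell,clone_all;skylake-avx512,clone_all"

# Hypothetical form, if the x86-64-v* names were accepted for the system image targets
make -j"$(nproc)" JULIA_CPU_TARGET="generic;x86-64-v2,clone_all;x86-64-v3,clone_all;x86-64-v4,clone_all"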

CON:

  • more complexity
  • needs more build time (not CI/CD friendly)
  • needs more testing and developer resources

IMHO:

  • creating a “+LTO +PGO +BOLT”-ed Julia image would be an interesting GSoC 2022 (or 2023) project (as a research project, helping energy-efficient green computing)

The performance gain in LLVM is likely negligible. I tried compiling all of Julia's dependencies with -march=znver2 using GCC 10 and then benchmarked precompile times of LLVM.jl a few times.

With generic binaries for Julia:

11.141739 seconds (1.94 M allocations: 133.629 MiB, 0.29% gc time, 6.16% compilation time)
11.106031 seconds (1.94 M allocations: 133.632 MiB, 0.19% gc time, 6.15% compilation time)
11.183070 seconds (1.94 M allocations: 133.614 MiB, 0.55% gc time, 5.84% compilation time)
11.084295 seconds (1.94 M allocations: 133.610 MiB, 0.55% gc time, 6.12% compilation time)

With -march=znver2:

10.917630 seconds (1.94 M allocations: 133.787 MiB, 0.30% gc time, 5.74% compilation time)
10.977101 seconds (1.94 M allocations: 133.803 MiB, 0.53% gc time, 5.79% compilation time)
11.000003 seconds (1.94 M allocations: 133.807 MiB, 0.38% gc time, 5.73% compilation time)
10.920701 seconds (1.94 M allocations: 133.804 MiB, 0.56% gc time, 6.12% compilation time)

If you want to try it yourself: https://github.com/spack/spack/pull/27280#issue-1047361063.

taskset -c 0 ./spack/opt/spack/linux-sles15-zen2/gcc-10.3.0/julia-1.7.0-rc3-47wy4knrqrzqqga56jeau55epdl5mkvz/bin/julia -e 'using Pkg; @time Pkg.precompile()'

Edit: a slightly more interesting benchmark where some code is compiled and run. The following script:

using LoopVectorization

function f!(z, x, y)
  @avx for i = eachindex(z)
    z[i] = x[i] * y[i]
  end
  z
end

f!(rand(10), rand(10), rand(10))

with LoopVectorization 0.12.98 run as follows:

julia --project -e 'using Pkg; Pkg.instantiate(); @time include("script.jl")'

Generic binaries & sysimage:

13.497693 seconds (18.24 M allocations: 984.107 MiB, 2.38% gc time, 91.57% compilation time)
13.464525 seconds (18.24 M allocations: 984.137 MiB, 2.35% gc time, 91.43% compilation time)
13.513310 seconds (18.24 M allocations: 984.137 MiB, 2.58% gc time, 91.50% compilation time)
13.485646 seconds (18.24 M allocations: 984.135 MiB, 2.41% gc time, 91.38% compilation time)

-march=znver2:

12.997209 seconds (18.26 M allocations: 985.012 MiB, 2.45% gc time, 91.29% compilation time)
13.035375 seconds (18.26 M allocations: 985.014 MiB, 2.42% gc time, 91.15% compilation time)
13.014416 seconds (18.26 M allocations: 985.016 MiB, 2.44% gc time, 91.15% compilation time)
13.042701 seconds (18.26 M allocations: 985.014 MiB, 2.46% gc time, 91.10% compilation time)
3 Likes