These devs get it!
The first 10 minutes are about the Go ecosystem and are worth watching for developers in any language.
The rest are a great introduction to optimization issues and PGO. P.S. What is the state of PGO in Julia?
I would love to see a similar demo in Julia if the tools are available.
I’m not aware of any way or any plans for using PGO on Julia code. If you want to apply PGO to the parts of Julia written in C/C++, you can use Add PGO+LTO Makefile by haampie · Pull Request #45641 · JuliaLang/julia · GitHub.
Related thread:
So there’s not much activity in this space, but see Tim Holy’s comment about profile-guided despecialization for a limited example.
Very interesting. What kind of performance gains could we get out of PGO? In the talk they say they get about 10% across the board for one example, but I’m assuming it depends on which optimisations are turned off by default, and also on the particular example.
The statistic about “15% of functions use 99% of CPU time” from the talk is interesting. So basically the idea would be to be more aggressive about inlining, loop unrolling, and constant propagation for those 15% of functions?
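That skew is easy to reproduce with any profiler. Here is a small Python sketch (the workload and function names are made up for illustration) that ranks functions by their share of CPU time, the same measurement a PGO pipeline would use to pick which functions deserve aggressive inlining and unrolling:

```python
import cProfile
import pstats

def hot(n):
    # Dominates runtime: a tight arithmetic loop.
    total = 0
    for i in range(n):
        total += i * i
    return total

def cold(n):
    # Rarely costly helper.
    return sum(range(n))

def workload():
    for _ in range(50):
        hot(100_000)
    cold(100)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (file, line, name) -> (callcount, ncalls, tottime, cumtime, callers)
total_time = sum(entry[2] for entry in stats.stats.values())
# Rank functions by their share of total self-time, mirroring the
# "a small fraction of functions uses most of the CPU time" observation.
ranked = sorted(stats.stats.items(), key=lambda kv: kv[1][2], reverse=True)
top_name = ranked[0][0][2]
top_share = ranked[0][1][2] / total_time
print(top_name, round(top_share, 2))
```

On this toy workload a single function accounts for nearly all the self-time, so an optimizer that spent its budget on just that one function would capture most of the available gain.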
Also I guess this sort of thing might just eventually be handled at the LLVM level?
I like the general idea about simply providing the compiler with runtime information rather than just source code so it can make better choices.
I feel like this also gives Julia a unique advantage over statically compiled languages, because the compilation is already happening during runtime. So in principle all of the information is already there, and you wouldn’t have this somewhat manual procedure of doing (1) compile, (2) profile, (3) re-compile with profiling information – like they talked about in Go. Julia could technically already do this during the normal runtime.
Maybe the Julia compiler could eventually have an option where it does an on-the-fly “recompilation” sometime after the initial compilation based on profiling results thus far.
Perhaps the first compilation would enable profiling; the second time would turn it off and pass the results to LLVM for PGO-enabled recompilation.
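As a toy model of that instrument-then-recompile idea (this is hand-rolled Python, not LLVM’s actual PGO machinery; all names here are invented), the first phase counts which branch each input takes, and the “recompiled” version reorders its checks so the observed hot case is tested first, which is one of the concrete things profile data buys a compiler:

```python
from collections import Counter

branch_counts = Counter()

def classify_instrumented(x):
    # Phase 1: instrumented version, records which branch fires.
    if isinstance(x, str):
        branch_counts["str"] += 1
        return "text"
    if isinstance(x, float):
        branch_counts["float"] += 1
        return "real"
    branch_counts["int"] += 1
    return "integer"

def recompile_with_profile(counts):
    # Phase 2: order the type checks by observed frequency (hot case
    # first), analogous to the branch reordering PGO compilers apply.
    order = [tag for tag, _ in counts.most_common()]
    dispatch = {"str": ("text", str), "float": ("real", float), "int": ("integer", int)}
    def classify_optimized(x):
        for tag in order:
            label, typ = dispatch[tag]
            if isinstance(x, typ):
                return label
        return "integer"
    return classify_optimized

# Profiled run: floats dominate this workload.
for x in [1.0] * 98 + ["a", 1]:
    classify_instrumented(x)

classify = recompile_with_profile(branch_counts)
print(branch_counts.most_common(1)[0][0])
```

The optimized version now checks `float` before `str` and `int`, so the common case pays for one test instead of two, and the counters themselves are gone from the hot path.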
It would work especially well with a tiered JIT, like JavaScript has.
To use PGO, you need to recompile the function. You also don’t want any instrumentation in a function optimized for speed.
So your lower tiers, which are optimized for compile time rather than runtime, do the instrumentation. They already need to record at least call counts.
Then, by the time they’re called often enough to be considered hot and trigger recompilation, they can use that profile data.
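That promotion policy can be sketched with a tiny Python stand-in (`TieredFunction`, `HOT_THRESHOLD`, and the tier bodies are all invented for illustration; a real tiered JIT recompiles machine code, not Python callables):

```python
# Tier 0 is cheap to compile and instrumented (it records call counts);
# once a function crosses the hot threshold it is promoted to a tier-1
# version built using that profile, with the instrumentation removed.

HOT_THRESHOLD = 3

class TieredFunction:
    def __init__(self, tier0, make_tier1):
        self.tier0 = tier0            # cheap-to-compile, instrumented
        self.make_tier1 = make_tier1  # "recompile" using the profile
        self.calls = 0
        self.optimized = None

    def __call__(self, *args):
        if self.optimized is not None:
            return self.optimized(*args)   # hot path: no instrumentation
        self.calls += 1                    # tier 0 records call counts
        result = self.tier0(*args)
        if self.calls >= HOT_THRESHOLD:
            self.optimized = self.make_tier1(self.calls)
        return result

def slow_square(x):
    return sum(x for _ in range(x))  # deliberately naive tier-0 body

def build_fast_square(profile_calls):
    # In a real JIT this is where profile data would steer optimization.
    return lambda x: x * x

square = TieredFunction(slow_square, build_fast_square)
results = [square(4) for _ in range(5)]
print(results, square.calls)
```

After the third call the wrapper stops counting and dispatches straight to the optimized body, which is the property described above: the profile data is gathered for free by the tier that was going to count calls anyway.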
On the other hand, some whole-program optimizations like function reordering based on profile data won’t work as well in a JIT. You’d need to recompile many functions at a time to get much benefit from that.
Some examples of the speedup from PGO:
Julia Compiler - about 10% per Add PGO+LTO Makefile by haampie · Pull Request #45641 · JuliaLang/julia · GitHub
Rust Compiler - about 15% per Utilize PGO for windows x64 rustc dist builds by lqd · Pull Request #96978 · rust-lang/rust (github.com)
Chrome - about 10% per Chromium Blog: Chrome just got faster with Profile Guided Optimization
Also, it looks like Tiered JIT Experiments by pchintalapudi · Pull Request #47484 · JuliaLang/julia (github.com) does something like this: compile code quickly at first, then apply PGO and heavier optimization to functions that are called a lot.
There is danger here of causing “performance flapping” where the system never reaches a steady state, let alone the optimal state. Check out this (imo modern classic) paper:
https://arxiv.org/pdf/1602.00602v3
Also found this video on the same subject for those who prefer a video format:
Not an insurmountable problem, but one does have to be careful. One of the nice things Julia has going for it is that, as a just-ahead-of-time compiler (it compiles like an AoT compiler, but just in time), it has very predictable performance characteristics. VM “JITs”, which actually compile not just in time but after time, have pretty complicated performance behavior that depends on the data they encounter and the environment in which they execute, for better and for worse. The key point of that paper is that it is not all upside; there is a real downside.
Interesting. This is a strong argument in favor of my PGO API proposal from the linked Discourse thread, where the profiling, and thus PGO, is only done when the user requests it.