Yes, like this thread:
Keep in mind that Julia is itself a compiler: the performance of the code it generates shouldn’t, in principle, depend much (if at all) on how Julia itself was built, since by default it compiles natively for the host system in any case. That said, a native build could improve the performance of the runtime (compilation latency, garbage collector, etc.). Depending on the workload you benchmark, the impact varies: sometimes negligible, other times more significant.
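To make this concrete, here is a minimal sketch that prints the CPU detected by Julia and the machine code the JIT emits for a small loop. The function name `scale!` is just an illustration. On a given machine, and with the same `--cpu-target` setting, I would expect this output not to depend on how the `julia` binary was built.

```julia
using InteractiveUtils  # provides code_native; loaded automatically in the REPL

# Illustrative kernel (hypothetical name): multiply a by b element-wise, in place.
function scale!(a, b)
    @inbounds @simd for i in eachindex(a, b)
        a[i] *= b[i]
    end
    return a
end

# CPU model detected on this machine.
println("Host CPU: ", Sys.CPU_NAME)

# Capture the native assembly the JIT generates for Float64 vectors.
asm = sprint(code_native, scale!, Tuple{Vector{Float64}, Vector{Float64}})
println(asm)
```

Comparing `asm` between a stock build and a custom build on the same machine is one way to check that the generated code really is the same.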
This recent pull request:
JuliaLang:master
← haampie:ttfx-improvements
opened 08:42PM - 10 Jun 22 UTC
Adds a convenient way to enable PGO+LTO on Julia and LLVM together:
1. `cd co… ntrib/pgo-lto`
2. `make -j$(nproc) stage1`
3. `make clean-profiles`
4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'`
5. `make -j$(nproc) stage2`
<details>
<summary>Output looks roughly as follows</summary>
```console
$ make -C contrib/pgo-lto top
make: Entering directory '/dev/shm/julia/contrib/pgo-lto'
llvm-profdata show --topn=50 /dev/shm/julia/contrib/pgo-lto/profiles/merged.prof | c++filt
Instrumentation level: IR entry_first = 0
Total functions: 85943
Maximum function count: 7867557260
Maximum internal block count: 3468437590
Top 50 functions with the largest internal block counts:
llvm::BitVector::operator|=(llvm::BitVector const&), max count = 7867557260
LateLowerGCFrame::ComputeLiveness(State&), max count = 3468437590
llvm::hashing::detail::hash_combine_recursive_helper::hash_combine_recursive_helper(), max count = 1742259834
llvm::SUnit::addPred(llvm::SDep const&, bool), max count = 511396575
llvm::LiveRange::overlaps(llvm::LiveRange const&, llvm::CoalescerPair const&, llvm::SlotIndexes const&) const, max count = 508061762
llvm::StringMapImpl::LookupBucketFor(llvm::StringRef), max count = 505682177
std::map<llvm::BasicBlock*, BBState, std::less<llvm::BasicBlock*>, std::allocator<std::pair<llvm::BasicBlock* const, BBState> > >::operator[](llvm::BasicBlock* const&), max count = 395628888
llvm::LiveRange::advanceTo(llvm::LiveRange::Segment const*, llvm::SlotIndex) const, max count = 384642728
llvm::LiveRange::isLiveAtIndexes(llvm::ArrayRef<llvm::SlotIndex>) const, max count = 380291040
llvm::PassRegistry::enumerateWith(llvm::PassRegistrationListener*), max count = 352313953
ijl_method_instance_add_backedge, max count = 349608221
llvm::SUnit::ComputeHeight(), max count = 336604330
llvm::LiveRange::advanceTo(llvm::LiveRange::Segment*, llvm::SlotIndex), max count = 331030109
llvm::SmallPtrSetImplBase::insert_imp(void const*), max count = 272966545
llvm::LiveIntervals::checkRegMaskInterference(llvm::LiveInterval&, llvm::BitVector&), max count = 257449540
LateLowerGCFrame::ComputeLiveSets(State&), max count = 252096274
/dev/shm/julia/src/jltypes.c:has_free_typevars, max count = 230879464
ijl_get_pgcstack, max count = 216953592
LateLowerGCFrame::RefineLiveSet(llvm::BitVector&, State&, std::vector<int, std::allocator<int> > const&), max count = 188013152
/dev/shm/julia/src/flisp/flisp.c:apply_cl, max count = 174863813
/dev/shm/julia/src/flisp/builtins.c:fl_memq, max count = 168621603
```
</details>
This quite often results in spectacular speedups for time to first X, as
it reduces the time spent in LLVM optimization passes by 25 or even 30%.
Example 1:
```julia
using LoopVectorization
function f!(a, b)
@turbo for i in eachindex(a)
a[i] *= b[i]
end
return a
end
f!(rand(1), rand(1))
```
```console
$ time ./julia -O3 lv.jl
```
Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)
Example 2:
```console
$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'
```
Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)
Example 3 (taken from issue #45395, which is almost only LLVM):
```console
$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl
```
Without PGO+LTO:
```
===-------------------------------------------------------------------------===
... Pass execution timing report ...
===-------------------------------------------------------------------------===
Total Execution Time: 101.0130 seconds (98.6253 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
53.6961 ( 54.7%) 0.1050 ( 3.8%) 53.8012 ( 53.3%) 53.8045 ( 54.6%) Unroll loops
25.5423 ( 26.0%) 0.0072 ( 0.3%) 25.5495 ( 25.3%) 25.5444 ( 25.9%) Global Value Numbering
7.1995 ( 7.3%) 0.0526 ( 1.9%) 7.2521 ( 7.2%) 7.2517 ( 7.4%) Induction Variable Simplification
5.0541 ( 5.1%) 0.0098 ( 0.3%) 5.0639 ( 5.0%) 5.0561 ( 5.1%) Combine redundant instructions #2
```
With PGO+LTO:
```
===-------------------------------------------------------------------------===
... Pass execution timing report ...
===-------------------------------------------------------------------------===
Total Execution Time: 72.6507 seconds (70.1337 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
36.0894 ( 51.7%) 0.0825 ( 2.9%) 36.1719 ( 49.8%) 36.1738 ( 51.6%) Unroll loops
16.5713 ( 23.7%) 0.0129 ( 0.5%) 16.5843 ( 22.8%) 16.5794 ( 23.6%) Global Value Numbering
5.9047 ( 8.5%) 0.0395 ( 1.4%) 5.9442 ( 8.2%) 5.9438 ( 8.5%) Induction Variable Simplification
4.7566 ( 6.8%) 0.0078 ( 0.3%) 4.7645 ( 6.6%) 4.7575 ( 6.8%) Combine redundant instructions #2
```
Or -28% time spent in LLVM.
`perf` reports show this is mostly fewer instructions and reduction in icache misses.
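The percentages quoted for the three examples follow directly from the raw timings. A quick check, where the helper `relchange` and the labels are only illustrative names:

```julia
# Relative change in percent between a baseline and a new timing (both in seconds).
relchange(before, after) = 100 * (after - before) / before

# (label, without PGO+LTO, with PGO+LTO), taken from the timings reported above.
timings = [
    ("LoopVectorization script", 14.801, 11.978),
    ("Pkg.test(\"Unitful\")", 60 + 47.688, 60 + 35.704),
    ("LLVM passes, issue #45395", 101.0130, 72.6507),
]

for (name, before, after) in timings
    println(rpad(name, 28), round(relchange(before, after); digits=1), "%")
end
```

This gives roughly -19.1%, -11.1% and -28.1%, matching the rounded figures above.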
---
Finally there's a significant reduction in binary sizes. For libLLVM.so:
```
79M usr/lib/libLLVM-13jl.so (before)
67M usr/lib/libLLVM-13jl.so (after)
```
And it can be reduced by another 2MB with `--icf=safe` when using LLD as
a linker anyways.
- [x] Two out-of-source builds would be better than a single in-source build, so that it's easier to find good profile data
The pull request above adds support for PGO and LTO to the Julia build system. As I said in the thread linked above, there is some improvement in compile latency (which I guess everybody would be happy about) and in Julia’s own runtime, but the code generated by Julia is still the same.
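A rough way to see which part of a benchmark such a build can affect is to time the first and second call of a function separately. This is only a sketch with a toy function (`sumsq` is a made-up name), not a rigorous benchmark:

```julia
# Toy function (hypothetical); anything not yet compiled in this session works.
sumsq(xs) = sum(x -> x^2, xs)

xs = rand(1_000)

# First call: includes type inference, LLVM optimization and native code generation.
t_first = @elapsed sumsq(xs)

# Second call: only runs the machine code that was already compiled.
t_second = @elapsed sumsq(xs)

println("first call:  ", t_first, " s (compilation + execution)")
println("second call: ", t_second, " s (execution only)")
```

A PGO+LTO build of Julia should mostly shrink the first number, since that is where the compiler itself is running. The second number measures the generated code, which I would expect to be unchanged.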