sloede
June 18, 2022, 4:22am
1
Hey everyone! I am trying to collect information about HPC systems (from university compute clusters to supercomputers) where Julia is available as a pre-installed module and, ideally, officially supported. This is part of an effort to gather information on Julia as an HPC programming language and its prevalence in “traditional” HPC environments.
Here you will find a list of systems supporting Julia that we already have on record:
If you are a Julia HPC user, a system operator, or just someone who knows about their university’s/company’s Julia efforts, it would be great if you could contribute your system to the list. Ideally by creating a PR with the information you know (it does not have to be complete), or, alternatively, by replying to this thread. Thanks!
johnh
June 18, 2022, 5:44am
2
I’m very interested in the comments about the negligible performance gains of building from source vs using official binaries. I guess this is worth a thread in itself.
giordano
3
Yes, like this thread:
Keep in mind that Julia is itself a compiler: the performance of the code it generates shouldn’t, in principle, depend much (or at all) on how Julia was compiled, since by default it compiles natively for the target system in any case. That said, a native build could improve the performance of the runtime itself (compilation latency, garbage collector, etc.), but depending on the workload you benchmark this can have varying impact: sometimes negligible, other times more significant.
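As a quick sanity check of that point (a sketch, assuming a `julia` binary is on your `PATH`), you can ask any Julia installation, official binary or source build, which CPU it targets for code generation; on the same machine both should report the same value:

```shell
# Print the CPU microarchitecture Julia detected for native code generation.
# An official binary and a source build on the same host should agree,
# since locally generated code targets the host CPU either way.
julia -e 'println(Sys.CPU_NAME)'

# The codegen target can also be set explicitly with -C/--cpu-target;
# "native" is the default for code Julia generates at runtime.
julia -C native -e 'println(Sys.CPU_NAME)'
```

This only checks the code Julia *generates*; as noted above, the runtime itself (compiler, GC) is where a custom build can differ.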
This recent pull request:
JuliaLang:master ← haampie:ttfx-improvements, opened 10 Jun 22, 08:42 PM UTC
Adds a convenient way to enable PGO+LTO on Julia and LLVM together:
1. `cd contrib/pgo-lto`
2. `make -j$(nproc) stage1`
3. `make clean-profiles`
4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'`
5. `make -j$(nproc) stage2`
<details>
<summary>Output looks roughly as follows</summary>
```c++
$ make -C contrib/pgo-lto top
make: Entering directory '/dev/shm/julia/contrib/pgo-lto'
llvm-profdata show --topn=50 /dev/shm/julia/contrib/pgo-lto/profiles/merged.prof | c++filt
Instrumentation level: IR entry_first = 0
Total functions: 85943
Maximum function count: 7867557260
Maximum internal block count: 3468437590
Top 50 functions with the largest internal block counts:
llvm::BitVector::operator|=(llvm::BitVector const&), max count = 7867557260
LateLowerGCFrame::ComputeLiveness(State&), max count = 3468437590
llvm::hashing::detail::hash_combine_recursive_helper::hash_combine_recursive_helper(), max count = 1742259834
llvm::SUnit::addPred(llvm::SDep const&, bool), max count = 511396575
llvm::LiveRange::overlaps(llvm::LiveRange const&, llvm::CoalescerPair const&, llvm::SlotIndexes const&) const, max count = 508061762
llvm::StringMapImpl::LookupBucketFor(llvm::StringRef), max count = 505682177
std::map<llvm::BasicBlock*, BBState, std::less<llvm::BasicBlock*>, std::allocator<std::pair<llvm::BasicBlock* const, BBState> > >::operator[](llvm::BasicBlock* const&), max count = 395628888
llvm::LiveRange::advanceTo(llvm::LiveRange::Segment const*, llvm::SlotIndex) const, max count = 384642728
llvm::LiveRange::isLiveAtIndexes(llvm::ArrayRef<llvm::SlotIndex>) const, max count = 380291040
llvm::PassRegistry::enumerateWith(llvm::PassRegistrationListener*), max count = 352313953
ijl_method_instance_add_backedge, max count = 349608221
llvm::SUnit::ComputeHeight(), max count = 336604330
llvm::LiveRange::advanceTo(llvm::LiveRange::Segment*, llvm::SlotIndex), max count = 331030109
llvm::SmallPtrSetImplBase::insert_imp(void const*), max count = 272966545
llvm::LiveIntervals::checkRegMaskInterference(llvm::LiveInterval&, llvm::BitVector&), max count = 257449540
LateLowerGCFrame::ComputeLiveSets(State&), max count = 252096274
/dev/shm/julia/src/jltypes.c:has_free_typevars, max count = 230879464
ijl_get_pgcstack, max count = 216953592
LateLowerGCFrame::RefineLiveSet(llvm::BitVector&, State&, std::vector<int, std::allocator<int> > const&), max count = 188013152
/dev/shm/julia/src/flisp/flisp.c:apply_cl, max count = 174863813
/dev/shm/julia/src/flisp/builtins.c:fl_memq, max count = 168621603
```
</details>
This quite often results in spectacular speedups for time to first X, as
it reduces the time spent in LLVM optimization passes by 25% or even 30%.
Example 1:
```julia
using LoopVectorization
function f!(a, b)
    @turbo for i in eachindex(a)
        a[i] *= b[i]
    end
    return a
end
f!(rand(1), rand(1))
```
```console
$ time ./julia -O3 lv.jl
```
Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)
Example 2:
```console
$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'
```
Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)
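The quoted relative improvements can be re-derived from the raw timings above (a quick check with `awk`; 1m47.688s is 107.688 s and 1m35.704s is 95.704 s):

```shell
# Example 1: 14.801 s down to 11.978 s
awk 'BEGIN { printf "-%.0f%%\n", (14.801 - 11.978) / 14.801 * 100 }'
# prints -19%

# Example 2: 107.688 s down to 95.704 s
awk 'BEGIN { printf "-%.0f%%\n", (107.688 - 95.704) / 107.688 * 100 }'
# prints -11%
```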
Example 3 (taken from issue #45395, which is almost only LLVM):
```console
$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl
```
Without PGO+LTO:
```
===-------------------------------------------------------------------------===
... Pass execution timing report ...
===-------------------------------------------------------------------------===
Total Execution Time: 101.0130 seconds (98.6253 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
53.6961 ( 54.7%) 0.1050 ( 3.8%) 53.8012 ( 53.3%) 53.8045 ( 54.6%) Unroll loops
25.5423 ( 26.0%) 0.0072 ( 0.3%) 25.5495 ( 25.3%) 25.5444 ( 25.9%) Global Value Numbering
7.1995 ( 7.3%) 0.0526 ( 1.9%) 7.2521 ( 7.2%) 7.2517 ( 7.4%) Induction Variable Simplification
5.0541 ( 5.1%) 0.0098 ( 0.3%) 5.0639 ( 5.0%) 5.0561 ( 5.1%) Combine redundant instructions #2
```
With PGO+LTO:
```
===-------------------------------------------------------------------------===
... Pass execution timing report ...
===-------------------------------------------------------------------------===
Total Execution Time: 72.6507 seconds (70.1337 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
36.0894 ( 51.7%) 0.0825 ( 2.9%) 36.1719 ( 49.8%) 36.1738 ( 51.6%) Unroll loops
16.5713 ( 23.7%) 0.0129 ( 0.5%) 16.5843 ( 22.8%) 16.5794 ( 23.6%) Global Value Numbering
5.9047 ( 8.5%) 0.0395 ( 1.4%) 5.9442 ( 8.2%) 5.9438 ( 8.5%) Induction Variable Simplification
4.7566 ( 6.8%) 0.0078 ( 0.3%) 4.7645 ( 6.6%) 4.7575 ( 6.8%) Combine redundant instructions #2
```
Or -28% time spent in LLVM.
`perf` reports show this is mostly due to fewer instructions executed and a reduction in icache misses.
---
Finally there's a significant reduction in binary sizes. For libLLVM.so:
```
79M usr/lib/libLLVM-13jl.so (before)
67M usr/lib/libLLVM-13jl.so (after)
```
It can be reduced by another 2 MB with `--icf=safe` when LLD is used as
the linker anyway.
- [x] Two out-of-source builds would be better than a single in-source build, so that it's easier to find good profile data
adds support for building Julia with PGO and LTO to Julia’s build system. As I said in the thread linked above, there is some improvement in compile latency (which I guess everybody would be happy about) and in Julia’s own runtime, but the code generated by Julia is still the same.
sloede
June 19, 2022, 10:24am
4
Thanks for sharing, @giordano! I am watching this PR now, and once it’s merged and part of at least a release candidate, I’ll give it a try in an HPC environment to see what kind of speed difference I can get.
The pull request only adds a Makefile, so I’m not sure it’s worth waiting months until it makes it into a new release: you can copy it and follow the instructions.
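Concretely, that might look like the following (a sketch; the fork URL and branch name are assumptions based on the `haampie:ttfx-improvements` header in the PR embed above, and the profiling workload can be anything representative of your own use):

```shell
# Get a Julia source tree that contains the contrib/pgo-lto Makefile
# by checking out the PR branch (assumed fork location).
git clone https://github.com/JuliaLang/julia.git
cd julia
git remote add haampie https://github.com/haampie/julia.git
git fetch haampie ttfx-improvements
git checkout ttfx-improvements

# Two-stage PGO+LTO build, following the steps from the PR description:
cd contrib/pgo-lto
make -j"$(nproc)" stage1    # instrumented build of Julia + LLVM
make clean-profiles         # start from empty profile data
# Run a representative workload to collect profiles; the PR uses
# LoopVectorization's test suite.
./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'
make -j"$(nproc)" stage2    # optimized rebuild using the collected profiles
```

Alternatively, copying just `contrib/pgo-lto` into a release source tree may work, as suggested above, but that combination is untested here.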