SC21: "Comparing Julia to Performance Portable Parallel Programming Models for HPC"

SC21 paper:

“VI. CONCLUSION
We ported two mini-apps to Julia to show how effective it is at achieving high performance on a range of devices, for both memory-bandwidth-bound and compute-bound applications.
For BabelStream, we observed nearly identical performance to the OpenMP and Kokkos versions on the CPU, with the exception of the Dot kernel on A64FX. As A64FX is a relatively new platform, even LLVM required extra compiler options for optimal performance. We were able to verify that with the latest beta release of Julia, along with a transformation of the reduction kernel to expose the dot-product expression, Julia was able to match LLVM’s performance. For GPUs, Julia’s various GPU packages again performed very close (within ~15%) to first-party frameworks. On AMD platforms, Julia’s AMDGPU.jl package is about 40% slower than the best performing framework; we attribute this to the ROCm stack’s immaturity and AMDGPU.jl’s beta status.
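For illustration, the kind of transformation described above looks roughly like the following sketch (our own, not the paper’s JuliaStream.jl code): writing the reduction as a plain accumulator loop under `@simd` lets LLVM recognise and vectorise the dot-product pattern.

```julia
# Sketch of a BabelStream-style Dot kernel. @simd permits reassociation of
# the floating-point reduction, exposing the dot-product pattern to LLVM.
function dot_kernel(a::Vector{Float32}, b::Vector{Float32})
    s = 0.0f0
    @inbounds @simd for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end

dot_kernel(rand(Float32, 2^20), rand(Float32, 2^20))
```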
For miniBUDE, we get a better view of how well Julia handles floating-point optimisations. In general, x86 CPU platforms performed well, although Julia was not able to emit AVX512 instructions on platforms that support them. On AArch64, we observe difficulties for both LLVM and Julia in achieving a high percentage of the theoretical FP32 performance, with Julia significantly slower than LLVM. We believe compiler backends targeting AArch64 have yet to reach the same level of maturity as those targeting x86 platforms. On GPUs, Julia performed similarly to OpenCL, usually within 25% of the best performing framework. That is to say, Julia is competitive for compute-bound applications on GPUs.
Julia largely follows Python’s batteries-included motto when it comes to productivity. Many of the GPU packages even handle downloading software dependencies and configuring the host system for use with Julia’s GPU support. For example, CUDA.jl retrieves the appropriate CUDA SDK from Nvidia on first launch.
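As a concrete example of that workflow (a sketch of typical usage, not from the paper), installing and loading CUDA.jl is all the setup required; a matching toolkit is fetched automatically:

```julia
using Pkg
Pkg.add("CUDA")       # pure Julia package; no host compiler or manual SDK install

using CUDA            # first use downloads a CUDA toolkit matching the driver
CUDA.versioninfo()    # reports driver, runtime, and toolkit versions in use
```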
In addition, because of Julia’s JIT execution model, along with an ergonomic package system, creating a program that supports multiple accelerators from different vendors is straightforward. Traditionally, mixing frameworks that require different host compilers (e.g. nvcc for CUDA or hipcc for HIP) requires special attention to the overall project design to avoid compilation issues; programmers frequently have to resort to compiler-specific workarounds in the codebase and implement fragile and complex build scripts. In fact, the Kokkos framework was designed specifically to abstract this complexity away with highly sophisticated build scripts. Julia was able to avoid all this.
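To make the contrast concrete, here is a minimal sketch (our own, assuming both packages are installed) of a single source file that targets whichever vendor’s GPU happens to be present, with no build system at all:

```julia
using CUDA, AMDGPU   # each package manages its own vendor stack independently

# Generic kernel body: broadcasting compiles to a native GPU kernel
# for whichever array type it is handed.
saxpy!(y, x, a) = (y .= a .* x .+ y; y)

host = fill(1.0f0, 2^20)
x = CUDA.functional()   ? CuArray(host) :    # Nvidia path
    AMDGPU.functional() ? ROCArray(host) :   # AMD path
    host                                     # CPU fallback
y = fill!(similar(x), 2.0f0)

saxpy!(y, x, 3.0f0)
```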
Currently, except for KA.jl (KernelAbstractions.jl), Julia’s kernel portability is tied to each of the GPU packages. In effect, writing optimised kernels for multiple vendors still requires a manual port. However, as the GPU packages share similar capabilities, the effort required is usually limited to basic API call substitutions. We look forward to seeing KA.jl support more platforms under the JuliaGPU umbrella.
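For reference, a KA.jl kernel is written once and instantiated per backend; the sketch below (written against the current KernelAbstractions.jl API, which has evolved since the paper) shows a BabelStream-style Triad where only the backend object changes between devices:

```julia
using KernelAbstractions

# Portable Triad kernel: one source, any supported backend.
@kernel function triad!(a, @Const(b), @Const(c), scalar)
    i = @index(Global)
    @inbounds a[i] = b[i] + scalar * c[i]
end

backend = CPU()   # e.g. CUDABackend() or ROCBackend() with the vendor package loaded
n = 2^20
a = KernelAbstractions.zeros(backend, Float32, n)
b = KernelAbstractions.ones(backend, Float32, n)
c = KernelAbstractions.ones(backend, Float32, n)

triad!(backend)(a, b, c, 0.4f0; ndrange = n)
KernelAbstractions.synchronize(backend)
```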
Thanks to Julia’s approach of reusing large parts of the LLVM project, Julia programs enjoy performance comparable to native C/C++ solutions. And thanks to the concentrated effort of the open-source communities on improving LLVM, Julia gets the unique opportunity to provide best-in-class performance on some of the latest hardware platforms. In general, we find Julia’s language constructs map closely to the underlying LLVM intermediate representation under ideal conditions with precisely ascribed types; various conventional optimisation techniques and pitfalls from C/C++ still hold.
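This is easy to inspect from the REPL; for instance (a trivial example of ours, not from the paper), the IR for a concretely-typed function is as direct as the C equivalent:

```julia
using InteractiveUtils   # provides @code_llvm (available by default in the REPL)

# With concrete argument types the emitted IR is a couple of float ops;
# abstractly-typed arguments would instead go through boxing and dispatch.
muladd3(x::Float64, y::Float64, z::Float64) = x * y + z
@code_llvm debuginfo=:none muladd3(1.0, 2.0, 3.0)
```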
To this end, Julia offers us a glimpse of what is possible in terms of performance for a managed, dynamically-typed programming language. Given the overall performance-guided design of Julia, the LLVM-backed runtime, and the comparable performance results shown here, we think Julia is a strong competitor in achieving a high level of performance portability for HPC use cases.”


Other links, from the paper:

Artifacts Available: Source code for JuliaStream.jl is currently in the process of being merged into BabelStream. The pull request, along with reviews from members of the Julia community, is available at https://github.com/UoB-HPC/BabelStream/pull/106. Source code for miniBUDE.jl is now part of the miniBUDE benchmark, available at https://github.com/UoB-HPC/miniBUDE. We have created scripts to help make the results in this paper reproducible; the source code can be found at https://github.com/UoB-HPC/performance-portability/tree/2021-benchmarking.


Some comments by @Elrod and me, posted on Slack before this thread existed:

Carsten Bauer 4 hours ago

“It is not clear whether Julia currently implements any mechanism for configuring thread affinity policies; tools like numactl may be able to counteract this as a workaround.”

There’s https://github.com/carstenbauer/ThreadPinning.jl for readily pinning Julia threads to CPU processors, but I guess I didn’t create it soon enough. However, they could have used https://juliaperf.github.io/LIKWID.jl/stable/likwid-pin/. Anyway, if you use “spread” or (as I call it) “scattered” pinning you will get a curve similar to OpenMP; see e.g. https://github.com/JuliaPerf/BandwidthBenchmark.jl.
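With ThreadPinning.jl that is a one-liner (a sketch; the accepted strategy symbols have varied across releases):

```julia
using ThreadPinning

pinthreads(:sockets)   # distribute Julia threads round-robin across sockets
                       # ("scattered"/"spread" pinning, à la OMP_PROC_BIND=spread)
threadinfo()           # visualise the resulting thread-to-core mapping
```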

Chris Elrod 4 hours ago

“For Xeon, we identified that Julia did not emit AVX512 instructions. On OpenMP, and by extension Kokkos, both implementations successfully emitted AVX512 instructions when compiled using the correct set of optimisation flags. We were able to confirm that the lack of AVX512 contributed to the significantly lower performance (>30% difference) by replacing -march=skylake-avx512 -mprefer-vector-width=512 with just -march=skylake on both the OpenMP and Kokkos implementations. With the non-AVX512 versions of OpenMP and Kokkos, Julia showed nearly identical performance.”

FWIW, I always start Julia (on AVX512 platforms) with -C"native,-prefer-256-bit".
We could explicitly document this.
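For instance (our own toy check, not something from the paper), after starting Julia with that flag one can confirm that 512-bit zmm registers appear in the generated code:

```julia
# Start Julia as:  julia -C"native,-prefer-256-bit"
using InteractiveUtils

function axpy!(y::Vector{Float32}, x::Vector{Float32}, a::Float32)
    @inbounds @simd for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    return y
end

# On an AVX512 machine the loop body should use zmm registers.
@code_native debuginfo=:none axpy!(rand(Float32, 512), rand(Float32, 512), 2.0f0)
```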
