SC21 paper:
- Comparing Julia to Performance Portable Parallel Programming Models for HPC [pdf]
- Index Terms—Julia, OpenMP, OpenCL, Kokkos, CUDA, HIP, Performance Portability, Programming Models, GPUs
- context: Presentation • SC21 (15 November 2021)
- ( via: Hacker News: https://news.ycombinator.com/item?id=29257664 )
“VI. CONCLUSION
We ported two mini-apps to Julia to show that it is effective
at achieving high performance on a range of devices for both
memory-bandwidth-bound and compute-bound applications.
For BabelStream, we observed nearly identical performance
to the OpenMP and Kokkos versions on the CPU, with the
exception of the Dot kernel on A64FX. As A64FX is a
relatively new platform, even LLVM required extra compiler
options for optimal performance. We verified that, with the latest
beta release of Julia and a transformation of the reduction kernel
to expose the dot-product expression, Julia matched LLVM's
performance. For GPUs, Julia's various GPU packages again performed
close to the first-party frameworks (within ~15%). On AMD platforms,
Julia's AMDGPU.jl package is about 40% slower than the best-performing
framework; we attribute this to the immaturity of the ROCm stack and
AMDGPU.jl's beta status.
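As an illustrative aside (our sketch, not the paper's actual code): a BabelStream-style Dot kernel in Julia can be written so the reduction is a plain `@simd` loop over `a[i] * b[i]`, which exposes the dot-product pattern to LLVM's vectoriser and lets it use FMA instructions where available.

```julia
# Hedged sketch of a Dot-style reduction; the paper's exact rewrite is not
# reproduced here. The straight @simd loop makes the dot-product expression
# visible to LLVM so it can vectorise the accumulation.
function dot_kernel(a::Vector{T}, b::Vector{T}) where {T<:AbstractFloat}
    s = zero(T)
    @inbounds @simd for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end

dot_kernel(rand(Float32, 1 << 20), rand(Float32, 1 << 20))
```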
For miniBUDE, we get a better view of how well Julia
handles floating point optimisations. In general, x86 CPU
platforms performed well, although Julia was not able to
emit AVX512 instructions on platforms that support them. On
AArch64, we observe that both LLVM and Julia have difficulty
achieving a high percentage of the theoretical FP32 performance,
with Julia significantly slower than LLVM. We believe compiler
backends targeting AArch64 have yet to reach the same level
of maturity as those targeting x86 platforms. On GPUs,
Julia performed similarly to OpenCL, usually within 25% of the
best-performing framework; that is, Julia is competitive for
compute-bound applications on GPUs.
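One way to check this kind of codegen behaviour (an editorial sketch, not the paper's methodology, and not the miniBUDE kernel itself) is to dump the native assembly for a small, concretely typed loop and look for AVX-512 (zmm) registers.

```julia
using InteractiveUtils   # provides @code_native outside the REPL

# Hypothetical helper used only to inspect codegen on the host CPU.
function scale!(x::Vector{Float32}, a::Float32)
    @inbounds @simd for i in eachindex(x)
        x[i] *= a
    end
    return x
end

# On an AVX-512 machine, zmm registers in the output indicate 512-bit vectors;
# seeing only xmm/ymm suggests the wider units are not being used.
@code_native scale!(ones(Float32, 1024), 2.0f0)
```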
Julia largely follows Python's 'batteries included' motto
when it comes to productivity. Many of the GPU packages
even handle downloading software dependencies and configuring
the host system for use with Julia's GPU support. For
example, CUDA.jl retrieves the appropriate CUDA SDK from
Nvidia on first launch.
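For instance (our example, not from the paper), on a machine with an NVIDIA GPU the following is typically all that is needed; by default CUDA.jl fetches a matching CUDA toolkit as a Julia artifact rather than relying on a system-wide installation.

```julia
using CUDA                  # first use may download a CUDA toolkit artifact

CUDA.versioninfo()          # reports driver, runtime and device details
a = CUDA.rand(Float32, 10)  # allocating on the device initialises the runtime
```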
In addition, because of Julia’s JIT execution model, along
with an ergonomic package system, creating a program
that supports multiple accelerators from different vendors is
straightforward. Traditionally, mixing frameworks that require
different host compilers (e.g. nvcc for CUDA or hipcc for
HIP) requires special attention to the overall project design
to avoid compilation issues; programmers frequently have
to resort to compiler-specific workarounds in the codebase
and implement fragile and complex build scripts. In fact, the
Kokkos framework was designed specifically to abstract this
complexity away with highly sophisticated build scripts. Julia
was able to avoid all this.
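A minimal sketch of what this looks like in practice (our illustration, assuming both CUDA.jl and AMDGPU.jl are installed): the same array code runs on whichever vendor's device is available, with no changes to the build system.

```julia
using CUDA, AMDGPU

# Pick a device array type at runtime based on which backend is functional;
# the numerical code below is identical for both vendors and the CPU fallback.
function to_device(x)
    if CUDA.functional()
        return CuArray(x)       # NVIDIA path
    elseif AMDGPU.functional()
        return ROCArray(x)      # AMD path
    else
        return x                # plain CPU arrays
    end
end

a = to_device(rand(Float32, 1_000_000))
b = to_device(rand(Float32, 1_000_000))
c = a .+ 2.0f0 .* b             # same broadcast expression on any backend
```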
Currently, except for KA.jl, Julia's kernel portability is tied to
each of the GPU packages. In effect, writing optimised kernels
for multiple vendors still requires a manual port. However, as
the GPU packages share similar capabilities, the effort
required is usually limited to basic API call substitutions. We
look forward to seeing KA.jl support more platforms under
the JuliaGPU umbrella.
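For reference (an editorial sketch, assuming KA.jl refers to KernelAbstractions.jl and using the style of its current API, which has changed since the paper): a single kernel definition can be launched on a CPU backend or, by substituting the backend object, on a GPU backend from one of the JuliaGPU packages.

```julia
using KernelAbstractions

# Triad-style kernel written once; the target is chosen at launch time by the
# backend object (CPU() here, CUDABackend()/ROCBackend() from the GPU packages).
@kernel function triad!(a, b, c, scalar)
    i = @index(Global)
    @inbounds a[i] = b[i] + scalar * c[i]
end

n = 1024
a = zeros(Float32, n); b = rand(Float32, n); c = rand(Float32, n)

backend = CPU()
triad!(backend)(a, b, c, 2.0f0; ndrange = n)
KernelAbstractions.synchronize(backend)
```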
Thanks to Julia's approach of reusing large parts of the
LLVM project, Julia programs enjoy performance comparable
to native C/C++ solutions. And thanks to the concentrated
effort from the open-source community on improving LLVM,
Julia has the unique opportunity to provide best-in-class
performance on some of the latest hardware platforms. In
general, we find Julia’s language constructs map closely to
the underlying LLVM Intermediate Representation under ideal
conditions with precisely ascribed types; various conventional
optimisation techniques and pitfalls in C/C++ still hold.
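As a concrete (editorial) illustration of this point: for a small, fully typed function, the emitted IR can be inspected directly and closely resembles what clang would produce for the equivalent C loop.

```julia
using InteractiveUtils   # provides @code_llvm outside the REPL

# A SAXPY-style loop with concrete types; @code_llvm shows the IR that Julia
# hands to LLVM, where the usual C/C++-style optimisation concerns
# (bounds checks, aliasing, vectorisation) become visible.
function saxpy!(y::Vector{Float32}, a::Float32, x::Vector{Float32})
    @inbounds @simd for i in eachindex(x, y)
        y[i] = a * x[i] + y[i]
    end
    return y
end

@code_llvm saxpy!(zeros(Float32, 16), 2.0f0, ones(Float32, 16))
```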
In this regard, Julia offers us a glimpse of what is possible
in terms of performance for a managed, dynamically typed
programming language. Given the overall performance-guided
design of Julia, the LLVM-backed runtime, and comparable
performance results shown here, we think Julia is a strong
competitor in achieving a high level of performance portability
for HPC use cases.”