Prioritising GPU Primitives from Vendor-Specialised Libraries

Hi, and thank you for the amazing Julia GPGPU ecosystem you’re building!
I’d like to hear your thoughts on GPU primitives on our platforms.

Julia - with multiple dispatch, homoiconicity and libraries like GPUCompiler and KernelAbstractions - largely solves the portability problem that’s prevalent in the GPU space; it’s amazing not being stuck between a gazillion technologies of varying maturity, vendor lock-in and ecosystems.

One other problem remains, though: performance, especially for general-purpose parallel primitives like scan, reduce, and sort. Vendors like NVIDIA invest massively in extracting every last drop of performance from these, to the point of introducing specialised intrinsics and adding mazes of LEGACY_PTX_ARCH checks and inline assembly for each individual architecture in their libraries. I don’t think it is feasible to compete performance-wise with vendor libraries like Thrust, rocPRIM, and oneDPL, which will always be specialised for each individual architecture, past and future.

If we solved the portability problem not by creating yet another standard, but by embracing - or offloading to - the best existing technologies, could we similarly embrace the best libraries for parallel primitives?

I know CUDA.jl already uses cuBLAS, cuRAND, and cuFFT for array operations - which is extraordinary! - but for solving general-purpose problems beyond algebra and ML (e.g. bounding volume hierarchies, robotics, molecular dynamics), some other building blocks would be necessary. I assume templated C++ libraries are also problematic to interface with, but for primitives I think specialisations for common types like Int*, UInt*, and Float* would be more than enough to take Julia GPU usage beyond AI/ML into general scientific computing.

As an idealised example, I’d love to call reduce and sort! at a high level on GPU vectors with vendor-grade performance on all JuliaGPU platforms, then use KernelAbstractions to write the non-standard bits for each application.
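Concretely, that workflow might look like the following sketch. The array calls use CUDA.jl’s existing CuArray API and the kernel uses KernelAbstractions as they exist today; the hypothetical part is that reduce and sort! would dispatch to vendor-tuned implementations (e.g. CUB/Thrust) rather than the current native-Julia ones:

```julia
using CUDA, KernelAbstractions

# High-level primitives on a GPU vector; ideally these would dispatch
# to vendor-specialised implementations under the hood.
x = CUDA.rand(Float32, 1_000_000)
total = reduce(+, x)   # today: native Julia mapreduce from GPUArrays
sort!(x)               # today: native sorting kernels in CUDA.jl

# The application-specific, non-standard bits stay in portable
# KernelAbstractions code that runs on any JuliaGPU backend.
@kernel function saxpy!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] = a * x[i] + y[i]
end

y = CUDA.zeros(Float32, length(x))
saxpy!(get_backend(y))(y, 2f0, x; ndrange = length(y))
```

The same saxpy! kernel would run unchanged on AMD or Intel backends by passing a different array type; only the primitives would change their backing implementation.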

Best wishes,

From my perspective, vendor libraries are a pain:

They only work on some architectures and not others (typically those with hand-written kernels, or those the developers thought were worth supporting) - and not just GPU architectures: you also need support for the CPU architecture and OS that you want to use.

Interfacing with C++ sucks, and the ABI is not fixed. This makes it hard to reliably use libraries that might have been compiled with a different compiler, or with an older/newer version of the same one (which can happen even when mixing JLLs).

Exception reporting across a foreign-function interface is mostly non-existent. You’re basically stuck with C-style error codes, where the same error code could be returned from multiple completely unrelated code paths.
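The status-code pattern looks roughly like this on the Julia side. This is a hedged illustration, not a real binding: `libvendor`, `vendorScan`, and the status constants are made-up stand-ins:

```julia
# Hypothetical vendor entry point returning a C-style status code.
const STATUS_SUCCESS        = Cint(0)
const STATUS_INVALID_VALUE  = Cint(3)
const STATUS_INTERNAL_ERROR = Cint(7)

function vendor_scan!(out::Vector{Float32}, in::Vector{Float32})
    status = ccall((:vendorScan, "libvendor"), Cint,
                   (Ptr{Cfloat}, Ptr{Cfloat}, Csize_t),
                   out, in, length(in))
    # STATUS_INVALID_VALUE could mean a null pointer, a zero length,
    # an unsupported type, or an exceeded size limit - the code alone
    # does not tell you which path inside the library produced it.
    status == STATUS_SUCCESS || error("vendorScan failed with status $status")
    return out
end
```

Contrast this with a pure-Julia implementation, where a thrown exception carries a message and a backtrace pointing at the actual failing line.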

Vendor libraries are sometimes proprietary, so good luck figuring out why your code is slow or doesn’t produce the right results (or crashes outright). This is slightly better when using AMD or Intel GPUs, as the vendor libraries are (for the most part) open source, but they’re often still complicated beasts written in a language most Julia programmers may not be familiar with.

If you make vendor libraries optional (as one should if one cares about supporting users with all kinds of computing configurations), you’ve now got to handle a bunch of extra details, like ensuring that library handles are correctly allocated and freed at the right times, validating that arguments meet the available data formats (and falling back to a native-Julia implementation when not), debugging issues involving multiple libraries, etc.
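That juggling tends to look something like the following sketch; `has_vendor_lib`, `vendor_sort!`, and the handle functions are hypothetical placeholders for whatever a given wrapper provides:

```julia
# Dispatch to a vendor library only when it is available and the
# element type is one the library actually supports.
const VENDOR_SORT_TYPES = Union{Int32, Int64, Float32, Float64}

function maybe_vendor_sort!(v::AbstractVector{T}) where T
    if has_vendor_lib() && T <: VENDOR_SORT_TYPES
        handle = acquire_handle()      # must be freed at the right time
        try
            vendor_sort!(handle, v)
        finally
            release_handle!(handle)
        end
    else
        sort!(v)                       # fall back to the native Julia path
    end
    return v
end
```

Every such wrapper doubles the code paths to test and debug, which is exactly the extra detail being described above.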

You also have to deal with bug reports caused by a misbehaving library. The maintenance burden cannot be overstated: many maintainers are already stretched to their limits, so let’s not give them a bunch of bug reports to sift through and inevitably close or forward to the vendor (and that’s after a lengthy investigation to discover that the vendor library is to blame).

Basically all of the above is not as much of an issue when writing pure-Julia algorithms, and doing so often has the benefit of providing a vendor-agnostic algorithm that everyone can benefit from. For example, right now I’m working on building various device-side algorithms (state machines, memory allocators, exception reporting machinery) which I plan to make available to other GPU backends to improve the overall JuliaGPU computing experience.

From a community standpoint, building in Julia also means that we can attract more talent and corporate interest, at least once we’re showing competitive performance with the alternatives. No one is going to pay for developing a Julia package when all we do is call into C++; they’ll just pay the C++ library developers instead, or just switch wholesale to C++.

What we need right now are more developers who want to build things in pure Julia. Growing the developer count eventually brings us to a “critical mass” where we have enough talent and development time to stay competitive with the alternatives; we’ve already achieved this for Differential Equations and Mathematical Optimization (among other things), and we’re starting to approach it on the GPU and Machine Learning fronts as well.