Hi, and thank you for the amazing Julia GPGPU ecosystem you’re building!
I’d like to hear your thoughts on GPU primitives on our platforms.
Julia - with multiple dispatch, homoiconicity and libraries like GPUCompiler and KernelAbstractions - largely solves the portability problem that is prevalent in the GPU space; it's amazing not to be stuck choosing between a gazillion technologies of varying maturity, with different degrees of vendor lock-in and fragmented ecosystems.
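For concreteness, this is the kind of write-once-run-anywhere kernel I mean (a minimal KernelAbstractions sketch; the exact launch API may differ slightly between versions):

```julia
using KernelAbstractions

# A portable axpy-style kernel: the same source runs on CPU(), CUDABackend(),
# ROCBackend(), etc., depending on where the arrays live.
@kernel function saxpy!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] = a * x[i] + y[i]
end

function run_saxpy!(y, a, x)
    backend = get_backend(y)               # backend inferred from the array type
    saxpy!(backend)(y, a, x; ndrange = length(y))
    KernelAbstractions.synchronize(backend)
    return y
end
```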
One other problem remains though: performance, especially for general-purpose parallel primitives like scan, reduce, sort, etc. Vendors like NVIDIA invest massively in extracting every last drop of performance from them, to the point of introducing specialised intrinsics and adding mazes of LEGACY_PTX_ARCH checks and inline assembly for each individual architecture in their libraries. I don't think it is feasible to compete performance-wise with vendor libraries like Thrust, rocPRIM and oneDPL, which will always be specialised for every architecture, past and future.
If we solved the portability problem not by creating yet another standard, but by embracing - or offloading to - the best existing technologies, could we similarly embrace the best libraries for parallel primitives?
I know CUDA.jl already uses cuBLAS, cuRAND and cuFFT for array operations - which is extraordinary! - but for solving general-purpose problems beyond linear algebra and ML (e.g. bounding volume hierarchies, robotics, molecular dynamics), some other building blocks are needed. I assume templated C++ libraries are also problematic to call into, but for primitives I think specialisations for common types like Int*, UInt* and Float* would be more than enough to extend Julia GPU usage beyond AI/ML into general scientific computing.
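To make that concrete: array-level operations already go through vendor libraries today, and I imagine primitives could be exposed in a similar spirit through a thin C shim with extern "C" entry points per concrete element type. The second half of the sketch below is purely hypothetical - libprimitives and primitives_sort_f32 are made-up names to illustrate the idea:

```julia
using CUDA

# This already happens today: matrix multiplication on CuArrays
# is dispatched to cuBLAS GEMM under the hood.
A = CUDA.rand(Float32, 2048, 2048)
B = CUDA.rand(Float32, 2048, 2048)
C = A * B

# Hypothetical sketch: a small C shim around a vendor primitive
# (e.g. CUB's DeviceRadixSort), compiled into libprimitives, with one
# extern "C" entry point per supported element type.
function vendor_sort!(x::CuVector{Float32})
    @ccall "libprimitives".primitives_sort_f32(
        pointer(x)::CuPtr{Float32}, length(x)::Csize_t)::Cvoid
    return x
end
```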
As an idealised example, I'd love to call reduce and sort! at a high level on GPU vectors with vendor-library performance on all JuliaGPU platforms, then use KernelAbstractions to write the non-standard bits of each application.
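In code, the dream looks something like the sketch below. As far as I understand, these calls already work today via CUDA.jl's own Julia kernels; the wish is simply that each backend could forward them to CUB/rocPRIM/oneDPL-class implementations, with the application-specific kernels staying in KernelAbstractions as sketched earlier.

```julia
using CUDA                         # or AMDGPU / oneAPI / Metal - same user code

x = CUDA.rand(Float32, 10^7)

# Standard primitives, written at a high level; ideally each backend would
# route these to its vendor library rather than generic fallback kernels.
total   = reduce(+, x)             # reduction
sort!(x)                           # sort
running = accumulate(+, x)         # scan / prefix sum
```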
Best wishes,
Leonard