Hi everyone,
I’m excited to announce the first public release of AcceleratedKernels.jl, a high-performance library of parallel algorithm building blocks for the Julia ecosystem, targeting:
- Multithreaded CPUs, and GPUs via:
  - Intel oneAPI
  - AMD ROCm
  - Apple Metal
  - Nvidia CUDA
- And any future backends added to the JuliaGPU organisation, thanks to the fantastic KernelAbstractions.jl kernel language.
A few highlights:
- Multithreaded, arithmetic-heavy benchmarks show performance on par with, or faster than, C and OpenMP.
- Perhaps surprisingly, there are cases where Julia’s numerical performance is more consistent and predictable than that of conventional C compilers.
- GPU performance on the same order of magnitude as official vendor libraries like Nvidia Thrust, but completely backend-agnostic.
- Exceptional composability with other Julia libraries like MPISort.jl, with which we can do CPU-GPU co-processing (e.g. CPU-GPU co-sorting!) with very good performance.
- We reached 538-855 GB/s sorting throughput on 200 GPUs (comparable with the highest figure reported in the literature, 900 GB/s on 262,144 CPU cores).
- User-friendliness - you can convert normal Julia for-loops into GPU kernels by swapping `for i in eachindex(itr)` with `AK.foreachindex(itr) do i`.
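As an illustrative sketch of this conversion (not the original comparison code, and assuming the package is imported under the alias `AK`), the same loop body can run on either CPU or GPU:

```julia
# CPU version - a plain Julia loop doubling each element in place
function double_cpu!(x)
    for i in eachindex(x)
        x[i] = 2 * x[i]
    end
end

# Backend-agnostic version - the loop body moves into a do-block closure,
# which AK.foreachindex launches as a kernel on whatever backend x lives on
# (CPU threads for an Array, GPU kernels for a device array)
import AcceleratedKernels as AK

function double_ak!(x)
    AK.foreachindex(x) do i
        x[i] = 2 * x[i]
    end
end
```

The same `double_ak!` then works unchanged on, for example, a `CuArray` or `MtlArray`.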
Again, this is only possible because of the unique Julia compilation model, the JuliaGPU organisation’s work on reusable GPU backend infrastructure, and especially the KernelAbstractions.jl backend-agnostic kernel language. Thank you.
Below is an overview of the currently-implemented algorithms, along with some common names in other libraries for ease of finding / understanding / porting code. If you need other algorithms of general use in your work, please open an issue and we may implement them, help you implement them, or integrate existing code into AcceleratedKernels.jl. See API Examples in the GitHub repository for usage.
| Function Family | AcceleratedKernels.jl Functions | Other Common Names |
|---|---|---|
| General Looping | `foreachindex` | `Kokkos::parallel_for` `RAJA::forall` `thrust::transform` |
| Sorting | `merge_sort` `merge_sort!` | `sort` `sort_team` `stable_sort` |
| | `merge_sort_by_key` `merge_sort_by_key!` | `sort_team_by_key` |
| | `merge_sortperm` `merge_sortperm!` | `sort_permutation` `index_permutation` |
| | `merge_sortperm_lowmem` `merge_sortperm_lowmem!` | |
| Reduction | `reduce` | `Kokkos::parallel_reduce` `fold` `aggregate` |
| MapReduce | `mapreduce` | `transform_reduce` `fold` |
| Accumulation | `accumulate` `accumulate!` | `prefix_sum` `thrust::scan` `cumsum` |
| Binary Search | `searchsortedfirst` `searchsortedfirst!` | `std::lower_bound` |
| | `searchsortedlast` `searchsortedlast!` | `thrust::upper_bound` |
| Predicates | `all` `any` | |
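To make the table concrete, here is a hedged sketch of a few of these calls on an ordinary CPU array (the function names come from the table above, but the exact keyword arguments are assumptions - see the API Examples in the repository for authoritative signatures):

```julia
import AcceleratedKernels as AK

v = rand(Float32, 1_000)

# Sorting: in-place stable merge sort
AK.merge_sort!(v)

# MapReduce: transform_reduce-style map + reduction
# (the init keyword is assumed here)
total = AK.mapreduce(abs, +, v; init=0.0f0)

# Accumulation: prefix sum / scan
c = AK.accumulate(+, v; init=0.0f0)

# Predicate: parallel short-circuiting check
found = AK.any(x -> x > 0.9f0, v)
```

The same calls are intended to work with GPU arrays from any of the supported backends.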
But now I need your help - this is the very first development branch of the library, and before registering it as a Julia package, I’d like the community’s opinions on a few things:
- How do you feel about the library interface? I deliberately haven’t exported any functions by default yet, as some have the same names as Julia Base ones, and I don’t know if these methods should become the “defaults” for JuliaGPU arrays upon `using AcceleratedKernels`.
- Should we add CPU alternatives to all algorithms? E.g. `foreachindex` has one, but `any` does not (and it would probably just delegate work to the Julia Base one).
- Should we add a `synchronize::Bool` keyword argument to all functions to avoid the final device synchronisation in case they’re called as part of a longer sequence of kernels?
- There are a few other deeper questions in the Roadmap section that I’d really appreciate your thoughts on.
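For context on the export question above, a sketch of how the non-exported interface reads today with a qualified module name (`AK` is just a local alias chosen here; the device array type depends on your backend):

```julia
import AcceleratedKernels as AK
using CUDA  # or oneAPI / AMDGPU / Metal, depending on your hardware

v = CuArray(rand(Float32, 100_000))

# Qualified names avoid clashing with Base.sort! / Base.any
AK.merge_sort!(v)
found = AK.any(x -> x < 0.01f0, v)
```

Whether these should instead be exported and become the default methods for JuliaGPU arrays is exactly the open question.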
Feedback and contributions are always very welcome.
Best wishes,
Leonard