[ANN] AcceleratedKernels.jl - Cross-architecture parallel algorithms for Julia's GPU backends

Hi everyone,

I’m excited to announce the first public release of AcceleratedKernels.jl, a high-performance library of parallel algorithm building blocks for the Julia ecosystem, targeting:

  • Multithreaded CPUs, and GPUs via:
    • Intel oneAPI
    • AMD ROCm
    • Apple Metal
    • Nvidia CUDA
    • Any future backends added to the JuliaGPU organisation, thanks to the fantastic KernelAbstractions.jl kernel language.

A few highlights:

  • Multithreaded, arithmetic-heavy benchmarks show performance on par with, or faster than, C and OpenMP.
    • Perhaps surprisingly, there are cases where Julia is more consistent / predictable with numerical performance than conventional C compilers.
  • GPU performance on the same order of magnitude as official vendor libraries like Nvidia Thrust, but completely backend-agnostic.
  • Exceptional composability with other Julia libraries like MPISort.jl, with which we can do CPU-GPU co-processing (e.g. CPU-GPU co-sorting!) with very good performance.
    • We reached 538-855 GB/s sorting throughput on 200 GPUs (comparable with the highest figure reported in literature of 900 GB/s on 262,144 CPU cores).
  • User-friendliness - you can convert normal Julia for-loops into GPU kernels by swapping for i in eachindex(itr) with AK.foreachindex(itr) do i. See example below:
CPU code:

# Copy kernel testing throughput
function cpu_copy!(dst, src)
    for i in eachindex(src)
        dst[i] = src[i]
    end
end

GPU code:

import AcceleratedKernels as AK

function gpu_copy!(dst, src)
    AK.foreachindex(src) do i
        dst[i] = src[i]
    end
end
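
For instance, a hypothetical usage sketch (here with CUDA.jl, but any JuliaGPU backend works the same way):

import AcceleratedKernels as AK
using CUDA

src = CUDA.rand(Float32, 1_000_000)   # device array
dst = similar(src)                    # uninitialised device array, same shape and type
gpu_copy!(dst, src)                   # the whole loop body runs as a GPU kernel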

Again, this is only possible because of the unique Julia compilation model, the JuliaGPU organisation's work on reusable GPU backend infrastructure, and especially the KernelAbstractions.jl backend-agnostic kernel language. Thank you.


Below is an overview of the currently implemented algorithms, along with some common names used in other libraries for ease of finding / understanding / porting code. If you need other algorithms in your work that may be of general use, please open an issue and we may implement them, help you implement them, or integrate existing code into AcceleratedKernels.jl. See API Examples in the GitHub repository for usage.

Function Family   | AcceleratedKernels.jl Functions               | Other Common Names
General Looping   | foreachindex                                  | Kokkos::parallel_for, RAJA::forall, thrust::transform
Sorting           | merge_sort, merge_sort!                       | sort, sort_team, stable_sort
                  | merge_sort_by_key, merge_sort_by_key!         | sort_team_by_key
                  | merge_sortperm, merge_sortperm!               | sort_permutation, index_permutation
                  | merge_sortperm_lowmem, merge_sortperm_lowmem! |
Reduction         | reduce                                        | Kokkos::parallel_reduce, fold, aggregate
MapReduce         | mapreduce                                     | transform_reduce, fold
Accumulation      | accumulate, accumulate!                       | prefix_sum, thrust::scan, cumsum
Binary Search     | searchsortedfirst, searchsortedfirst!         | std::lower_bound
                  | searchsortedlast, searchsortedlast!           | thrust::upper_bound
Predicates        | all, any                                      |

But now I need your help - this is the very first development branch of the library, and before registering it as a Julia package, I’d like the community’s opinions on a few things:

  • How do you feel about the library interface? I deliberately haven’t exported any functions by default yet, as some have the same names as Julia Base ones, and I don’t know if these methods should become the “defaults” for JuliaGPU arrays upon using AcceleratedKernels.
  • Should we add CPU alternatives to all algorithms? E.g. foreachindex has one, any does not (and it would probably just delegate work to the Julia Base one).
  • Should we add a synchronize::Bool keyword argument to all functions to avoid the final device synchronisation in case they’re called as part of a longer sequence of kernels?
  • There are a few other deeper questions in the Roadmap section that I’d really appreciate your thoughts on.

Feedback and contributions are always very welcome.
Best wishes,
Leonard

52 Likes

Very cool! I have been thinking about adding something like foreachindex to KernelAbstractions.jl

Should we add a synchronize::Bool keyword argument to all functions to avoid the final device synchronisation in case they’re called as part of a longer sequence of kernels?

I was surprised about the final synchronize call, since backends synchronize on access to the array.

Thank you @vchuravy - I don’t quite get why the kernel execution would be synchronised automatically after launching _foreachindex_global!? If a user were to run:

AK.foreachindex(src) do i
    dst[i] = src[i]
end

Without the call to synchronize, would the results in dst be immediately available after the kernel launch?

Without the call to synchronize, would the results in dst be immediately available after the kernel launch?

Yes and no, we guarantee stream order for many operations.
So if you were to write:

AK.foreachindex(src) do i
    dst[i] = src[i]
end
AK.foreachindex(dst) do i
    src[i] = dst[i]
end

The operations would execute serially with respect to each other.
Similarly, if the user copied dst to the CPU, this would also be a stream-ordered operation. So many operations are either synchronization points or are stream-ordered.
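
For example (a small illustrative sketch - the host copy below is itself stream-ordered, so it sees the kernel's results without an explicit synchronize):

AK.foreachindex(src) do i
    dst[i] = src[i]
end
host_dst = Array(dst)   # stream-ordered device-to-host copy; waits for the kernel above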

By adding a mandatory synchronize to foreachindex you are causing pipeline bubbles.

Explicit synchronize operations are only really needed when handing memory off to a C API that is not stream-ordered (e.g. MPI), or when using multiple tasks, since each task gets its own stream.
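
A rough sketch of that case (assuming a KernelAbstractions.jl backend; the MPI call is only indicative):

using KernelAbstractions: get_backend, synchronize

AK.foreachindex(src) do i
    dst[i] = src[i]
end
synchronize(get_backend(dst))    # make sure the kernel finished before a non-stream-ordered API reads dst
# MPI.Send(dst, comm; dest=1)    # now safe to hand the buffer to MPI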

1 Like

Aha, that is how I imagined it too - so then there are pros and cons to synchronising by default:

  • When kernels are chained together, like in your example, they are implicitly ordered, so it is not beneficial (or even counter-productive) to have that final synchronize.
  • For most users though, when calling a Base-looking function like sort, they’d expect results to be immediately available once it returns; semantics would change silently into async-like calls.
  • Launching kernels on different tasks, calling C APIs, or benchmarking now all have potential for “gotchas” without that synchronize call.

So there are two solutions:

  • Add synchronize::Bool=true, so the functions are not gotchas by default.
  • Not include those final synchronize calls at all, and leave it up to the user to know when to call synchronize(backend) in their own code.

You have more experience in both developing and using such kernels in the wider ecosystem - what do you think?

2 Likes

The only way to get data out of the GPU after calling sort is to fetch it using an implicitly synchronizing operation. I don’t see any value in making operations synchronizing. The only "gotcha"s are when doing advanced things like sending data between tasks or interoperating with external libraries, in which case the user is likely to know about the need to synchronize anyway. And besides that, we’ve fixed even most of those cases in CUDA.jl 5.4: Memory management mayhem ⋅ JuliaGPU (still to be back-ported to other back-ends).

3 Likes

That is the decision we made across the rest of the ecosystem :slight_smile:

2 Likes

That makes sense, thank you @vchuravy and @maleadt - I’ll remove the backend synchronizations.

If you have any other ideas for the library, they are very welcome :slight_smile:

3 Likes

Awesome stuff! This should be very beneficial for the growth of our GPU ecosystem :smile:

IMO, for now, we should keep these as non-exported and not overshadowing Base until they’ve had some time to percolate throughout the ecosystem and be functionally validated under various circumstances (especially as the number of algorithms grows, if that’s the plan).

I would say yes, if just from a testing and uniformity perspective. Being able to test that the same API works the same way (and with the same results) on CPU and GPU is, I think, one of KA’s strengths, and so I think it makes sense for AK to follow that.

I’m with Valentin and Tim on this - I know it feels like a footgun, but that is just the perils of performant async programming. I think longer-term, the language could benefit from constructs/tools that make it easier to reason about and constrain the inherent asynchronicity of arbitrary code, but I don’t think that’s something we can solve today in AK.

Automated optimisation of e.g. block_size for a given input; can be made algorithm-agnostic.

I would love if this could be left as a tunable (but potentially with known-good defaults). I think there’s a lot of unexplored possibilities around automatic tuning of things like block sizes, so it would be great if I could configure those for each API.


And a question for you: At the JuliaLab, we’re developing pure-Julia linear algebra operations, many of which we’re planning to implement with KA for GPU support. Would you be open to having these operations be contributed to AK, or do you think it would be better if they were put into a separate package? They would depend on at least LinearAlgebra.jl, so I can understand not wanting to add a direct dep - alternatively, they could be an extension for when LinearAlgebra is loaded.

I won’t be offended if you don’t want to include linalg algorithms - just let me know your thoughts :smile:

5 Likes

Thank you very much for taking the time to go through the questions (and for the efforts in AMDGPU.jl so it “just works”), @jpsamaroo!

It absolutely is!

A few notes on that - at some point we do need different arguments for the CPU (e.g. num_threads) and GPU (block_size) functions; I prefer to have them at the top level, so they can be optimised for a given problem / data / hardware. The functions are uniform in the base algorithm arguments, but not in the implementation settings. Another thing I am keen on with AK is exposing the temporary arrays used, which I can do for the GPU-implemented algorithms, but not for the ones that would forward calls to Julia Base. Also, high performance really is one of the most important targets for AK, so until now I have always implemented GPU algorithms in KA and CPU ones separately, in pure Julia (it really doesn’t get much better than that…) - the best tool for the job and all.

As an example, take:

foreachindex(f, itr, [backend::GPU]; block_size::Int=256)
foreachindex(f, itr, [backend::CPU]; scheduler=:polyester, max_tasks=Threads.nthreads(), min_elems=1)

If used in a simulation library that targets both CPUs and GPUs, the library will need to expose different settings to tune performance; uniformity is a difficult problem… an alternative I was experimenting with was having the settings for both the CPU and GPU cases in a single struct, like:

Base.@kwdef struct ForEachIndexSettings
    block_size::Int=256
    scheduler::Symbol=:polyester
    max_tasks::Int=Threads.nthreads()
    min_elems::Int=1
end

But then the number of docs users would have to look at explodes, which takes away from simplicity. For now I left that up to library implementors, but they need two calls depending on the backend. If anyone has design ideas here, I’m all ears.
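
For illustration, a sketch of roughly what that two-call pattern looks like for a downstream library today (library_step! and the branch on the backend type are mine, built from the signatures above):

import AcceleratedKernels as AK
using KernelAbstractions: get_backend, GPU

function library_step!(dst, src)
    backend = get_backend(src)
    if backend isa GPU
        # GPU implementation settings exposed to the caller
        AK.foreachindex(src, backend; block_size=256) do i
            dst[i] = src[i]
        end
    else
        # CPU implementation settings exposed to the caller
        AK.foreachindex(src, backend; scheduler=:polyester, max_tasks=Threads.nthreads()) do i
            dst[i] = src[i]
        end
    end
end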

It would always be a tunable parameter, but I was thinking of having utilities automatically optimising e.g. the block_size with an interface like:

AK.@optimize reduce(f, src, init=init, block_size=$block_size) block_size=(64, 128, 256, 512, 1024)

Which would try these combinations and return the best block_size; or maybe even make it general like:

AK.@optimize begin
    reduce(f, src, init=init, block_size=$block_size, switch_below=$switch_below)
    block_size=(64, 128, 256, 512, 1024)
    switch_below=(1, 10, 100, 1000, 10000)
end
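
Until such a macro exists, a manual sweep does the same thing - a rough sketch, assuming f, src and init are already defined, and adding an explicit device synchronisation so the timing is meaningful:

using BenchmarkTools
using KernelAbstractions: get_backend, synchronize
import AcceleratedKernels as AK

# Wrap the call so the measurement includes the device finishing its work
function timed_reduce(f, src; init, block_size)
    r = AK.reduce(f, src; init=init, block_size=block_size)
    synchronize(get_backend(src))
    r
end

for block_size in (64, 128, 256, 512, 1024)
    t = @belapsed timed_reduce($f, $src; init=$init, block_size=$block_size)
    println("block_size = $block_size => $(round(t * 1e3, digits=3)) ms")
end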

Absolutely! (Quite the honour, actually)
In my mind the more unified the efforts are, the better. I don’t mind adding a) well-maintained, big dependencies and/or b) simple, small packages. If anything, I think everyone would benefit from operations being close to (and closely tested with) their parallel GPU building blocks.
Also worth mentioning that I’d be open to transferring the repository to an organisation if that helps with contributions, maintenance, adoption and testing in downstream codes.

5 Likes

Can’t we rely on KA’s CPU back-end for this, rather than having to have separate CPU codepaths in AK? Admittedly the current one has issues, but I’m hoping that the new POCL back-end could help here.

4 Likes

I think this is a more nuanced point - do you expect algorithms written with GPUs in mind to be converted verbatim into optimal / performant CPU code? I was experimenting with the idea for a parallel algorithms standard library back in 2022 (named KALib or Caleb :grinning:), but it turned out that while loops were difficult to transpile - and when they did transpile, they incurred overhead in coercing CPU semantics to behave like GPU ones; do you think PoCL would not suffer from that?

But more generally, I’m not sure an algorithm like GPU merge sort could compete with the layers of optimisation in the Julia Base sort, which are cheap on a single-threaded CPU but not once GPU kernel launch costs are involved; conversely, scheduling more GPU threads is a negligible cost, while scheduling CPU threads is heavy (I don’t mean to be ‘teaching grandmother to suck eggs’ here - these are my thoughts, and I’d be happy to be counter-argued on them).

And I don’t think that’s bad - it is not a big effort to implement a few specific algorithms - like sort, mapreduce, accumulate, foreachindex - separately for CPU and GPU to get the best performance in both cases (it would have been annoying to do it separately for CUDA, ROCm, oneAPI, Metal, then wrestle with platform compatibility - but you all really solved that problem, which I’m grateful for).

And I think this only needs to be done once - besides these algorithms in AK, I’d argue that 95% of the code users want to accelerate would be written using these building blocks, with most probably just using foreachindex. For the remaining 5% of ultra-specific algorithms, again, one pure Julia implementation and one KernelAbstractions.jl implementation is not too bad (I did that for BVH ray-tracing / contact detection; as KA inlines normal Julia functions, much of the code can be shared between the two implementations anyway). This is the approach I found to give the best performance, at the cost of separate CPU and GPU codepaths - which I think is okay in a foundational library like AcceleratedKernels.jl.
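
To illustrate the code-sharing point (a sketch with made-up names, not code from AK): the per-element work is a plain Julia function, and both the CPU loop and the KernelAbstractions.jl kernel simply call it:

using KernelAbstractions

# Shared per-element work - plain Julia, inlined into both paths
@inline saxpy_elem!(y, x, a, i) = (y[i] = a * x[i] + y[i])

# CPU path: a normal multithreaded loop
function saxpy_cpu!(y, x, a)
    Threads.@threads for i in eachindex(x)
        saxpy_elem!(y, x, a, i)
    end
end

# GPU path: a KernelAbstractions.jl kernel calling the same inner function
@kernel function saxpy_kernel!(y, @Const(x), a)
    i = @index(Global)
    saxpy_elem!(y, x, a, i)
end

function saxpy_gpu!(y, x, a; block_size=256)
    saxpy_kernel!(get_backend(x), block_size)(y, x, a, ndrange=length(x))
end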

1 Like

Not necessarily, although I hope PoCL gets us close enough (which is to be tested). I did however expect AcceleratedKernels.jl, which both in name and according to the README is strongly related to KernelAbstractions.jl, to provide KA.jl-based kernels and not also include Polyester.jl as a dependency for a CPU fallback. Maybe that’s just me, though.

2 Likes

The KA.jl kernels are the main point of the library, but to make them work uniformly over CPUs too, at least for now, separate CPU implementations are also included (also, foreachindex on CPU has both a Polyester.jl scheduler and a base Threads.@spawn one that I wrote; performance is similar, with one doing better than the other depending on the workload, but the latter has better composability and debugging).

So yes, we effectively have two codepaths / backends: a base Julia/Threads one, and a KernelAbstractions.jl one (people seeing this post off Google, see my previous answer for reasoning); I’ll make this clearer in the README. The KA one should work directly with a PoCL backend, and hopefully AK algorithms can be a good benchmark for its performance against the pure CPU Julia ones.

How the settings can be exposed uniformly across the two codepaths is still a design question - though recently I’m leaning towards exposing both sets of settings, which will be used selectively based on the input array’s backend:

foreachindex(
    f, itr, backend=get_backend(itr);

    # KernelAbstractions.jl / GPU settings
    block_size::Int=256,

    # Julia Base / CPU settings
    scheduler=:threads,
    max_tasks=Threads.nthreads(),
    min_elems=1,
)
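
With that signature, the same call would work for either backend and only the relevant settings would be read - for example (a sketch; d_src / d_dst stand for device arrays):

# CPU array: the scheduler / max_tasks / min_elems settings apply
AK.foreachindex(src; max_tasks=4) do i
    dst[i] = src[i]
end

# GPU array: the block_size setting applies
AK.foreachindex(d_src; block_size=128) do i
    d_dst[i] = d_src[i]
end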

I think we should also mention ThreadsX.jl (GitHub - tkf/ThreadsX.jl: Parallelized Base functions), Takafumi Arakaki’s wonderful multithreaded parallelization of most of the abstractions mentioned above. It should be an inspiring and even tough challenge to compare against.

I tried the AK sort_benchmark.jl on an Apple M2 Pro and ThreadsX.sort! is much faster than AK with the Metal backend.

Thank you for trying it out! ThreadsX.jl is a lovely library; I have used it in the past. I just compared sorting times against it:

# File: sort_benchmark.jl
import ThreadsX
import AcceleratedKernels as AK
using BenchmarkTools

using Random
Random.seed!(0)

# Metal backend; swap with CUDA/CuArray, AMDGPU/ROCArray, oneAPI/oneArray
using Metal
const DeviceArray = MtlArray


# Benchmark settings
const DTYPE = Int64
const N = 10_000_000


function aksort!(v, temp)
    # Separate function to add GPU synchronization
    AK.sort!(v, temp=temp, block_size=256)
    synchronize()
end


println("Julia Base CPU Sort:")
display(@benchmark sort!(v) setup=(v = rand(DTYPE, N)))

println("$DeviceArray AcceleratedKernels GPU Sort:")
temp = DeviceArray(Vector{DTYPE}(undef, N))
display(@benchmark aksort!(v, $temp) setup=(v = DeviceArray(rand(DTYPE, N))))

println("ThreadsX CPU Sort:")
display(@benchmark ThreadsX.sort!(v) setup=(v = rand(DTYPE, N)))

On my Mac M3 Max with 10 “performance cores”, when running with julia --project=. --threads=10 sort_benchmark.jl I get:

  Activating project at `~/Prog/Julia/Packages/SortTestThreadsX`

Julia Base CPU Sort:
BenchmarkTools.Trial: 54 samples with 1 evaluation.
 Range (min … max):  82.757 ms … 104.819 ms  ┊ GC (min … max): 0.00% … 13.88%
 Time  (median):     84.927 ms               ┊ GC (median):    1.90%
 Time  (mean ± σ):   86.504 ms ±   3.949 ms  ┊ GC (mean ± σ):  2.24% ±  2.12%

        ▄█▄                                                     
  ▅▃▁▁▁▅███▅▃▁▆▃▃▅▃▁▁▁▃▁▃▃▁▃▃▁▁▁▁▁▃▁▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃ ▁
  82.8 ms         Histogram: frequency by time         98.3 ms <

 Memory estimate: 76.30 MiB, allocs estimate: 3.

MtlArray AcceleratedKernels GPU Sort:
BenchmarkTools.Trial: 81 samples with 1 evaluation.
 Range (min … max):  42.047 ms …  43.518 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     42.269 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.359 ms ± 285.598 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▁█▁▃       ▁  ▃      ▁                                    
  ▄▇▄█████▇▇▇▆▇▄▄█▆▆█▆▆▄▁▁▄█▄▁▄▁▄▁▁▁▁▁▁▁▁▁▄▁▁▄▁▄▁▁▆▁▁▄▁▄▁▁▁▁▁▄ ▁
  42 ms           Histogram: frequency by time         43.2 ms <

 Memory estimate: 103.91 KiB, allocs estimate: 4012.

ThreadsX CPU Sort:
BenchmarkTools.Trial: 103 samples with 1 evaluation.
 Range (min … max):  32.894 ms … 63.143 ms  ┊ GC (min … max):  0.00% … 47.40%
 Time  (median):     40.175 ms              ┊ GC (median):    16.79%
 Time  (mean ± σ):   40.378 ms ±  3.610 ms  ┊ GC (mean ± σ):  16.35% ±  5.09%

                █▃                                             
  ▄▁▁▂▁▁▁▁▁▁▁▁▂████▄▄▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  32.9 ms         Histogram: frequency by time        61.8 ms <

 Memory estimate: 187.33 MiB, allocs estimate: 183908.
1 Like

Naturally, it depends on the amount of data and data types used; for example, with N = 100_000_000, I get:

Julia Base CPU Sort:
Time  (median):     915.282 ms
Memory estimate: 762.95 MiB, allocs estimate: 3

MtlArray AcceleratedKernels GPU Sort:
Time  (median):     505.196 ms
Memory estimate: 122.38 KiB, allocs estimate: 4705

ThreadsX CPU Sort:
Time  (median):     557.266 ms
Memory estimate: 1.87 GiB, allocs estimate: 2206742.

Sorting is one of the more difficult problems to parallelise well on GPUs; still, I think we’re doing okay…

1 Like