[ANN] AcceleratedKernels.jl - Cross-architecture parallel algorithms for Julia's GPU backends

Thank you for trying it out! ThreadsX.jl is a lovely library; I have used it in the past. I just compared sorting times against it:

# File: sort_benchmark.jl
import ThreadsX
import AcceleratedKernels as AK
using BenchmarkTools

using Random
Random.seed!(0)

# Metal backend; swap with CUDA/CuArray, AMDGPU/ROCArray, oneAPI/oneArray
using Metal
const DeviceArray = MtlArray


# Benchmark settings
const DTYPE = Int64
const N = 10_000_000


function aksort!(v, temp)
    # Separate function so the GPU is synchronized before returning;
    # otherwise the benchmark would not wait for the sort to finish
    AK.sort!(v, temp=temp, block_size=256)
    synchronize()
end


println("Julia Base CPU Sort:")
display(@benchmark sort!(v) setup=(v = rand(DTYPE, N)))

println("$DeviceArray AcceleratedKernels GPU Sort:")
temp = DeviceArray(Vector{DTYPE}(undef, N))
display(@benchmark aksort!(v, $temp) setup=(v = DeviceArray(rand(DTYPE, N))))

println("ThreadsX CPU Sort:")
display(@benchmark ThreadsX.sort!(v) setup=(v = rand(DTYPE, N)))

On my Mac M3 Max with 10 “performance cores”, running julia --project=. --threads=10 sort_benchmark.jl gives:

  Activating project at `~/Prog/Julia/Packages/SortTestThreadsX`

Julia Base CPU Sort:
BenchmarkTools.Trial: 54 samples with 1 evaluation.
 Range (min … max):  82.757 ms … 104.819 ms  ┊ GC (min … max): 0.00% … 13.88%
 Time  (median):     84.927 ms               ┊ GC (median):    1.90%
 Time  (mean ± σ):   86.504 ms ±   3.949 ms  ┊ GC (mean ± σ):  2.24% ±  2.12%

        ▄█▄                                                     
  ▅▃▁▁▁▅███▅▃▁▆▃▃▅▃▁▁▁▃▁▃▃▁▃▃▁▁▁▁▁▃▁▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃ ▁
  82.8 ms         Histogram: frequency by time         98.3 ms <

 Memory estimate: 76.30 MiB, allocs estimate: 3.

MtlArray AcceleratedKernels GPU Sort:
BenchmarkTools.Trial: 81 samples with 1 evaluation.
 Range (min … max):  42.047 ms …  43.518 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     42.269 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.359 ms ± 285.598 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▁█▁▃       ▁  ▃      ▁                                    
  ▄▇▄█████▇▇▇▆▇▄▄█▆▆█▆▆▄▁▁▄█▄▁▄▁▄▁▁▁▁▁▁▁▁▁▄▁▁▄▁▄▁▁▆▁▁▄▁▄▁▁▁▁▁▄ ▁
  42 ms           Histogram: frequency by time         43.2 ms <

 Memory estimate: 103.91 KiB, allocs estimate: 4012.

ThreadsX CPU Sort:
BenchmarkTools.Trial: 103 samples with 1 evaluation.
 Range (min … max):  32.894 ms … 63.143 ms  ┊ GC (min … max):  0.00% … 47.40%
 Time  (median):     40.175 ms              ┊ GC (median):    16.79%
 Time  (mean ± σ):   40.378 ms ±  3.610 ms  ┊ GC (mean ± σ):  16.35% ±  5.09%

                █▃                                             
  ▄▁▁▂▁▁▁▁▁▁▁▁▂████▄▄▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  32.9 ms         Histogram: frequency by time        61.8 ms <

 Memory estimate: 187.33 MiB, allocs estimate: 183908.
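
To put the three medians side by side, here is a small sketch that computes each run's speedup relative to Base `sort!`. The numbers are copied from the output above; the names `medians_ms` and `baseline` are just for illustration and are not part of any of the libraries.

```julia
# Median times in milliseconds, copied from the benchmark output above
medians_ms = Dict(
    "Base sort! (CPU)"     => 84.927,
    "AK.sort! (Metal GPU)" => 42.269,
    "ThreadsX.sort! (CPU)" => 40.175,
)

# Use the single-threaded Base sort as the reference point
baseline = medians_ms["Base sort! (CPU)"]

# Print slowest to fastest, with the speedup over Base rounded to 2 digits
for (name, t) in sort(collect(medians_ms); by = last, rev = true)
    println(rpad(name, 24), round(baseline / t; digits = 2), "x vs Base")
end
```

These ratios only describe this one machine, element type, and array size; other GPUs, thread counts, or values of `N` may rank differently.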