Thank you for trying it out! ThreadsX.jl is a lovely library; I have used it in the past. I just compared sorting times against it:
# File: sort_benchmark.jl
import ThreadsX
import AcceleratedKernels as AK
using BenchmarkTools
using Random
Random.seed!(0)
# Metal backend; swap with CUDA/CuArray, AMDGPU/ROCArray, oneAPI/oneArray
using Metal
const DeviceArray = MtlArray
# Benchmark settings
const DTYPE = Int64
const N = 10_000_000
function aksort!(v, temp)
    # Separate function to add GPU synchronization
    AK.sort!(v, temp=temp, block_size=256)
    synchronize()
end
println("Julia Base CPU Sort:")
display(@benchmark sort!(v) setup=(v = rand(DTYPE, N)))
println("$DeviceArray AcceleratedKernels GPU Sort:")
temp = DeviceArray(Vector{DTYPE}(undef, N))
display(@benchmark aksort!(v, $temp) setup=(v = DeviceArray(rand(DTYPE, N))))
println("ThreadsX CPU Sort:")
display(@benchmark ThreadsX.sort!(v) setup=(v = rand(DTYPE, N)))
On my Mac M3 Max with 10 “performance cores”, running
julia --project=. --threads=10 sort_benchmark.jl
gives:
Activating project at `~/Prog/Julia/Packages/SortTestThreadsX`
Julia Base CPU Sort:
BenchmarkTools.Trial: 54 samples with 1 evaluation.
Range (min … max): 82.757 ms … 104.819 ms ┊ GC (min … max): 0.00% … 13.88%
Time (median): 84.927 ms ┊ GC (median): 1.90%
Time (mean ± σ): 86.504 ms ± 3.949 ms ┊ GC (mean ± σ): 2.24% ± 2.12%
▄█▄
▅▃▁▁▁▅███▅▃▁▆▃▃▅▃▁▁▁▃▁▃▃▁▃▃▁▁▁▁▁▃▁▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃ ▁
82.8 ms Histogram: frequency by time 98.3 ms <
Memory estimate: 76.30 MiB, allocs estimate: 3.
MtlArray AcceleratedKernels GPU Sort:
BenchmarkTools.Trial: 81 samples with 1 evaluation.
Range (min … max): 42.047 ms … 43.518 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 42.269 ms ┊ GC (median): 0.00%
Time (mean ± σ): 42.359 ms ± 285.598 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁█▁▃ ▁ ▃ ▁
▄▇▄█████▇▇▇▆▇▄▄█▆▆█▆▆▄▁▁▄█▄▁▄▁▄▁▁▁▁▁▁▁▁▁▄▁▁▄▁▄▁▁▆▁▁▄▁▄▁▁▁▁▁▄ ▁
42 ms Histogram: frequency by time 43.2 ms <
Memory estimate: 103.91 KiB, allocs estimate: 4012.
ThreadsX CPU Sort:
BenchmarkTools.Trial: 103 samples with 1 evaluation.
Range (min … max): 32.894 ms … 63.143 ms ┊ GC (min … max): 0.00% … 47.40%
Time (median): 40.175 ms ┊ GC (median): 16.79%
Time (mean ± σ): 40.378 ms ± 3.610 ms ┊ GC (mean ± σ): 16.35% ± 5.09%
█▃
▄▁▁▂▁▁▁▁▁▁▁▁▂████▄▄▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
32.9 ms Histogram: frequency by time 61.8 ms <
Memory estimate: 187.33 MiB, allocs estimate: 183908.
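To put the three medians side by side, here is a small sketch that turns them into speedups over Base `sort!`. The numbers are copied by hand from the output above; `MEDIAN_MS` and `speedup` are names I made up for this illustration, not part of any of the packages.

```julia
# Median times in milliseconds, copied from the BenchmarkTools output above
const MEDIAN_MS = (base = 84.927, ak_gpu = 42.269, threadsx = 40.175)

# Hypothetical helper: how many times faster than the Base.sort! median
speedup(t_ms) = MEDIAN_MS.base / t_ms

println("AcceleratedKernels GPU vs Base: ", round(speedup(MEDIAN_MS.ak_gpu), digits=2), "x")
println("ThreadsX CPU vs Base:           ", round(speedup(MEDIAN_MS.threadsx), digits=2), "x")
```

On this one run both come out at roughly 2x over Base, with the GPU and ThreadsX medians within about 5% of each other. The ThreadsX run also shows a much larger memory estimate (187.33 MiB against 103.91 KiB) and nonzero GC time, so the medians alone do not tell the whole story.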