I’d start by checking whether that’s worth at all
#include "x86simdsort.h"
extern "C" void qsort_float(float *arr, size_t size) {
x86simdsort::qsort(arr, size, true);
}
extern "C" void qsort_double(double *arr, size_t size) {
x86simdsort::qsort(arr, size, true);
}
Compiled with
g++ -o libsort.so -Wall -O3 -march=native -shared sort.c -L ../builddir -lx86simdsortcpp -Wl,-rpath,../builddir
Then
julia> using BenchmarkTools, Random
julia> qsort!(x::Vector{Cfloat}) = @ccall "./libsort.so".qsort_float(x::Ptr{Cfloat}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 1 method)
julia> qsort!(x::Vector{Cdouble}) = @ccall "./libsort.so".qsort_double(x::Ptr{Cdouble}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 2 methods)
julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 140 samples with 1 evaluation per sample.
Range (min … max): 26.242 ms … 38.469 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 36.768 ms ┊ GC (median): 0.00%
Time (mean ± σ): 34.580 ms ± 4.452 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁ ▁▃ ▂ ▁ ▂
█▇▃▁▃▁▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▃▁▁▁▁▁▁▃▁▁▁▁▃▁▁▄▃▅█▅██▇███▆█▄ ▃
26.2 ms Histogram: frequency by time 38.4 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 602 samples with 1 evaluation per sample.
Range (min … max): 4.731 ms … 11.296 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.740 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.880 ms ± 1.054 ms ┊ GC (mean ± σ): 0.05% ± 0.62%
▃ ▅ █▄ ▁▂▄
▃▇█▄▁▃▁▁▁▁▂▁▃▃▅▆█▆▄▇▇███▆▅▄▃▄▃▄▅▆▇█████▄▁▁▁▃▁▁▁▁▁▂▁▁▁▂▁▁▁▃ ▄
4.73 ms Histogram: frequency by time 9.79 ms <
Memory estimate: 4.01 MiB, allocs estimate: 6.
julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 88 samples with 1 evaluation per sample.
Range (min … max): 44.550 ms … 66.835 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 55.813 ms ┊ GC (median): 0.00%
Time (mean ± σ): 54.247 ms ± 4.553 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▆██▅▂ ▃
▄▄██▄▄▄▁▁▄▅▁▁▁▁▁▄▁▁▄▁▄▁▁▁▄▄▁▅▁▁▁▄▁▄▄▄▅█████▇█▄▅▄▅▄▁▄▁▄▄▁▁▁▄ ▁
44.5 ms Histogram: frequency by time 61.3 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 308 samples with 1 evaluation per sample.
Range (min … max): 9.046 ms … 17.033 ms ┊ GC (min … max): 1.97% … 1.97%
Time (median): 14.814 ms ┊ GC (median): 1.94%
Time (mean ± σ): 14.138 ms ± 2.139 ms ┊ GC (mean ± σ): 4.03% ± 5.14%
▂█ ▃ ▄▄▁▄▁
▃▁▃▅▆█▃▃▁▁▁▁▁▁▁▁▃▃▃▃▄▄▆███▇▆▃▄▃▁▃▃▁▄▃▇▅▃▅▄▆▅▆▅▆▆▆▇█▇█████▇▄ ▄
9.05 ms Histogram: frequency by time 16.9 ms <
Memory estimate: 8.01 MiB, allocs estimate: 6.
Tested on
julia> versioninfo()
Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 192 × AMD Ryzen Threadripper PRO 7995WX 96-Cores
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 192 virtual cores)
To be clear, sorting algorithms are different, but what do you want to achieve?