Bring Intel x86 simd sort library to Julia

I’d start by checking whether that’s worth at all

#include "x86simdsort.h"

extern "C" void qsort_float(float *arr, size_t size) {
    x86simdsort::qsort(arr, size, true);
}

extern "C" void qsort_double(double *arr, size_t size) {
    x86simdsort::qsort(arr, size, true);
}

Compiled with

g++ -o libsort.so -Wall -O3 -march=native -shared sort.c -L ../builddir -lx86simdsortcpp -Wl,-rpath,../builddir

Then

julia> using BenchmarkTools, Random

julia> qsort!(x::Vector{Cfloat}) = @ccall "./libsort.so".qsort_float(x::Ptr{Cfloat}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 1 method)

julia> qsort!(x::Vector{Cdouble}) = @ccall "./libsort.so".qsort_double(x::Ptr{Cdouble}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 2 methods)

julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 140 samples with 1 evaluation per sample.
 Range (min … max):  26.242 ms … 38.469 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.768 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.580 ms ±  4.452 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                               ▁ ▁▃ ▂ ▁ ▂
  █▇▃▁▃▁▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▃▁▁▁▁▁▁▃▁▁▁▁▃▁▁▄▃▅█▅██▇███▆█▄ ▃
  26.2 ms         Histogram: frequency by time        38.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 602 samples with 1 evaluation per sample.
 Range (min … max):  4.731 ms … 11.296 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.740 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.880 ms ±  1.054 ms  ┊ GC (mean ± σ):  0.05% ± 0.62%

    ▃             ▅    █▄            ▁▂▄
  ▃▇█▄▁▃▁▁▁▁▂▁▃▃▅▆█▆▄▇▇███▆▅▄▃▄▃▄▅▆▇█████▄▁▁▁▃▁▁▁▁▁▂▁▁▁▂▁▁▁▃ ▄
  4.73 ms        Histogram: frequency by time        9.79 ms <

 Memory estimate: 4.01 MiB, allocs estimate: 6.

julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 88 samples with 1 evaluation per sample.
 Range (min … max):  44.550 ms … 66.835 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     55.813 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   54.247 ms ±  4.553 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                        ▆██▅▂ ▃
  ▄▄██▄▄▄▁▁▄▅▁▁▁▁▁▄▁▁▄▁▄▁▁▁▄▄▁▅▁▁▁▄▁▄▄▄▅█████▇█▄▅▄▅▄▁▄▁▄▄▁▁▁▄ ▁
  44.5 ms         Histogram: frequency by time        61.3 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 308 samples with 1 evaluation per sample.
 Range (min … max):   9.046 ms … 17.033 ms  ┊ GC (min … max): 1.97% … 1.97%
 Time  (median):     14.814 ms              ┊ GC (median):    1.94%
 Time  (mean ± σ):   14.138 ms ±  2.139 ms  ┊ GC (mean ± σ):  4.03% ± 5.14%

                         ▂█                         ▃ ▄▄▁▄▁
  ▃▁▃▅▆█▃▃▁▁▁▁▁▁▁▁▃▃▃▃▄▄▆███▇▆▃▄▃▁▃▃▁▄▃▇▅▃▅▄▆▅▆▅▆▆▆▇█▇█████▇▄ ▄
  9.05 ms         Histogram: frequency by time        16.9 ms <

 Memory estimate: 8.01 MiB, allocs estimate: 6.

Tested on

julia> versioninfo()
Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 192 × AMD Ryzen Threadripper PRO 7995WX 96-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 192 virtual cores)

To be clear, sorting algorithms are different, but what do you want to achieve?

6 Likes