Bring Intel x86 simd sort library to Julia

gitboy16 · August 19, 2025, 8:00pm

Hi,

Intel x86 simd sort library can be used via C++ template.

If I wanted to make this library available to the maximum of Julia users (a bit like numpy did) what would be the best steps to follow?

Should I use BinaryBuilder.jl?

Thank you

Oscar_Smith · August 19, 2025, 8:22pm

BinaryBuilder would be the way to do this. that said, as an alternative, it shouldn’t be too hard to copy the algorithm to Julia

giordano · August 19, 2025, 8:54pm

I’d start by checking whether that’s worth at all

#include "x86simdsort.h"

extern "C" void qsort_float(float *arr, size_t size) {
    x86simdsort::qsort(arr, size, true);
}

extern "C" void qsort_double(double *arr, size_t size) {
    x86simdsort::qsort(arr, size, true);
}

Compiled with

g++ -o libsort.so -Wall -O3 -march=native -shared sort.c -L ../builddir -lx86simdsortcpp -Wl,-rpath,../builddir

Then

julia> using BenchmarkTools, Random

julia> qsort!(x::Vector{Cfloat}) = @ccall "./libsort.so".qsort_float(x::Ptr{Cfloat}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 1 method)

julia> qsort!(x::Vector{Cdouble}) = @ccall "./libsort.so".qsort_double(x::Ptr{Cdouble}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 2 methods)

julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 140 samples with 1 evaluation per sample.
 Range (min … max):  26.242 ms … 38.469 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.768 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.580 ms ±  4.452 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                               ▁ ▁▃ ▂ ▁ ▂
  █▇▃▁▃▁▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▃▁▁▁▁▁▁▃▁▁▁▁▃▁▁▄▃▅█▅██▇███▆█▄ ▃
  26.2 ms         Histogram: frequency by time        38.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 602 samples with 1 evaluation per sample.
 Range (min … max):  4.731 ms … 11.296 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.740 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.880 ms ±  1.054 ms  ┊ GC (mean ± σ):  0.05% ± 0.62%

    ▃             ▅    █▄            ▁▂▄
  ▃▇█▄▁▃▁▁▁▁▂▁▃▃▅▆█▆▄▇▇███▆▅▄▃▄▃▄▅▆▇█████▄▁▁▁▃▁▁▁▁▁▂▁▁▁▂▁▁▁▃ ▄
  4.73 ms        Histogram: frequency by time        9.79 ms <

 Memory estimate: 4.01 MiB, allocs estimate: 6.

julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 88 samples with 1 evaluation per sample.
 Range (min … max):  44.550 ms … 66.835 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     55.813 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   54.247 ms ±  4.553 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                        ▆██▅▂ ▃
  ▄▄██▄▄▄▁▁▄▅▁▁▁▁▁▄▁▁▄▁▄▁▁▁▄▄▁▅▁▁▁▄▁▄▄▄▅█████▇█▄▅▄▅▄▁▄▁▄▄▁▁▁▄ ▁
  44.5 ms         Histogram: frequency by time        61.3 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 308 samples with 1 evaluation per sample.
 Range (min … max):   9.046 ms … 17.033 ms  ┊ GC (min … max): 1.97% … 1.97%
 Time  (median):     14.814 ms              ┊ GC (median):    1.94%
 Time  (mean ± σ):   14.138 ms ±  2.139 ms  ┊ GC (mean ± σ):  4.03% ± 5.14%

                         ▂█                         ▃ ▄▄▁▄▁
  ▃▁▃▅▆█▃▃▁▁▁▁▁▁▁▁▃▃▃▃▄▄▆███▇▆▃▄▃▁▃▃▁▄▃▇▅▃▅▄▆▅▆▅▆▆▆▇█▇█████▇▄ ▄
  9.05 ms         Histogram: frequency by time        16.9 ms <

 Memory estimate: 8.01 MiB, allocs estimate: 6.

Tested on

julia> versioninfo()
Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 192 × AMD Ryzen Threadripper PRO 7995WX 96-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 192 virtual cores)

To be clear, sorting algorithms are different, but what do you want to achieve?

gitboy16 · August 19, 2025, 9:03pm

I am not getting the same benchmark especially that i have avx-512 on my machine. Plus you are are not using multithreading. I have intel sort faster than julia sort by 6x and 20x with multithreading.
Also look at your post:

I think you should compile with the -mavx512f ..etc flags, you also don’t need to build the library, you can use the templates in src.

giordano · August 19, 2025, 9:07pm

Care to share what you tried?

Of course I didn’t, julia’s sort! isn’t multi-threaded, so the comparison would be unfair.

Did you notice the -march=native? Also, did you read the README of the library?

The library auto picks the best version depending on the processor it is run on.

gitboy16 · August 19, 2025, 9:12pm

Read this instead that is what I tried and with avx-512 julia sort is no way near it. It completely left in the dust for a vector of 10^8 doubles.

github.com/intel/x86-simd-sort

src/README.md

main

# x86-simd-sort

C++ header file library for SIMD based 16-bit, 32-bit and 64-bit data type
sorting algorithms on x86 processors. We currently have AVX-512 and AVX2 based
implementation of quicksort, quickselect, partialsort, argsort, argselect &
key-value sort. The static methods can be used by including
`src/x86simdsort-static-incl.h` file. Compiling them with the appropriate
compiler flags will choose either the AVX-512 or AVX2 versions. For AVX-512, we
recommend using -march=skylake-avx512 for 32-bit and 64-bit datatypes,
-march=icelake-client for 16-bit datatype and -march=sapphirerapids for
_Float16. For AVX2 just using -mavx2 will suffice. The following API's are
currently supported:

#### Quicksort

Equivalent to `qsort` in
[C](https://www.tutorialspoint.com/c_standard_library/c_function_qsort.htm) or
`std::sort` in [C++](https://en.cppreference.com/w/cpp/algorithm/sort).

```cpp

This file has been truncated. show original

It is much easier to reproduce assuming you have the required hardware.

gitboy16 · August 19, 2025, 9:14pm

If you don’t have a intel processor with avx-512, there is no point comparing them. That is what the intel’ algorithm has been dsigned for.

gitboy16 · August 19, 2025, 9:19pm

Thank you. I wish I could! But I don’t have the knowledge.

jling · August 19, 2025, 9:19pm

what?.. I think if it’s 20x faster single-thread, I might call julia left in the dust.

gitboy16 · August 19, 2025, 9:21pm

20x using avx512 and multithreaded. 6x with avx512 only (and one thread)
Julia sort is unfortunetely not implemented for multithreaded.

giordano · August 19, 2025, 9:24pm

You insist on not showing the code you run, I insist on showing mine:

#include "src/x86simdsort-static-incl.h"

extern "C" void qsort_float(float *arr, size_t size) {
    x86simdsortStatic::qsort(arr, size, true);
}

extern "C" void qsort_double(double *arr, size_t size) {
    x86simdsortStatic::qsort(arr, size, true);
}

Compiled with

g++ -o libsort.so -Wall -O3 -fPIC -march=native -shared sort.c

Then in Julia

julia> using BenchmarkTools, Random

julia> qsort!(x::Vector{Cfloat}) = @ccall "./libsort.so".qsort_float(x::Ptr{Cfloat}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 1 method)

julia> qsort!(x::Vector{Cdouble}) = @ccall "./libsort.so".qsort_double(x::Ptr{Cdouble}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 2 methods)

julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 142 samples with 1 evaluation per sample.
 Range (min … max):  25.982 ms … 38.337 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.373 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.133 ms ±  4.446 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▃                                              ▃▂▄▁█▄▂▁
  ██▆▄▁▁▁▁▃▁▃▃▁▁▅▁▁▁▁▁▁▁▄▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▄▆▇████████▆▆▄ ▃
  26 ms           Histogram: frequency by time        38.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 625 samples with 1 evaluation per sample.
 Range (min … max):  4.554 ms …   9.635 ms  ┊ GC (min … max): 0.00% … 1.84%
 Time  (median):     6.556 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.653 ms ± 980.903 μs  ┊ GC (mean ± σ):  0.32% ± 1.70%

  ▁▆                █      ▂▃▄▁▆▃            ▂▂▂▅▃▆
  ██▃▃▁▁▁▁▁▁▁▁▁▂▁▅▅▆██▅▇▇▇▅██████▇▆▄▄▃▆▃▅▅▆▅▇███████▆▄▃▂▁▁▃▂▂ ▄
  4.55 ms         Histogram: frequency by time        8.51 ms <

 Memory estimate: 4.01 MiB, allocs estimate: 6.

julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 83 samples with 1 evaluation per sample.
 Range (min … max):  44.114 ms … 63.417 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     60.562 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.460 ms ±  6.464 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                    ▁█▄▄▇
  ▅▆▅▅█▅▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▅▁▃▁▁▃▃▁▁█▆██████▅▆█ ▁
  44.1 ms         Histogram: frequency by time          63 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 303 samples with 1 evaluation per sample.
 Range (min … max):   9.049 ms … 26.193 ms  ┊ GC (min … max): 1.96% … 0.49%
 Time  (median):     14.754 ms              ┊ GC (median):    1.53%
 Time  (mean ± σ):   14.044 ms ±  2.186 ms  ┊ GC (mean ± σ):  3.04% ± 4.35%

                      ▁▃▄ ▇ ▂               ▁▁  ▁▆█▅█▃▄▅▆▃▁
  ▃▅▅▅▆▃▄▄▆▅▃▃▁▁▁▁▃▃▅▄█████▅█▅█▄▇▅▄▄▄▄▃█▄▃█▅███▇███████████▆▆ ▅
  9.05 ms         Histogram: frequency by time        16.8 ms <

 Memory estimate: 8.01 MiB, allocs estimate: 6.

which, unsurprisingly, matches the benchmarks I did above.

Did you bother checking the specs of the CPU I used, which I had shared above?

jling · August 19, 2025, 9:26pm

I think their point is it’s not an Intel CPU… which, idk, maybe Intel is doing the MKL shit again?

giordano · August 19, 2025, 9:28pm

I don’t think they can do that in open source code, they just use intrinsics.

gitboy16 · August 19, 2025, 9:29pm

Just use numpy, it is using it under the hood. I am on my phone, don’t have the code here. Plus it is c++ but i can post it later. And I said 10^8 double not 2^20 float. I will post everything later.

giordano · August 19, 2025, 9:36pm

As a matter of fact, I did, and again got similar results.

julia> using CondaPkg, PythonCall, BenchmarkTools, Random
[...]

(jl_paD8yw) pkg> conda add numpy
[...]

julia> numpy = pyimport("numpy")
Python: <module 'numpy' from '/tmp/jl_paD8yw/.CondaPkg/.pixi/envs/default/lib/python3.13/site-packages/numpy/__init__.py'>

julia> @benchmark numpy.sort(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 139 samples with 1 evaluation per sample.
 Range (min … max):  25.406 ms … 46.719 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.977 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.103 ms ±  5.572 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁                           ▁▄█▂▃
  █▇██▅▁▃▃▅▁▃▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▅█████▇▆▇█▃▆▃▁▁▁▄▃▁▁▁▄▃▄▃▆▃▄▁▃▃▅ ▃
  25.4 ms         Histogram: frequency by time        45.6 ms <

 Memory estimate: 632 bytes, allocs estimate: 29.

julia> @benchmark sort(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 578 samples with 1 evaluation per sample.
 Range (min … max):  4.619 ms … 10.924 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.478 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.329 ms ±  1.471 ms  ┊ GC (mean ± σ):  2.69% ± 5.54%

   █        █              ▁▄▂▅▃▃▂▂
  ▆█▃▃▁▃▂▃▃▅██▄▃▅▆▅▄▄▅▄▇▆▇█████████▆▄▇▄▇▇▇▅▃▃▁▂▃▂▃▁▃▂▂▃▃▄▅▅▅ ▄
  4.62 ms        Histogram: frequency by time        10.7 ms <

 Memory estimate: 8.01 MiB, allocs estimate: 9.

julia> @benchmark numpy.sort(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 82 samples with 1 evaluation per sample.
 Range (min … max):  44.839 ms … 71.773 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     61.541 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   58.657 ms ±  6.979 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆                                    ▂▂▆▂▃▆█▃
  █▇▇▁▁▁▁▄▄▁▄▁▁▁▁▁▄▁▁▁▁▁▁▁▄▁▁▁▄▁▁▄▁▁▄▁▇████████▅▁▄▁▁▁▁▁▁▁▁▁▁▄ ▁
  44.8 ms         Histogram: frequency by time        69.6 ms <

 Memory estimate: 616 bytes, allocs estimate: 28.

julia> @benchmark sort(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 257 samples with 1 evaluation per sample.
 Range (min … max):   9.808 ms … 21.534 ms  ┊ GC (min … max): 3.28% … 2.26%
 Time  (median):     16.556 ms              ┊ GC (median):    2.58%
 Time  (mean ± σ):   16.506 ms ±  3.351 ms  ┊ GC (mean ± σ):  6.49% ± 7.01%

                █ ▇▁                 ▁                  ▁▁▁
  ▃▃▁▁▁▁▁▁▁▁▃▅▄▆█▇██▃▃▃▁▁▁▁▁▁▁▃▃▄▅▅▇▆█▄▅▆▄▃▁▃▁▃▁▃▁▃▃▅▅██████▇ ▃
  9.81 ms         Histogram: frequency by time        21.1 ms <

 Memory estimate: 16.01 MiB, allocs estimate: 9.

As I said above, these are just the same results as when using the library directly.

julia> @benchmark numpy.sort(x) setup=(Random.seed!(123); x = rand(Float64, 10 ^ 8)) evals=1
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 8.060 s (0.00% GC) to evaluate,
 with a memory estimate of 616 bytes, over 28 allocations.

julia> @benchmark sort(x) setup=(Random.seed!(123); x = rand(Float64, 10 ^ 8)) evals=1
BenchmarkTools.Trial: 2 samples with 1 evaluation per sample.
 Range (min … max):  2.516 s …    2.703 s  ┊ GC (min … max): 0.05% … 5.43%
 Time  (median):     2.609 s               ┊ GC (median):    2.84%
 Time  (mean ± σ):   2.609 s ± 132.008 ms  ┊ GC (mean ± σ):  2.84% ± 3.81%

  █                                                        █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.52 s         Histogram: frequency by time          2.7 s <

 Memory estimate: 1.49 GiB, allocs estimate: 9.

Anything else I should try?

jling · August 19, 2025, 9:37pm

with that much memory movement, are you sure you’re not seeing some GC artifact? i.e. Python turns off GC when you time it, Julia maybe not?

gitboy16 · August 19, 2025, 9:38pm

We are clearly not getting the same results. Sorry for that.
What I don’t understand is that in anotger thread that I linked above you are showing numpy outperforming julia. So it seems that you have inconsistent results.

giordano · August 19, 2025, 9:44pm

Did you notice the different CPU (that was AVX2, this is AVX512)?

gitboy16 · August 19, 2025, 9:58pm

Which makes no sense at all
Anyway let’s put that to bed, I think we are both wasting our time on this one.

Benny · August 19, 2025, 10:30pm

Is the library really specific to Intel processors, not x86 with AVX2 or AVX512 in general?

Topic		Replies	Views
Why is Julia faster than C++ for quicksort? Performance performance , quicksort	15	2091	August 15, 2023
Ironic observation about `sort` and `sortperm` speed for "small integers" vs R Performance sort , sortperm , r	32	4788	February 4, 2018
Squeeze out the last 10% of performance for a sorting function? Performance sort	26	3556	July 18, 2021
In as few lines as possible describe why you love julia New to Julia	113	12743	May 21, 2021
Julia programs now shown on benchmarks game website Community announcement	144	14242	December 3, 2019

Bring Intel x86 simd sort library to Julia

Related topics