Bringing the Intel x86-simd-sort library to Julia

Hi,

Intel's x86-simd-sort library is exposed through C++ templates.

If I wanted to make this library available to as many Julia users as possible (a bit like NumPy did), what would be the best steps to follow?

Should I use BinaryBuilder.jl?

Thank you

BinaryBuilder would be the way to do this. That said, as an alternative, it shouldn’t be too hard to port the algorithm to Julia.

2 Likes

I’d start by checking whether it’s worth it at all.

#include "x86simdsort.h"

extern "C" void qsort_float(float *arr, size_t size) {
    x86simdsort::qsort(arr, size, true);
}

extern "C" void qsort_double(double *arr, size_t size) {
    x86simdsort::qsort(arr, size, true);
}

Compiled with

g++ -o libsort.so -Wall -O3 -march=native -shared sort.c -L ../builddir -lx86simdsortcpp -Wl,-rpath,../builddir

Then

julia> using BenchmarkTools, Random

julia> qsort!(x::Vector{Cfloat}) = @ccall "./libsort.so".qsort_float(x::Ptr{Cfloat}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 1 method)

julia> qsort!(x::Vector{Cdouble}) = @ccall "./libsort.so".qsort_double(x::Ptr{Cdouble}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 2 methods)

julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 140 samples with 1 evaluation per sample.
 Range (min … max):  26.242 ms … 38.469 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     36.768 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   34.580 ms Β±  4.452 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ                                               ▁ ▁▃ β–‚ ▁ β–‚
  β–ˆβ–‡β–ƒβ–β–ƒβ–β–β–β–β–β–ƒβ–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–ƒβ–ƒβ–β–β–β–β–β–β–ƒβ–β–β–β–β–ƒβ–β–β–„β–ƒβ–…β–ˆβ–…β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–†β–ˆβ–„ β–ƒ
  26.2 ms         Histogram: frequency by time        38.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 602 samples with 1 evaluation per sample.
 Range (min … max):  4.731 ms … 11.296 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     6.740 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   6.880 ms Β±  1.054 ms  β”Š GC (mean Β± Οƒ):  0.05% Β± 0.62%

    β–ƒ             β–…    β–ˆβ–„            ▁▂▄
  β–ƒβ–‡β–ˆβ–„β–β–ƒβ–β–β–β–β–‚β–β–ƒβ–ƒβ–…β–†β–ˆβ–†β–„β–‡β–‡β–ˆβ–ˆβ–ˆβ–†β–…β–„β–ƒβ–„β–ƒβ–„β–…β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–„β–β–β–β–ƒβ–β–β–β–β–β–‚β–β–β–β–‚β–β–β–β–ƒ β–„
  4.73 ms        Histogram: frequency by time        9.79 ms <

 Memory estimate: 4.01 MiB, allocs estimate: 6.

julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 88 samples with 1 evaluation per sample.
 Range (min … max):  44.550 ms … 66.835 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     55.813 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   54.247 ms Β±  4.553 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                        β–†β–ˆβ–ˆβ–…β–‚ β–ƒ
  β–„β–„β–ˆβ–ˆβ–„β–„β–„β–β–β–„β–…β–β–β–β–β–β–„β–β–β–„β–β–„β–β–β–β–„β–„β–β–…β–β–β–β–„β–β–„β–„β–„β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–„β–…β–„β–…β–„β–β–„β–β–„β–„β–β–β–β–„ ▁
  44.5 ms         Histogram: frequency by time        61.3 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 308 samples with 1 evaluation per sample.
 Range (min … max):   9.046 ms … 17.033 ms  β”Š GC (min … max): 1.97% … 1.97%
 Time  (median):     14.814 ms              β”Š GC (median):    1.94%
 Time  (mean Β± Οƒ):   14.138 ms Β±  2.139 ms  β”Š GC (mean Β± Οƒ):  4.03% Β± 5.14%

                         β–‚β–ˆ                         β–ƒ ▄▄▁▄▁
  β–ƒβ–β–ƒβ–…β–†β–ˆβ–ƒβ–ƒβ–β–β–β–β–β–β–β–β–ƒβ–ƒβ–ƒβ–ƒβ–„β–„β–†β–ˆβ–ˆβ–ˆβ–‡β–†β–ƒβ–„β–ƒβ–β–ƒβ–ƒβ–β–„β–ƒβ–‡β–…β–ƒβ–…β–„β–†β–…β–†β–…β–†β–†β–†β–‡β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–„ β–„
  9.05 ms         Histogram: frequency by time        16.9 ms <

 Memory estimate: 8.01 MiB, allocs estimate: 6.

Tested on

julia> versioninfo()
Julia Version 1.11.6
Commit 9615af0f269 (2025-07-09 12:58 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 192 Γ— AMD Ryzen Threadripper PRO 7995WX 96-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 192 virtual cores)
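For anyone who wants to test the `@ccall` plumbing without building the library first, a portable scalar stand-in with the same C ABI is easy to sketch. This uses plain `std::sort` rather than the SIMD code and ignores the library's NaN-handling flag, so it says nothing about performance:

```cpp
#include <algorithm>
#include <cstddef>

// Scalar stand-ins matching the wrapper signatures above.
// std::sort instead of x86simdsort::qsort: no SIMD, no special
// NaN handling -- just enough to exercise the Julia side.
extern "C" void qsort_float(float *arr, size_t size) {
    std::sort(arr, arr + size);
}

extern "C" void qsort_double(double *arr, size_t size) {
    std::sort(arr, arr + size);
}
```

Compile it the same way (minus the library flags) and the Julia definitions below work unchanged.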

To be clear, the sorting algorithms are different, but what do you want to achieve?

6 Likes

I am not getting the same benchmark results, especially since I have AVX-512 on my machine. Plus, you are not using multithreading. I see the Intel sort beating Julia's sort by 6x, and by 20x with multithreading.
Also, look at your post:

I think you should compile with the -mavx512f etc. flags. You also don’t need to build the library; you can use the templates in src.

Care to share what you tried?

Of course I didn’t: Julia’s sort! isn’t multithreaded, so the comparison would be unfair.

Did you notice the -march=native? Also, did you read the README of the library?

The library auto picks the best version depending on the processor it is run on.
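You can check what your CPU reports directly; GCC and Clang expose the same CPUID information that runtime dispatchers rely on. A quick sketch (this is not the library's actual dispatch code, and the avx512f+avx512bw combination below is my guess at a relevant gate, not the library's):

```cpp
#include <cstdio>

// __builtin_cpu_supports reads CPUID at run time -- the same kind of
// check a runtime dispatcher performs.
bool cpu_has_avx512() {
    // Assumed feature combination for illustration only.
    return __builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512bw");
}

void print_simd_level() {
    if (cpu_has_avx512())
        std::printf("AVX-512 path available\n");
    else if (__builtin_cpu_supports("avx2"))
        std::printf("AVX2 path available\n");
    else
        std::printf("scalar fallback\n");
}
```

Running this on both machines would settle which code path each benchmark actually exercised.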

2 Likes

Read this instead; that is what I tried, and with AVX-512, Julia's sort is nowhere near it. It is completely left in the dust for a vector of 10^8 doubles.

It is much easier to reproduce assuming you have the required hardware.

If you don’t have an Intel processor with AVX-512, there is no point comparing them. That is what Intel’s algorithm has been designed for.

Thank you. I wish I could! But I don’t have the knowledge.

What? I think if it were 20x faster single-threaded, I might call Julia left in the dust.

20x using AVX-512 and multithreading; 6x with AVX-512 only (one thread).
Julia's sort is unfortunately not multithreaded.
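For intuition, the simplest multithreaded scheme is sorting halves on separate threads and merging. A minimal C++ sketch (my illustration only; the library's parallel algorithm presumably uses a different, more scalable strategy):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Two-thread sort-then-merge sketch: sort each half concurrently,
// then merge the sorted halves in place.
void parallel_sort2(std::vector<double> &v) {
    auto mid = v.begin() + v.size() / 2;
    std::thread t([&] { std::sort(v.begin(), mid); });
    std::sort(mid, v.end());
    t.join();
    std::inplace_merge(v.begin(), mid, v.end());
}
```

Even this naive split gives a noticeable speedup on large vectors, which is part of why single-thread vs. multithread comparisons need to be kept separate.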

You insist on not showing the code you run; I insist on showing mine:

#include "src/x86simdsort-static-incl.h"

extern "C" void qsort_float(float *arr, size_t size) {
    x86simdsortStatic::qsort(arr, size, true);
}

extern "C" void qsort_double(double *arr, size_t size) {
    x86simdsortStatic::qsort(arr, size, true);
}

Compiled with

g++ -o libsort.so -Wall -O3 -fPIC -march=native -shared sort.c

Then in Julia

julia> using BenchmarkTools, Random

julia> qsort!(x::Vector{Cfloat}) = @ccall "./libsort.so".qsort_float(x::Ptr{Cfloat}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 1 method)

julia> qsort!(x::Vector{Cdouble}) = @ccall "./libsort.so".qsort_double(x::Ptr{Cdouble}, length(x)::Csize_t)::Cvoid
qsort! (generic function with 2 methods)

julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 142 samples with 1 evaluation per sample.
 Range (min … max):  25.982 ms … 38.337 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     36.373 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   34.133 ms Β±  4.446 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–†β–ƒ                                              β–ƒβ–‚β–„β–β–ˆβ–„β–‚β–
  β–ˆβ–ˆβ–†β–„β–β–β–β–β–ƒβ–β–ƒβ–ƒβ–β–β–…β–β–β–β–β–β–β–β–„β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–„β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–„ β–ƒ
  26 ms           Histogram: frequency by time        38.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 625 samples with 1 evaluation per sample.
 Range (min … max):  4.554 ms …   9.635 ms  β”Š GC (min … max): 0.00% … 1.84%
 Time  (median):     6.556 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   6.653 ms Β± 980.903 ΞΌs  β”Š GC (mean Β± Οƒ):  0.32% Β± 1.70%

  ▁▆                β–ˆ      ▂▃▄▁▆▃            β–‚β–‚β–‚β–…β–ƒβ–†
  β–ˆβ–ˆβ–ƒβ–ƒβ–β–β–β–β–β–β–β–β–β–‚β–β–…β–…β–†β–ˆβ–ˆβ–…β–‡β–‡β–‡β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–„β–„β–ƒβ–†β–ƒβ–…β–…β–†β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–‚β–β–β–ƒβ–‚β–‚ β–„
  4.55 ms         Histogram: frequency by time        8.51 ms <

 Memory estimate: 4.01 MiB, allocs estimate: 6.

julia> @benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 83 samples with 1 evaluation per sample.
 Range (min … max):  44.114 ms … 63.417 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     60.562 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   57.460 ms Β±  6.464 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                                    β–β–ˆβ–„β–„β–‡
  β–…β–†β–…β–…β–ˆβ–…β–β–ƒβ–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–…β–β–ƒβ–β–β–ƒβ–ƒβ–β–β–ˆβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–†β–ˆ ▁
  44.1 ms         Histogram: frequency by time          63 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 303 samples with 1 evaluation per sample.
 Range (min … max):   9.049 ms … 26.193 ms  β”Š GC (min … max): 1.96% … 0.49%
 Time  (median):     14.754 ms              β”Š GC (median):    1.53%
 Time  (mean Β± Οƒ):   14.044 ms Β±  2.186 ms  β”Š GC (mean Β± Οƒ):  3.04% Β± 4.35%

                      ▁▃▄ β–‡ β–‚               ▁▁  β–β–†β–ˆβ–…β–ˆβ–ƒβ–„β–…β–†β–ƒβ–
  β–ƒβ–…β–…β–…β–†β–ƒβ–„β–„β–†β–…β–ƒβ–ƒβ–β–β–β–β–ƒβ–ƒβ–…β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–ˆβ–…β–ˆβ–„β–‡β–…β–„β–„β–„β–„β–ƒβ–ˆβ–„β–ƒβ–ˆβ–…β–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–† β–…
  9.05 ms         Histogram: frequency by time        16.8 ms <

 Memory estimate: 8.01 MiB, allocs estimate: 6.

which, unsurprisingly, matches the benchmarks I did above.

Did you bother checking the specs of the CPU I used, which I had shared above?

2 Likes

I think their point is that it’s not an Intel CPU… which, I don’t know, maybe Intel is doing the MKL thing again?

I don’t think they can do that in open-source code; they just use intrinsics.

2 Likes

Just use NumPy; it is using the library under the hood. I am on my phone and don’t have the code here. Plus, it is C++, but I can post it later. And I said 10^8 doubles, not 2^20 floats. I will post everything later.

As a matter of fact, I did, and again got similar results.

julia> using CondaPkg, PythonCall, BenchmarkTools, Random
[...]

(jl_paD8yw) pkg> conda add numpy
[...]

julia> numpy = pyimport("numpy")
Python: <module 'numpy' from '/tmp/jl_paD8yw/.CondaPkg/.pixi/envs/default/lib/python3.13/site-packages/numpy/__init__.py'>

julia> @benchmark numpy.sort(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 139 samples with 1 evaluation per sample.
 Range (min … max):  25.406 ms … 46.719 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     35.977 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   35.103 ms Β±  5.572 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  ▁                           β–β–„β–ˆβ–‚β–ƒ
  β–ˆβ–‡β–ˆβ–ˆβ–…β–β–ƒβ–ƒβ–…β–β–ƒβ–β–β–β–β–ƒβ–β–β–β–ƒβ–β–β–β–β–β–β–β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–‡β–ˆβ–ƒβ–†β–ƒβ–β–β–β–„β–ƒβ–β–β–β–„β–ƒβ–„β–ƒβ–†β–ƒβ–„β–β–ƒβ–ƒβ–… β–ƒ
  25.4 ms         Histogram: frequency by time        45.6 ms <

 Memory estimate: 632 bytes, allocs estimate: 29.

julia> @benchmark sort(x) setup=(Random.seed!(123); x = rand(Float32, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 578 samples with 1 evaluation per sample.
 Range (min … max):  4.619 ms … 10.924 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     7.478 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   7.329 ms Β±  1.471 ms  β”Š GC (mean Β± Οƒ):  2.69% Β± 5.54%

   β–ˆ        β–ˆ              ▁▄▂▅▃▃▂▂
  β–†β–ˆβ–ƒβ–ƒβ–β–ƒβ–‚β–ƒβ–ƒβ–…β–ˆβ–ˆβ–„β–ƒβ–…β–†β–…β–„β–„β–…β–„β–‡β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–‡β–„β–‡β–‡β–‡β–…β–ƒβ–ƒβ–β–‚β–ƒβ–‚β–ƒβ–β–ƒβ–‚β–‚β–ƒβ–ƒβ–„β–…β–…β–… β–„
  4.62 ms        Histogram: frequency by time        10.7 ms <

 Memory estimate: 8.01 MiB, allocs estimate: 9.

julia> @benchmark numpy.sort(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 82 samples with 1 evaluation per sample.
 Range (min … max):  44.839 ms … 71.773 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     61.541 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   58.657 ms Β±  6.979 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–†                                    β–‚β–‚β–†β–‚β–ƒβ–†β–ˆβ–ƒ
  β–ˆβ–‡β–‡β–β–β–β–β–„β–„β–β–„β–β–β–β–β–β–„β–β–β–β–β–β–β–β–„β–β–β–β–„β–β–β–„β–β–β–„β–β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–β–„β–β–β–β–β–β–β–β–β–β–β–„ ▁
  44.8 ms         Histogram: frequency by time        69.6 ms <

 Memory estimate: 616 bytes, allocs estimate: 28.

julia> @benchmark sort(x) setup=(Random.seed!(123); x = rand(Float64, 2 ^ 20)) evals=1
BenchmarkTools.Trial: 257 samples with 1 evaluation per sample.
 Range (min … max):   9.808 ms … 21.534 ms  β”Š GC (min … max): 3.28% … 2.26%
 Time  (median):     16.556 ms              β”Š GC (median):    2.58%
 Time  (mean Β± Οƒ):   16.506 ms Β±  3.351 ms  β”Š GC (mean Β± Οƒ):  6.49% Β± 7.01%

                β–ˆ ▇▁                 ▁                  ▁▁▁
  β–ƒβ–ƒβ–β–β–β–β–β–β–β–β–ƒβ–…β–„β–†β–ˆβ–‡β–ˆβ–ˆβ–ƒβ–ƒβ–ƒβ–β–β–β–β–β–β–β–ƒβ–ƒβ–„β–…β–…β–‡β–†β–ˆβ–„β–…β–†β–„β–ƒβ–β–ƒβ–β–ƒβ–β–ƒβ–β–ƒβ–ƒβ–…β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡ β–ƒ
  9.81 ms         Histogram: frequency by time        21.1 ms <

 Memory estimate: 16.01 MiB, allocs estimate: 9.

As I said above, these are just the same results as when using the library directly.

julia> @benchmark numpy.sort(x) setup=(Random.seed!(123); x = rand(Float64, 10 ^ 8)) evals=1
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 8.060 s (0.00% GC) to evaluate,
 with a memory estimate of 616 bytes, over 28 allocations.

julia> @benchmark sort(x) setup=(Random.seed!(123); x = rand(Float64, 10 ^ 8)) evals=1
BenchmarkTools.Trial: 2 samples with 1 evaluation per sample.
 Range (min … max):  2.516 s …    2.703 s  β”Š GC (min … max): 0.05% … 5.43%
 Time  (median):     2.609 s               β”Š GC (median):    2.84%
 Time  (mean Β± Οƒ):   2.609 s Β± 132.008 ms  β”Š GC (mean Β± Οƒ):  2.84% Β± 3.81%

  β–ˆ                                                        β–ˆ
  β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  2.52 s         Histogram: frequency by time          2.7 s <

 Memory estimate: 1.49 GiB, allocs estimate: 9.

Anything else I should try?

1 Like

With that much memory movement, are you sure you’re not seeing some GC artifact? I.e., Python turns off GC when you time it; Julia maybe not?

We are clearly not getting the same results; sorry about that.
What I don’t understand is that in another thread I linked above, you showed NumPy outperforming Julia. So it seems that you have inconsistent results.

Did you notice the different CPU (that was AVX2, this is AVX512)?

Which makes no sense at all :rofl:
Anyway, let’s put that to bed; I think we are both wasting our time on this one.

Is the library really specific to Intel processors, not x86 with AVX2 or AVX512 in general?