Bring Intel’s x86-simd-sort library to Julia

I don’t see anything suggesting that in the project’s README, and again, it’d be hard to artificially cripple performance on non-Intel CPUs in open-source code, as that would be in plain sight. The code uses intrinsics for specific targets (AVX2, AVX-512), but beyond that I don’t see how Intel CPUs would be advantaged. And unlike MKL, this isn’t a binary blob: people can compile and tweak it themselves with their favourite compiler. Furthermore, I doubt NumPy would take on a default dependency that cripples performance on many CPUs (MKL is an optional dependency).

1 Like

The original post explained quite clearly why AVX2 performance was bad in Julia:

zen4 issue : performance on amd 7950x ... · Issue #6 · intel/x86-simd-sort · GitHub


And OpenJDK adapted the code ( OpenJDK Merges Intel’s x86-simd-sort For Speeding Up Data Sorting 7~15x ).

And it has now been patched (commits on Mar 30, 2025) so that Zen 4 picks the optimized AVX2 version of SIMD sort and Zen 5 picks the AVX512 version, via 8317976: Optimize SIMD sort for AMD Zen 4 by rohitarulraj · Pull Request #24053 · openjdk/jdk · GitHub:

“”"
In JDK-8309130, Array sort was optimized using AVX512 SIMD instructions for x86_64. Currently, this optimization has been disabled for AMD Zen 4 [JDK-8317763] due to bad performance of compressstoreu.
Ref: Reddit - The heart of the internet.

This patch enables Zen 4 to pick optimized AVX2 version of SIMD sort and Zen 5 picks the AVX512 version.

JTREG Tests: Completed Tier1 & Tier2 tests on Zen4 & Zen5 - No Regressions.
“”"

As requested, here is my code:

/* test.cpp */

#include "src/x86simdsort-static-incl.h"

// g++ test.cpp -mavx512f -mavx512dq -mavx512vl -O3 -o libsort.dll -shared -I./x86-simd-sort -fopenmp -DXSS_USE_OPENMP

extern "C" void qsort_float(float *arr, size_t size) {
    x86simdsortStatic::qsort(arr, size, false, true);
}

extern "C" void qsort_double(double *arr, size_t size) {
    x86simdsortStatic::qsort(arr, size, false, true);
}

/* test.jl */

using BenchmarkTools, Random

qsort!(x::Vector{Cfloat}) = @ccall "./libsort.dll".qsort_float(x::Ptr{Cfloat}, length(x)::Csize_t)::Cvoid
qsort!(x::Vector{Cdouble}) = @ccall "./libsort.dll".qsort_double(x::Ptr{Cdouble}, length(x)::Csize_t)::Cvoid

@benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float64, 10^8))
BenchmarkTools.Trial: 5 samples with 1 evaluation per sample.
 Range (min … max):  936.871 ms … 975.663 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     947.514 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   950.403 ms ±  15.489 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  █            █       █                                   █  
  █▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  937 ms           Histogram: frequency by time          976 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float64, 10^8)) # with OMP_NUM_THREADS = 8
BenchmarkTools.Trial: 12 samples with 1 evaluation per sample.
 Range (min … max):  211.874 ms … 227.819 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     218.187 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   218.676 ms ±   5.240 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██ █        █     █ █      █  █          █   █   █          █  
  ██▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁█▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁█ ▁
  212 ms           Histogram: frequency by time          228 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float64, 10^8))
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 5.034 s (0.01% GC) to evaluate,
 with a memory estimate of 762.95 MiB, over 6 allocations.

For reference, this is running on Windows with Julia 1.11.6 on an Intel Xeon Gold 6254 CPU @ 3.1 GHz.

And for context, I have thousands of vectors of type double and size 10^8 to sort and run statistics and other calculations on.
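
Since those vectors are independent, one option (a sketch, reusing the single-threaded qsort! wrapper above) is to parallelize across vectors with Julia threads instead of inside each sort, keeping OMP_NUM_THREADS=1 so the two threading layers don’t oversubscribe:

using Base.Threads

# Sort many independent vectors in parallel with Julia threads (start
# Julia with e.g. `julia -t 8`). Each qsort! call then runs the
# single-threaded C++ path.
function sort_all!(vs::Vector{Vector{Cdouble}})
    @threads for v in vs
        qsort!(v)
    end
    return vs
end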

1 Like

I wonder if this would be better with Google’s Highway, another manually vectorized C++ library. It seems to take processors into account more explicitly, and its vqsort showed some advantage in a benchmark by ipnsort’s author a couple of years ago. I think any serious benchmark should consider various element types (when applicable), input sizes, and implementations like that one; a sketch of such a matrix follows below. If its conclusions generalize, then we should expect better performance from manually vectorized libraries than from the “generic comparisons” that Julia’s sort probably falls under. Obviously we’d still need generic comparisons for generic types.
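
For what it’s worth, here is a sketch of that kind of matrix, reusing the qsort! wrapper from the post above (a vqsort wrapper would slot in the same way but is omitted; treat the printed numbers as indicative only):

using BenchmarkTools

# Benchmark matrix: element types × input sizes, Base.sort! vs the
# x86-simd-sort wrapper. evals=1 so each evaluation gets a fresh copy.
for T in (Float32, Float64), n in (10^4, 10^6, 10^8)
    x = rand(T, n)
    t_base = @belapsed sort!(y) setup=(y = copy($x)) evals=1
    t_simd = @belapsed qsort!(y) setup=(y = copy($x)) evals=1
    println(T, ", n = ", n, ": Base ", t_base, " s, x86-simd-sort ", t_simd, " s")
end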

That’s a relevant issue, but a performance trap in a single instruction on a single series of CPUs (the other links you shared suggest Zen 5 is fine), unrelated to whoever wrote the code (it’s a CPU issue, not a code one), is very different from MKL-style “oh, you aren’t using a CPU produced by Intel, too bad, I’ll make this code run very slowly just to make your CPU look bad”.

Bringing MKL into this discussion is irrelevant and adds no value.
If it is just another opportunity to complain and whine about Intel, and to use words like “shit” as seen above, that is childish and not constructive, and I don’t think it has a place in this forum. (Or maybe I am mistaken about this place and I shouldn’t be here.)
Especially since you are talking about a company that has provided funding to Julia.

Y’all. What’s with all the bickering here? Please, let’s try to keep this concrete, actionable, and respectful of all who are here. It doesn’t matter who anyone is here; we require respect for all regardless.

  • There’s a C++ library that has better performance on some CPUs
  • How do we get that in Julia? (one low-tech path is sketched below)
13 Likes
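
One low-tech way to get there, as a sketch: vendor the compiled shared library with a small package and resolve it with Libdl at load time. The module name and file layout here are hypothetical; a more robust route would build the C++ code with BinaryBuilder and ship it as a JLL.

module X86SimdSortWrapper

using Libdl

# Resolve the vendored shared library once at load time and cache the
# symbol, so calls work regardless of the current working directory.
const qsort_double_ptr = Ref{Ptr{Cvoid}}(C_NULL)

function __init__()
    libname = Sys.iswindows() ? "libsort.dll" : "libsort.so"
    lib = dlopen(joinpath(@__DIR__, libname))
    qsort_double_ptr[] = dlsym(lib, :qsort_double)
end

qsort!(x::Vector{Cdouble}) =
    ccall(qsort_double_ptr[], Cvoid, (Ptr{Cdouble}, Csize_t), x, length(x))

end # module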