Bring Intel’s x86-simd-sort library to Julia

I don’t see anything suggesting that in the project’s README, and again, it’d be hard to artificially cripple performance on non-Intel CPUs in open-source code, as that would be in plain sight. The code uses intrinsics for specific targets (AVX2, AVX-512), but beyond that I don’t see how Intel CPUs would be advantaged. And unlike MKL, this isn’t a binary blob: people can compile and tweak it themselves with their favourite compiler. Furthermore, I doubt NumPy would take on a default dependency that cripples performance on many CPUs (MKL is an optional dependency).

1 Like

The original post explained quite clearly why AVX2 performance was bad in Julia:

zen4 issue : performance on amd 7950x ... · Issue #6 · intel/x86-simd-sort · GitHub


And OpenJDK adapted the code ( OpenJDK Merges Intel’s x86-simd-sort For Speeding Up Data Sorting 7~15x ).

And it has now been patched (commits on Mar 30, 2025) so that Zen 4 picks the optimized AVX2 version of SIMD sort and Zen 5 picks the AVX512 version, via 8317976: Optimize SIMD sort for AMD Zen 4 by rohitarulraj · Pull Request #24053 · openjdk/jdk · GitHub:

“”"
In JDK-8309130, Array sort was optimized using AVX512 SIMD instructions for x86_64. Currently, this optimization has been disabled for AMD Zen 4 [JDK-8317763] due to bad performance of compressstoreu.
Ref: Reddit - The heart of the internet.

This patch enables Zen 4 to pick optimized AVX2 version of SIMD sort and Zen 5 picks the AVX512 version.

JTREG Tests: Completed Tier1 & Tier2 tests on Zen4 & Zen5 - No Regressions.
“”"

As requested, here is my code:

/* test.cpp */

#include "src/x86simdsort-static-incl.h"

// g++ test.cpp -mavx512f -mavx512dq -mavx512vl -O3 -o libsort.dll -shared -I./x86-simd-sort -fopenmp -DXSS_USE_OPENMP

extern "C" void qsort_float(float *arr, size_t size) {
    x86simdsortStatic::qsort(arr, size, false, true);
}

extern "C" void qsort_double(double *arr, size_t size) {
    x86simdsortStatic::qsort(arr, size, false, true);
}

/* test.jl */

using BenchmarkTools, Random

qsort!(x::Vector{Cfloat}) = @ccall "./libsort.dll".qsort_float(x::Ptr{Cfloat}, length(x)::Csize_t)::Cvoid
qsort!(x::Vector{Cdouble}) = @ccall "./libsort.dll".qsort_double(x::Ptr{Cdouble}, length(x)::Csize_t)::Cvoid

@benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float64, 10^8))
BenchmarkTools.Trial: 5 samples with 1 evaluation per sample.
 Range (min … max):  936.871 ms … 975.663 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     947.514 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   950.403 ms ±  15.489 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  █            █       █                                   █  
  █▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  937 ms           Histogram: frequency by time          976 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark qsort!(x) setup=(Random.seed!(123); x = rand(Float64, 10^8)) # with OMP_NUM_THREADS = 8
BenchmarkTools.Trial: 12 samples with 1 evaluation per sample.
 Range (min … max):  211.874 ms … 227.819 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     218.187 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   218.676 ms ±   5.240 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██ █        █     █ █      █  █          █   █   █          █  
  ██▁█▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁█▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁█ ▁
  212 ms           Histogram: frequency by time          228 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark sort!(x) setup=(Random.seed!(123); x = rand(Float64, 10^8))
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 5.034 s (0.01% GC) to evaluate,
 with a memory estimate of 762.95 MiB, over 6 allocations.

For reference, this is running on Windows with Julia 1.11.6 on an Intel Xeon Gold 6254 CPU @ 3.1 GHz.

And for context, I have thousands of vectors of type double and size 10^8 to sort and run statistics and other calculations on.
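
Since those vectors are independent, one option (a sketch, reusing the single-threaded qsort! wrapper above) is to parallelize across vectors with Julia threads instead of inside each sort, keeping OMP_NUM_THREADS=1 so the two threading layers don’t oversubscribe:

using Base.Threads

# Sort many independent vectors in parallel with Julia threads (start
# Julia with e.g. `julia -t 8`). Each qsort! call then runs the
# single-threaded C++ path.
function sort_all!(vs::Vector{Vector{Cdouble}})
    @threads for v in vs
        qsort!(v)
    end
    return vs
end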

1 Like

I wonder if this would be better with Google’s Highway, another manually vectorized C++ library. It seems to take processors into account more explicitly, and its vqsort showed some advantage in a benchmark by ipnsort’s author a couple of years ago. I think any serious benchmark should consider various element types (when applicable), input sizes, and implementations like that one; a sketch of such a matrix follows below. If its conclusions generalize, then we should expect better performance from manually vectorized libraries than from the “generic comparisons” that Julia’s sort probably falls under. Obviously we’d still need generic comparisons for generic types.
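
For what it’s worth, here is a sketch of that kind of matrix, reusing the qsort! wrapper from the post above (a vqsort wrapper would slot in the same way but is omitted; treat the printed numbers as indicative only):

using BenchmarkTools

# Benchmark matrix: element types × input sizes, Base.sort! vs the
# x86-simd-sort wrapper. evals=1 so each evaluation gets a fresh copy.
for T in (Float32, Float64), n in (10^4, 10^6, 10^8)
    x = rand(T, n)
    t_base = @belapsed sort!(y) setup=(y = copy($x)) evals=1
    t_simd = @belapsed qsort!(y) setup=(y = copy($x)) evals=1
    println(T, ", n = ", n, ": Base ", t_base, " s, x86-simd-sort ", t_simd, " s")
end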

That’s a relevant issue, but a performance trap in a single instruction on a single series of CPUs (the other links you shared suggest Zen 5 is fine), unrelated to whoever wrote the code (it’s a CPU issue, not a code one), is very different from MKL-style “oh, you aren’t using a CPU produced by Intel, too bad, I’ll make this code run very slowly just to make your CPU look bad”.

Bringing MKL into this discussion is irrelevant and adds no value.
If it is just another opportunity to complain and whine about Intel, and to use words like “shit” as seen above, that is childish and not constructive, and I don’t think it has a place in this forum. (Or maybe I am mistaken about this place and I shouldn’t be here.)
Especially since you are talking about a company that has provided funding to Julia.

Y’all. What’s with all the bickering here? Please, let’s try to keep this concrete, actionable, and respectful of all who are here. It doesn’t matter who anyone is here; we require respect for all regardless.

  • There’s a C++ library that has better performance on some CPUs
  • How do we get that in Julia? (one low-tech path is sketched below)
13 Likes
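
One low-tech way to get there, as a sketch: vendor the compiled shared library with a small package and resolve it with Libdl at load time. The module name and file layout here are hypothetical; a more robust route would build the C++ code with BinaryBuilder and ship it as a JLL.

module X86SimdSortWrapper

using Libdl

# Resolve the vendored shared library once at load time and cache the
# symbol, so calls work regardless of the current working directory.
const qsort_double_ptr = Ref{Ptr{Cvoid}}(C_NULL)

function __init__()
    libname = Sys.iswindows() ? "libsort.dll" : "libsort.so"
    lib = dlopen(joinpath(@__DIR__, libname))
    qsort_double_ptr[] = dlsym(lib, :qsort_double)
end

qsort!(x::Vector{Cdouble}) =
    ccall(qsort_double_ptr[], Cvoid, (Ptr{Cdouble}, Csize_t), x, length(x))

end # module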