I recently tried `numpy.sort` and it is really fast for a vector of 100 million floats. It seems to be parallelized by default and/or to be using AVX-512. It is about 5 times faster than Julia's `sort` on my system.
If someone has a benchmark vs. Julia (both single-threaded) I would be interested. I did not manage to disable multithreading in my Python setup (if it is using it at all).
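For what it's worth, the usual way to try is to pin the common thread pools *before* numpy is imported (a minimal sketch from the Julia side via PyCall; these variables only matter if numpy's sort is actually threaded through one of these pools, which it may not be):

```julia
# Sketch: pin common threading back ends to one thread before importing numpy.
# These only help if numpy routes work through one of these pools at all.
ENV["OMP_NUM_THREADS"] = "1"        # OpenMP
ENV["MKL_NUM_THREADS"] = "1"        # Intel MKL
ENV["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS

using PyCall
numpy = pyimport("numpy")           # import *after* setting the variables
```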
In my experience (not a comprehensive answer, and possibly outdated by now), Julia’s sorting performance is quite good in the “standard” case of a single-threaded `sort` or `sort!` of a vector. I think you should find that in this context it will meet or exceed numpy on most inputs.
OTOH, I think `sortperm`, equivalent to `np.argsort`, has plenty of room for improvement, and partial sorts are also less fully optimized.
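If you want to see the `sort`-vs-`sortperm` gap on your own machine, a quick check (a minimal sketch using BenchmarkTools; the exact ratio will vary with input and version):

```julia
using BenchmarkTools

x = randn(Float32, 10^7)

@btime sort($x);      # sort the values
@btime sortperm($x);  # compute the permutation that sorts x (typically slower)
```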
In contrast, `numpy.sort` seems about 10x slower than Julia for 10^8 floats on my Apple M3 laptop:
```julia
julia> using BenchmarkTools, PyCall

julia> numpy = pyimport("numpy");

julia> a = randn(Float32, 10^8);

julia> apy = pycall(numpy.array, PyObject, a); # convert to a NumPy array

julia> @btime sort($a); @btime pycall($(numpy.sort), PyObject, $apy);
  662.400 ms (12 allocations: 572.23 MiB)
  7.028 s (1 allocation: 16 bytes)
```
(I’m benchmarking `numpy.sort` called from Julia for convenience here, but this is set up so that no type-conversion overhead for `a` is included in the timing. Note that PyCall.jl calls Python directly via the `libpython` C API, so there is minimal overhead once the type conversions are out of the way.)
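If you want to convince yourself that the conversion cost really is excluded, you can time that step on its own (a minimal sketch; the conversion is a one-time cost paid before the sort benchmark):

```julia
using BenchmarkTools, PyCall
numpy = pyimport("numpy")

a = randn(Float32, 10^8)

# Time just the Julia -> NumPy conversion from the benchmark above;
# this cost is paid once, outside the sort timings.
@btime pycall($(numpy.array), PyObject, $a);
```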
(It doesn’t look like numpy is using a multithreaded sort on my machine? I tried setting a bunch of environment variables to disable numpy multithreading and it made no difference.)
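One way to check what the thread pools in the process are actually set to (a sketch assuming the Python package `threadpoolctl` is installed; note that NumPy's sort itself may be single-threaded regardless of these pools):

```julia
using PyCall

# threadpoolctl reports the OpenMP/BLAS thread pools loaded in the process;
# if numpy's sort were threaded through one of them, it would show up here.
threadpoolctl = pyimport("threadpoolctl")
for info in threadpoolctl.threadpool_info()
    println(info)
end
```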
I’ll just add that I’m also seeing that Julia's `sort` is faster than `numpy.sort` by about 5x for 100 million 64-bit floats. I opened up `htop` as well to confirm that it was single-threaded performance (only one core spiked).
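To rule out Julia-side threading too, it is easy to confirm from within the session (a minimal sketch; Julia's `sort` is not multithreaded by default anyway):

```julia
# A default Julia session has one thread; anything else would need
# `julia --threads=N` or JULIA_NUM_THREADS to be set.
println(Threads.nthreads())   # prints 1 for a default (single-threaded) session
```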
On a machine with AVX2 I can confirm `numpy.sort` is faster (but definitely not 10x):
```julia
julia> using BenchmarkTools, PyCall

julia> numpy = pyimport("numpy");

julia> a = randn(Float32, 10^8);

julia> apy = pycall(numpy.array, PyObject, a); # convert to a NumPy array

julia> @btime sort($a); @btime pycall($(numpy.sort), PyObject, $apy);
  2.611 s (14 allocations: 850.76 MiB)
  954.375 ms (1 allocation: 16 bytes)

julia> numpy.__version__
"2.1.3"

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 3950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)
```
There’s a big difference between numpy v1 and v2 (the latter being much faster). Looking at `htop`, I see only a single core being used the whole time.
If I use @giordano's code, I get similar results: 2.752 s for Julia and 667.190 ms for numpy. My CPU is an 8 × Intel Xeon Gold 6254 @ 3.10 GHz with AVX-512, and I am using numpy v2. I guess Apple M processors don’t support AVX, so we can’t see that kind of performance there?
I also guess that Julia's sort does not use AVX? (Possibly room for improvement?)
Yes, AVX is specific to x86, though Apple’s chips have their own vector instructions (NEON).
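If you want to check what your own machine supports, on Linux the CPU flags are exposed in `/proc/cpuinfo` (a minimal, Linux-only sketch; the flag names below are the common ones for AVX, AVX2, and the AVX-512 foundation set):

```julia
# Linux-only sketch: look for AVX/AVX2/AVX-512 flags in /proc/cpuinfo.
flags = split(first(l for l in eachline("/proc/cpuinfo") if startswith(l, "flags")))
for f in ("avx", "avx2", "avx512f")
    println(f, " => ", f in flags)
end
```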
NumPy wraps Intel’s x86-simd-sort C++ library.
Intel Publishes Blazing Fast AVX-512 Sorting Library, Numpy Switching To It For 10~17x Faster Sorts - Phoronix
Many if not all of those contributions were made by an Intel developer, who also wrapped at least one other Intel library to support AVX-512.
Intel Contributes AVX-512 Optimizations To Numpy, Yields Massive Speedups - Phoronix
Though it’s worth noting that this isn’t necessarily the fastest implementation.
sort-research-rs/writeup/intel_avx512/text.md at main · Voultapher/sort-research-rs · GitHub
cc @Lilith, who has done a lot of work on Julia’s sorting.
> On a machine with AVX2 I can confirm `numpy.sort` is faster (but definitely not 10x):
> In contrast, `numpy.sort` seems about 10x slower than Julia for 10^8 floats on my Apple M3 laptop:
This is because I was optimizing sorting while developing on an Apple machine without AVX2. If we had cross-platform benchmark infrastructure in CI, Julia would likely be 10x faster in both cases.
Shouldn’t LLVM be able to optimize for AVX2? Is this kind of code hand-coded?
Compilers are good at auto-vectorizing BLAS-like algorithms (e.g. dot products). I suspect that sorting algorithms need to be hand-coded for SIMD.
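To illustrate the difference (a minimal sketch; whether each loop actually vectorizes depends on your CPU and Julia/LLVM version, which you can check with `@code_llvm`):

```julia
# A straight-line reduction like this typically auto-vectorizes:
function mydot(x, y)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end

# A data-dependent, branchy loop like an insertion-sort inner loop does not;
# SIMD sorting kernels (as in x86-simd-sort) are written by hand instead.
function insertion_sort!(v)
    @inbounds for i in 2:length(v)
        x = v[i]
        j = i - 1
        while j >= 1 && v[j] > x   # the branch depends on the data itself
            v[j+1] = v[j]
            j -= 1
        end
        v[j+1] = x
    end
    return v
end

# Inspect the generated IR to see which loop got <N x float> vector code:
# @code_llvm mydot(rand(Float32, 100), rand(Float32, 100))
```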