Numpy.sort vs Julia sort

I recently tried numpy.sort and it is really fast for a vector of 100 million floats. It seems to be parallelized by default and/or to be using AVX-512. It is about 5 times faster than Julia’s sort on my system.
If someone has a benchmark vs Julia (both single-threaded), I would be interested. I did not manage to disable multithreading in my Python setup (if it is using it at all).

In my experience (not a comprehensive answer, and possibly outdated by now), Julia’s sorting performance is quite good in the “standard” case of a single-threaded sort or sort! of a vector. I think you should find that in this context it will meet or exceed numpy on most inputs.

OTOH, I think sortperm, the equivalent of np.argsort, has plenty of room for improvement, and partial sorts are also less fully optimized.
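For anyone who wants to check, here is a minimal sketch of one way to compare the two with BenchmarkTools and PyCall (timings are machine-dependent, of course):

julia> using BenchmarkTools, PyCall

julia> numpy = pyimport("numpy");

julia> a = randn(Float32, 10^6);

julia> apy = pycall(numpy.array, PyObject, a); # convert to numpy array

julia> @btime sortperm($a);

julia> @btime pycall($(numpy.argsort), PyObject, $apy);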

1 Like

In contrast, numpy.sort seems about 10x slower than Julia for 10^8 floats on my Apple M3 laptop:

julia> using BenchmarkTools, PyCall

julia> numpy = pyimport("numpy");

julia> a = randn(Float32, 10^8);

julia> apy = pycall(numpy.array, PyObject, a); # convert to npy array

julia> @btime sort($a); @btime pycall($(numpy.sort), PyObject, $apy);
  662.400 ms (12 allocations: 572.23 MiB)
  7.028 s (1 allocation: 16 bytes)

(I’m benchmarking numpy.sort called from Julia for convenience here, but this is set up so that no type-conversion overhead for a is included in the timing. Note that PyCall.jl calls Python directly via the libpython C API, so there is minimal overhead once the type conversions are out of the way.)
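One way to sanity-check the overhead claim is to time a trivial Python call through the same path. A hypothetical sketch, using Python’s built-in len, whose own work is negligible:

julia> pylen = pybuiltin("len");

julia> @btime pycall($pylen, Int, $apy); # per-call overhead should be microseconds at most, negligible next to a multi-second sort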

(It doesn’t look like numpy is using multithreaded sort on my machine? I tried setting a bunch of environment variables to disable numpy multithreading and it made no difference.)
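The usual suspects are the standard thread-pool knobs, which have to be set before numpy is first imported; whether numpy’s sort consults any of them at all is exactly what’s unclear:

julia> ENV["OMP_NUM_THREADS"] = "1";      # OpenMP thread pool

julia> ENV["OPENBLAS_NUM_THREADS"] = "1"; # OpenBLAS

julia> ENV["MKL_NUM_THREADS"] = "1";      # Intel MKL, if numpy is linked against it

julia> using PyCall; numpy = pyimport("numpy");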

7 Likes

I’ll just add that I’m also seeing that Julia sort is faster than numpy.sort by about 5x for 100 million 64-bit floats. I opened up htop as well to confirm that it was single-threaded performance (only one core spiked).

4 Likes

On a machine with AVX2 I can confirm numpy.sort is faster (but definitely not 10x):

julia> using BenchmarkTools, PyCall

julia> numpy = pyimport("numpy");

julia> a = randn(Float32, 10^8);

julia> apy = pycall(numpy.array, PyObject, a); # convert to npy array

julia> @btime sort($a); @btime pycall($(numpy.sort), PyObject, $apy);
  2.611 s (14 allocations: 850.76 MiB)
  954.375 ms (1 allocation: 16 bytes)

julia> numpy.__version__
"2.1.3"

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 Ă— AMD Ryzen 9 3950X 16-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

There’s a big difference between numpy v1 and v2 (the latter being much faster). Looking at htop, I see only a single core being used the whole time.

6 Likes

If I use @giordano’s code, I get similar results: 2.752 s for Julia and 667.190 ms for numpy. My CPU is an 8 × Intel Xeon Gold 6254 @ 3.10 GHz with AVX-512, and I am using numpy v2. I guess Apple M processors don’t support AVX, so we can’t see that kind of performance there?
I also guess that Julia’s sort does not use AVX? (Possibly room for improvement?)

1 Like

Yes, AVX is specific to x86, though Apple’s chips have their own vector instructions (ARM NEON).

NumPy wraps Intel’s x86-simd-sort C++ library.
Intel Publishes Blazing Fast AVX-512 Sorting Library, Numpy Switching To It For 10~17x Faster Sorts - Phoronix
Many if not all of those contributions were made by an Intel developer, who also wrapped at least one other Intel library to support AVX-512.
Intel Contributes AVX-512 Optimizations To Numpy, Yields Massive Speedups - Phoronix
Though it’s worth noting that this isn’t necessarily the fastest implementation.
sort-research-rs/writeup/intel_avx512/text.md at main · Voultapher/sort-research-rs · GitHub

2 Likes

cc @Lilith, who has done a lot of work on Julia’s sorting.

On a machine with AVX2 I can confirm numpy.sort is faster (but definitely not 10x):

In contrast, numpy.sort seems about 10x slower than Julia for 10^8 floats on my Apple M3 laptop:

This is because I was optimizing sorting while developing on an Apple machine without AVX2. If we had cross-platform benchmark infrastructure in CI, Julia would likely be 10x faster in both cases.

6 Likes

Shouldn’t LLVM be able to optimize for AVX2? Is this kind of code hand-coded?

Compilers are good at auto-vectorizing BLAS-like algorithms (e.g. dot products), where the loop structure is regular and data-independent. Sorting involves data-dependent control flow and cross-lane data movement, so I suspect that SIMD sorting algorithms need to be hand-coded.
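To illustrate: the basic building block of a SIMD sort is the branchless compare-exchange of a sorting network. The branch-free loop below is the kind of thing LLVM can auto-vectorize into vector min/max instructions; to be clear, this is a sketch of the general idea, not how Julia’s or numpy’s sort is implemented, and the cross-lane shuffles of a full bitonic network are what libraries like x86-simd-sort hand-code with intrinsics:

function even_pass!(v::AbstractVector)
    # one round of odd-even transposition sort: compare-exchange the
    # disjoint pairs (1,2), (3,4), ... with branch-free min/max, which
    # LLVM can lower to vector min/max instructions
    @inbounds for i in 1:2:length(v)-1
        a, b = v[i], v[i+1]
        v[i], v[i+1] = min(a, b), max(a, b)
    end
    return v
end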

1 Like