What version of Julia are you running? This is slated to improve significantly — by upwards of 6x on some platforms — on Julia v1.13 with julia#58267. I should check whether that's backportable to v1.12.
On my machine, v1.11 takes 16.732 ms, whereas the nightly is 3.437 ms.
On v1.11 and v1.10, you may see a slight advantage if you spell maximum(A) as reduce((x,y)->max(x,y), A). That’s effectively what the change above does, but it relies upon some other core compiler improvements to really take effect.
julia> using BenchmarkTools
julia> const A=rand(2026,51,251);
julia> @btime maximum(A)
16.587 ms (0 allocations: 0 bytes)
0.999999980621013
julia> @btime reduce((x,y)->max(x,y), A)
14.862 ms (0 allocations: 0 bytes)
0.999999980621013
julia> VERSION
v"1.11.5"
You can get the performance benefit of the nightly changes by just writing a loop:
julia> function mymaximum(A)
           if isempty(A)
               error()
           end
           acc = typemin(eltype(A))
           @simd for x ∈ A
               acc = max(acc, x)
           end
           acc
       end;
julia> @btime mymaximum(A)
3.896 ms (0 allocations: 0 bytes)
0.9999999569932408
I don’t think it’s that unreasonable if NumPy happens to be faster than Julia at some things. It’s just benchmarking one compiled kernel against another. NumPy tends to have a lot of really good arch-specific SIMD vectorizations.
Sure. But it’s also not unreasonable to ask why Julia ended up slower, and whether there’s a fix. Julia tends to attract speed demons, and part of what makes the language so fun to use (for some people) is that it’s so easy and engaging to tinker with performance problems.
It is interesting that maximum in Julia is so slow relative to sum:
julia> const A = rand(2026, 51, 251);
julia> @btime sum($A);
4.212 ms (0 allocations: 0 bytes)
julia> @btime maximum($A);
18.033 ms (0 allocations: 0 bytes)
whereas in Python they are similar (which makes sense, I think):
In [1]: import numpy as np
In [2]: A = np.random.rand(2026, 51, 251)
In [4]: %timeit np.sum(A)
9.47 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit np.max(A)
10.7 ms ± 5.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is on an M1 Pro - Python 3.13.0 vs Julia 1.11.5
Reductions over max and min dispatched (on the type of those exact functions) to a specialized kernel that worked around LLVM's inability to SIMD-vectorize them: the particular floating-point behaviors we need weren't exposed by LLVM in a SIMD-compatible way. Sums, on the other hand, don't (and didn't) have the same difficulty. For example, on v1.6 the generic implementation is catastrophically slower than this specialized kernel:
julia> VERSION
v"1.6.7"
julia> using BenchmarkTools
julia> const A=rand(2026,51,251);
julia> @btime maximum(A)
22.307 ms (0 allocations: 0 bytes)
0.9999999629120224
julia> @btime reduce((x,y)->max(x,y), A)
124.803 ms (0 allocations: 0 bytes)
0.9999999629120224
julia> @btime foldl(max, A)
125.503 ms (0 allocations: 0 bytes)
0.9999999629120224
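To make the floating-point difficulty concrete: Julia's `max` must propagate NaN and order signed zeros, which is more than the plain vectorized compare-and-select LLVM would emit for a naive loop gives you. A quick REPL check:

```julia
julia> max(1.0, NaN)   # NaN must win, unlike a plain `<` compare-and-select
NaN

julia> max(-0.0, 0.0)  # signed zeros must be ordered, even though -0.0 == 0.0
0.0
```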
FYI, the times get better again with LoopVectorization.jl:
julia> function mymaximum(A)
           if isempty(A)
               error()
           end
           acc = typemin(eltype(A))
           @turbo for i in eachindex(A)
               acc = max(acc, A[i])
           end
           acc
       end;
julia> @btime mymaximum($A)
4.104 ms (0 allocations: 0 bytes)
0.999999966901279
np.max calls into C to compute the reduction, and it's to be expected that it's faster than a naive pure-Julia implementation. The call chain is np.max -> numpy.core.fromnumeric.amax -> _wrapreduction -> ufunc.reduce, which lands in ufunc_object.c.
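You can check that claim directly: since np.max bottoms out in the ufunc reduction machinery, it should agree exactly with calling the maximum ufunc's reduce yourself (a small sketch; the array shape just mirrors the one used above):

```python
import numpy as np

# np.max is a thin Python wrapper: it forwards through
# numpy.core.fromnumeric.amax and _wrapreduction down to the
# C-level ufunc reduction, so it computes the same thing as
# calling the maximum ufunc's reduce directly.
A = np.random.rand(2026, 51, 251)

# axis=None tells ufunc.reduce to reduce over all axes at once.
assert np.max(A) == np.maximum.reduce(A, axis=None)
```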