NumPy is faster than Julia 1.11 for the maximum function

import numpy as np

A = np.random.rand(2026, 51, 251)
%timeit np.max(A)
# 4.49 ms ± 89.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Julia code:

using BenchmarkTools
const A=rand(2026,51,251)
@btime maximum(A)
#   6.946 ms (0 allocations: 0 bytes)

I am using IPython on Linux.

So in this example, NumPy is about 55% faster than Julia.

Any idea?

I wonder if this is related to Madvise HUGEPAGE for large arrays · Issue #56521 · JuliaLang/julia · GitHub.

Could be a threads thing.

julia> @btime maximum(A);
  9.027 ms (0 allocations: 0 bytes)

julia> using OhMyThreads

julia> @btime treduce(max, A);
  3.625 ms (70 allocations: 3.86 KiB)
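
For comparison, here's a minimal Base-only sketch of the same chunked idea (tmaximum is a hypothetical name, and this assumes Julia was started with multiple threads, e.g. julia -t auto):

julia> function tmaximum(A)
           isempty(A) && throw(ArgumentError("array must be non-empty"))
           # one contiguous chunk of linear indices per thread
           ranges = collect(Iterators.partition(eachindex(A), cld(length(A), Threads.nthreads())))
           parts = Vector{eltype(A)}(undef, length(ranges))
           Threads.@threads for i in eachindex(ranges)
               parts[i] = maximum(view(A, ranges[i]))  # serial maximum per chunk
           end
           maximum(parts)  # combine the per-chunk results
       end;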

What version of Julia are you running? This is slated to improve significantly (by upwards of 6x on some platforms) on Julia v1.13 with julia#58267. I should check whether that's backportable to v1.12.

On my machine, v1.11 takes 16.732 ms, whereas the nightly is 3.437 ms.

10 Likes

Version 1.11.5 (2025-04-14)

I tried with Julia nightly, and indeed, I achieve the same performance as with NumPy:

 4.569 ms (0 allocations: 0 bytes)
2 Likes

On v1.11 and v1.10, you may see a slight advantage if you spell maximum(A) as reduce((x,y)->max(x,y), A). That’s effectively what the change above does, but it relies upon some other core compiler improvements to really take effect.

julia> using BenchmarkTools

julia> const A=rand(2026,51,251);

julia> @btime maximum(A)
  16.587 ms (0 allocations: 0 bytes)
0.999999980621013

julia> @btime reduce((x,y)->max(x,y), A)
  14.862 ms (0 allocations: 0 bytes)
0.999999980621013

julia> VERSION
v"1.11.5"

You can get the performance benefit of the nightly changes by just writing a loop:

julia> function mymaximum(A)
           if isempty(A)
               error()
           end
           acc = typemin(eltype(A))
           @simd for x ∈ A
               acc = max(acc, x)
           end
           acc
       end;

julia> @btime mymaximum(A)
  3.896 ms (0 allocations: 0 bytes)
0.9999999569932408
2 Likes

I don’t think it’s that unreasonable if NumPy happens to be faster than Julia at some things. It’s just benchmarking one compiled kernel against another. NumPy tends to have a lot of really good arch-specific SIMD vectorizations.

Here is the code executed for max: numpy/numpy/_core/src/umath/loops_minmax.dispatch.c.src at 85708b3c0bd3d2fdbcb00cb08d0a7b323ce3af2d · numpy/numpy · GitHub

The actual Python max (this one: max(iterable, *[, default=obj, key=func]) -> value) is much slower.

1 Like

Sure. But it’s also not unreasonable to ask why Julia ended up slower, and if there’s a fix. Julia tends to attract speed demons, and part of what makes the language so fun to use (for some people) is that it’s so easy and engaging to tinker with performance problems.

5 Likes

Also fair!

1 Like

It is interesting that maximum in Julia is so slow relative to sum:

julia> const A = rand(2026, 51, 251);

julia> @btime sum($A);
  4.212 ms (0 allocations: 0 bytes)

julia> @btime maximum($A);
  18.033 ms (0 allocations: 0 bytes)

whereas in Python they are similar (which makes sense, I think):

In [1]: import numpy as np

In [2]: A = np.random.rand(2026, 51, 251)

In [4]: %timeit np.sum(A)
9.47 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit np.max(A)
10.7 ms ± 5.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is on an M1 Pro - Python 3.13.0 vs Julia 1.11.5

Even foldl is slightly faster:

julia> @btime foldl($max, $A)
  17.376 ms (0 allocations: 0 bytes)
0.999999966901279

julia> @btime maximum($A)
  17.850 ms (0 allocations: 0 bytes)
0.999999966901279

Reductions of max and min dispatched (by the type of those exact functions) to a specialized kernel that worked around LLVM's failure to enable SIMD. That's because the particular floating-point behaviors we need weren't immediately exposed by LLVM in a SIMD-compatible way. Sums, on the other hand, don't (and didn't) have the same difficulty. For example, on v1.6 the generic implementation is catastrophically slower than this specialized kernel:

julia> VERSION
v"1.6.7"

julia> using BenchmarkTools

julia> const A=rand(2026,51,251);

julia> @btime maximum(A)
  22.307 ms (0 allocations: 0 bytes)
0.9999999629120224

julia> @btime reduce((x,y)->max(x,y), A)
  124.803 ms (0 allocations: 0 bytes)
0.9999999629120224

julia> @btime foldl(max, A)
  125.503 ms (0 allocations: 0 bytes)
0.9999999629120224
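
Concretely, the floating-point behaviors that make a naive max loop hard to vectorize are NaN propagation and signed-zero ordering, both of which maximum guarantees:

julia> max(1.0, NaN)   # maximum must propagate NaN
NaN

julia> max(-0.0, 0.0)  # and must order -0.0 below 0.0
0.0

A plain SIMD reduction based on < would get both of these wrong, which is why the specialized kernel existed.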

FYI, the times get better again with LoopVectorization.jl:

julia> using LoopVectorization

julia> function mymaximum(A)
           if isempty(A)
               error()
           end
           acc = typemin(eltype(A))
           @turbo for i in eachindex(A)
               acc = max(acc, A[i])
           end
           acc
       end;

julia> @btime mymaximum($A)
  4.104 ms (0 allocations: 0 bytes)
0.999999966901279

Long live LoopVectorization <3

How does that compare against nightly on your architecture?

Nice! Seems it is about equal to LoopVectorization on 1.13.0-DEV.811.

1 Like

(A backport would be lovely :heart_eyes: )

Already marked to be backported to v1.12 (but not previous versions, because it needs other compiler changes to be effective).

1 Like

I actually have a whole write-up on this, but basically NumPy is doing fastmath by default.

So if you apply @fastmath to the max reduction, it should be equivalent.

And in the end, if we don't care about IEEE rules, we can do much better.
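
A minimal sketch of what that looks like as a loop (fastmaximum is a hypothetical name; note that fast-math gives up the NaN propagation and signed-zero ordering that maximum normally guarantees):

julia> function fastmaximum(A)
           acc = typemin(eltype(A))
           @simd for x in A
               acc = @fastmath max(acc, x)  # fast-math max is SIMD-friendly, but ignores NaN/-0.0 rules
           end
           acc
       end;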

My real problem with the Julia slowdown is with findmin and findmax, which are still a problem to this day.
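
For anyone who wants to reproduce that, the comparison is simply the following (timings omitted since they're machine-dependent):

julia> using BenchmarkTools

julia> @btime maximum($A);   # value-only reduction

julia> @btime findmax($A);   # also tracks the index, which is the part that stays slow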

8 Likes

np.max calls into C to compute the result, and it's to be expected that that's faster than a pure Julia implementation:
np.max → numpy.core.fromnumeric.amax → _wrapreduction → ufunc.reduce → ufunc_object.c

Wouldn't it be?