What version of Julia are you running? This is slated to improve significantly — by upwards of 6x on some platforms — on Julia v1.13 with julia#58267. I should check whether that's backportable to v1.12.
On my machine, v1.11 takes 16.732 ms, whereas the nightly is 3.437 ms.
On v1.11 and v1.10, you may see a slight advantage if you spell maximum(A) as reduce((x,y)->max(x,y), A). That’s effectively what the change above does, but it relies upon some other core compiler improvements to really take effect.
julia> using BenchmarkTools
julia> const A=rand(2026,51,251);
julia> @btime maximum(A)
16.587 ms (0 allocations: 0 bytes)
0.999999980621013
julia> @btime reduce((x,y)->max(x,y), A)
14.862 ms (0 allocations: 0 bytes)
0.999999980621013
julia> VERSION
v"1.11.5"
You can get the performance benefit of the nightly changes by just writing a loop:
julia> function mymaximum(A)
           if isempty(A)
               error()
           end
           acc = typemin(eltype(A))
           @simd for x ∈ A
               acc = max(acc, x)
           end
           acc
       end;
julia> @btime mymaximum(A)
3.896 ms (0 allocations: 0 bytes)
0.9999999569932408
I don’t think it’s that unreasonable if NumPy happens to be faster than Julia at some things. It’s just benchmarking one compiled kernel against another. NumPy tends to have a lot of really good arch-specific SIMD vectorizations.
Sure. But it’s also not unreasonable to ask why Julia ended up slower, and whether there’s a fix. Julia tends to attract speed demons, and part of what makes the language so fun to use (for some people) is that it’s so easy and engaging to tinker with performance problems.
It is interesting that maximum in Julia is so slow relative to sum:
julia> const A = rand(2026, 51, 251);
julia> @btime sum($A);
4.212 ms (0 allocations: 0 bytes)
julia> @btime maximum($A);
18.033 ms (0 allocations: 0 bytes)
whereas in Python they are similar (which makes sense, I think):
In [1]: import numpy as np
In [2]: A = np.random.rand(2026, 51, 251)
In [4]: %timeit np.sum(A)
9.47 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit np.max(A)
10.7 ms ± 5.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is on an M1 Pro - Python 3.13.0 vs Julia 1.11.5
Reductions over max and min dispatched (on the type of those exact functions) to a specialized kernel that worked around LLVM's inability to SIMD-vectorize them: the particular floating-point behaviors we need weren't exposed by LLVM in a SIMD-compatible way. Sums, on the other hand, don't (and didn't) have the same difficulty. For example, on v1.6 the generic implementation is catastrophically slower than this specialized kernel:
julia> VERSION
v"1.6.7"
julia> using BenchmarkTools
julia> const A=rand(2026,51,251);
julia> @btime maximum(A)
22.307 ms (0 allocations: 0 bytes)
0.9999999629120224
julia> @btime reduce((x,y)->max(x,y), A)
124.803 ms (0 allocations: 0 bytes)
0.9999999629120224
julia> @btime foldl(max, A)
125.503 ms (0 allocations: 0 bytes)
0.9999999629120224
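To make the floating-point difficulty concrete: Julia's `max` must propagate NaN and order signed zeros, which is more than the plain vectorized compare-and-select LLVM would emit for a naive loop gives you. A quick REPL check:

```julia
julia> max(1.0, NaN)   # NaN must win, unlike a plain `<` compare-and-select
NaN

julia> max(-0.0, 0.0)  # signed zeros must be ordered, even though -0.0 == 0.0
0.0
```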
FYI, the times get better again with LoopVectorization.jl:
julia> function mymaximum(A)
           if isempty(A)
               error()
           end
           acc = typemin(eltype(A))
           @turbo for i in eachindex(A)
               acc = max(acc, A[i])
           end
           acc
       end;
julia> @btime mymaximum($A)
4.104 ms (0 allocations: 0 bytes)
0.999999966901279
np.max calls into C to compute the reduction, and it's to be expected that it's faster than a naive pure-Julia implementation. The call chain is np.max -> numpy.core.fromnumeric.amax -> _wrapreduction -> ufunc.reduce, which lands in ufunc_object.c.
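You can check that claim directly: since np.max bottoms out in the ufunc reduction machinery, it should agree exactly with calling the maximum ufunc's reduce yourself (a small sketch; the array shape just mirrors the one used above):

```python
import numpy as np

# np.max is a thin Python wrapper: it forwards through
# numpy.core.fromnumeric.amax and _wrapreduction down to the
# C-level ufunc reduction, so it computes the same thing as
# calling the maximum ufunc's reduce directly.
A = np.random.rand(2026, 51, 251)

# axis=None tells ufunc.reduce to reduce over all axes at once.
assert np.max(A) == np.maximum.reduce(A, axis=None)
```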