`minimum()` 3x - 6x slower than `numpy.min()`

I think there are enough issues on GitHub already and nothing is moving, so I'm not going to spam there; instead I'm posting here to see if anyone has a temporary workaround:

julia> using Chairmarks

julia> @be rand(Float32, 512000) minimum samples=500 evals=100
Benchmark: 73 samples with 100 evaluations
 min    128.636 μs
 median 130.828 μs
 mean   131.679 μs
 max    150.319 μs

julia> @be rand(Float64, 512000) minimum samples=500 evals=100
Benchmark: 73 samples with 100 evaluations
 min    128.678 μs
 median 131.226 μs
 mean   131.578 μs
 max    143.314 μs

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver4)
And numpy for comparison:

In [43]: a = numpy.float32(numpy.random.rand(512000))

In [44]: %timeit numpy.min(a)
23.5 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [45]: a = numpy.random.rand(512000)

In [46]: %timeit numpy.min(a)
40.7 µs ± 189 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [47]: numpy.__version__
Out[47]: '1.26.4'

Clearly, numpy is doing SIMD, because its speed changes with the element width (23.5 μs for float32 vs 40.7 μs for float64), while Julia's minimum takes ~129 μs regardless of the type.
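One way to verify this on the Julia side (a quick sketch: look for vector types like <8 x float> in the emitted IR; their absence suggests a scalar loop):

julia> using InteractiveUtils  # already loaded in the REPL; provides @code_llvm

julia> @code_llvm debuginfo=:none minimum(rand(Float32, 8))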

As for the GitHub issues, I honestly don't know which one would fix this, or which one should be fixed first.


Numpy is only semi-compliant with IEEE 754:

In [31]: numpy.min(numpy.array([1.0, numpy.nan, 0.5]))
Out[31]: nan

In [34]: numpy.min(numpy.array([0.0, -0.0]))
Out[34]: -0.0

In [35]: numpy.min(numpy.array([-0.0, 0.0]))
Out[35]: 0.0
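For comparison, Julia's Base minimum gives the IEEE 754-prescribed answers in both cases, which is part of what it spends its time on:

julia> minimum([1.0, NaN, 0.5])
NaN

julia> minimum([0.0, -0.0])
-0.0

julia> minimum([-0.0, 0.0])
-0.0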

I agree. With “equivalent code” I get the same performance only when using @fastmath, so maybe numpy uses it?

Code:

using PyCall

py"""
import numpy as np
import time
def numpy_min(a):
    return np.min(a)

tmean_f32 = 0
tmean_f64 = 0
a_f32 = np.float32(np.random.rand(512000))
a_f64 = np.float64(np.random.rand(512000))
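# accumulate wall-clock time over 1000 calls for each dtype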
for i in range(1000):
    t = time.time()
    numpy_min(a_f32)
    t = time.time() - t
    tmean_f32 += t
    t = time.time()
    numpy_min(a_f64)
    t = time.time() - t
    tmean_f64 += t
    

print("Python :")
print(tmean_f32/1000)
print(tmean_f64/1000)
"""

@fastmath function juju_min(a)
    return minimum(a)
end
function main()
    tmean_f32 = 0.0
    tmean_f64 = 0.0
    a_f32 = rand(Float32,512000)
    a_f64 = rand(Float64,512000)
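    # accumulate wall-clock time (in ns) over 1000 calls for each element type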
    for i in 1:1000
        t = time_ns()
        juju_min(a_f32)
        t = time_ns() - t
        tmean_f32 += t
        t = time_ns()
        juju_min(a_f64)
        t = time_ns() - t
        tmean_f64 += t
    end
    println("Julia :")
    println(1e-9*tmean_f32/1000)
    println(1e-9*tmean_f64/1000)    
end
main()

Results:

Python :
3.199028968811035e-05
7.575440406799317e-05
Julia :
3.69402e-5
7.34801e-5

If numpy uses fastmath, then I don't know how it can have correct IEEE 754 semantics.

Because what fastmath basically does is ignore NaN and make no distinction between -0.0 and 0.0.

julia> fast_minimum(x) = @fastmath minimum(x)
fast_minimum (generic function with 1 method)

julia> @be rand(Float64, 512000) fast_minimum samples=500 evals=100
Benchmark: 213 samples with 100 evaluations
 min    37.794 μs
 median 39.050 μs
 mean   39.180 μs
 max    47.362 μs

julia> @be rand(Float32, 512000) fast_minimum samples=500 evals=100
Benchmark: 431 samples with 100 evaluations
 min    20.070 μs
 median 21.128 μs
 mean   21.190 μs
 max    24.511 μs

julia> fast_minimum([1.0, NaN, 0.5])
NaN

julia> fast_minimum([-0.0, 0.0])
0.0

julia> fast_minimum([0.0, -0.0])
-0.0


julia> @fastmath foldl(min, [1.0, NaN, 0.5])
0.5

That is funny :rofl: Maybe it does some checks just to avoid being too unsafe (precision-wise). Is there an equivalent way to force numpy not to care about float precision? I seem to run into this a lot, since all the maximum operations I do in numerical methods are fastmath ones, so @profview keeps leading me here.

I believe the IEEE 754 minimum of 0.0 and -0.0 (in any order) is -0.0. It seems that numpy doesn’t do that.
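Since -0.0 == 0.0 compares true, signbit is a handy way to check this at the REPL; Julia's scalar min gives the IEEE answer in both orders:

julia> min(0.0, -0.0), min(-0.0, 0.0)
(-0.0, -0.0)

julia> signbit(min(0.0, -0.0))
true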


Does that mean numpy uses fast floating-point operations or not? Sorry, I'm not knowledgeable enough about this.

you’re right…

Yes, but at least they handle NaN, I guess. Ugh, this is messy.

Numpy does this: it calls `_wrapreduction`, which tests for the NaN type:

def _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs):
    passkwargs = {k: v for k, v in kwargs.items()
                  if v is not np._NoValue}

    if type(obj) is not mu.ndarray:
        try:
            reduction = getattr(obj, method)
        except AttributeError:
            pass
        else:
            # This branch is needed for reductions like any which don't
            # support a dtype.
            if dtype is not None:
                return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
            else:
                return reduction(axis=axis, out=out, **passkwargs)

    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

Then I guess it goes to C, because I couldn't trace any further in the source, and the C side may do fastmath-like things.
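For what it's worth, the `ufunc.reduce` call at the end is where it crosses into C. For `np.min` the ufunc is `np.minimum`, so the dispatch can be reproduced by hand as a quick sanity check:

In [1]: import numpy as np

In [2]: a = np.random.rand(512000)

In [3]: np.min(a) == np.minimum.reduce(a)
Out[3]: True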

I don't see where it tests for NaN?

Oh true, sorry; testing `np._NoValue` may not be the same thing.

Julia's fastmath seems to handle it nicely too, so I guess we're OK; presumably they chose the non-fastmath version in Base for good reasons.

julia> @be rand(Float64, 512000) minimum samples=100 evals=50
Benchmark: 100 samples with 50 evaluations
 min    129.675 μs
 median 132.371 μs
 mean   134.741 μs
 max    175.927 μs

julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 46 samples with 50 evaluations
 min    409.566 μs
 median 414.868 μs
 mean   420.615 μs
 max    491.108 μs

julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
 min    2.247 ms
 median 2.256 ms
 mean   2.284 ms
 max    2.346 ms

…sigh


I can confirm from my experience that Chairmarks.jl is not as precise and consistent as BenchmarkTools.jl. It was also a “sigh” for me when I realized it.


Do you have an example? They seem to agree:

julia> @benchmark findmin(x) setup=(x=rand(512000))
BenchmarkTools.Trial: 6634 samples with 1 evaluation.
 Range (min … max):  402.815 μs …  2.079 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     405.450 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   409.490 μs ± 22.419 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █▃ ▇▃ ▁▁▁▁                                                ▁
  ▇▃▇██▇███████▆▆▆▇▆▇▇▆▇▇▇▇█▇▇█▇▇█▇▇▇█▇▆█▇▇▇▇▆▆▆▆▅▆▅▅▅▅▆▆▆▅▅▅▅ █
  403 μs        Histogram: log(frequency) by time       443 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark argmin(x) setup=(x=rand(512000))
BenchmarkTools.Trial: 1696 samples with 1 evaluation.
 Range (min … max):  2.194 ms …   4.118 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.351 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.368 ms ± 137.283 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▁▄▆██▆▅▁
  ▂▂▃▃▃▃▃▃▃▄▅████████▆▆▄▄▃▃▃▃▂▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▁▂▂▁▂▂▁▂ ▃
  2.19 ms         Histogram: frequency by time         2.8 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

In my experiments, LoopVectorization seems to help trigger SIMD instructions:

using BenchmarkTools, LoopVectorization, Random
function minimum_lv(arr)
    vmapreduce(identity, min, arr)
end
function minimum_lv_fastmath(arr)
    @fastmath vmapreduce(identity, min, arr)
end
arr = rand(Float64, 512000);
arr32 = Float32.(arr);
@btime minimum($arr);      # 300.437 μs (0 allocations: 0 bytes)
@btime minimum($arr32);    # 299.563 μs (0 allocations: 0 bytes)
@btime minimum_lv($arr);   # 143.971 μs (0 allocations: 0 bytes)
@btime minimum_lv($arr32); # 72.030 μs (0 allocations: 0 bytes)
@btime minimum_lv_fastmath($arr); # 128.826 μs (0 allocations: 0 bytes)
@btime minimum_lv_fastmath($arr32); # 64.458 μs (0 allocations: 0 bytes)

and numpy:

In [3]: import numpy

In [4]: a = numpy.float32(numpy.random.rand(512000))

In [5]: %timeit numpy.min(a)
85 μs ± 44.7 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
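One caveat worth checking before adopting this: SIMD min instructions don't necessarily follow Base's NaN and signed-zero semantics, so it's worth running the edge cases from earlier in the thread (I haven't verified what these return):

julia> minimum_lv([1.0, NaN, 0.5])   # does NaN propagate?

julia> signbit(minimum_lv([0.0, -0.0]))  # is -0.0 preferred over 0.0?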

Yeah, I pretty much discovered this while trying to move away from LV.jl; see use SIMD.jl directly instead of LV.jl for `fast_findmin()` by Moelf · Pull Request #84 · JuliaHEP/JetReconstruction.jl · GitHub
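For anyone curious, here's a minimal sketch of the SIMD.jl approach (a hypothetical simd_min, not the PR's actual fast_findmin; it assumes the length is a multiple of the vector width, and it inherits the hardware min's NaN/signed-zero behavior rather than Base's):

using SIMD

function simd_min(a::Vector{Float32})
    # process 8 lanes at a time; assumes length(a) is a multiple of 8
    v = vload(Vec{8,Float32}, a, 1)
    @inbounds for i in 9:8:length(a)-7
        v = min(v, vload(Vec{8,Float32}, a, i))
    end
    minimum(v)  # horizontal reduction across the 8 lanes
end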
