I think there are enough issues on GitHub already and nothing is moving, so I’m not going to spam there; I’m posting here instead to see if anyone has a temporary workaround:

julia> using Chairmarks
julia> @be rand(Float32, 512000) minimum samples=500 evals=100
Benchmark: 73 samples with 100 evaluations
min 128.636 μs
median 130.828 μs
mean 131.679 μs
max 150.319 μs
julia> @be rand(Float64, 512000) minimum samples=500 evals=100
Benchmark: 73 samples with 100 evaluations
min 128.678 μs
median 131.226 μs
mean 131.578 μs
max 143.314 μs
julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver4)

In [43]: a = numpy.float32(numpy.random.rand(512000))
In [44]: %timeit numpy.min(a)
23.5 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [45]: a = numpy.random.rand(512000)
In [46]: %timeit numpy.min(a)
40.7 µs ± 189 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [47]: numpy.__version__
Out[47]: '1.26.4'

Clearly, NumPy is using SIMD, because its speed changes with the element width (Float32 is nearly twice as fast as Float64), while Julia’s minimum takes the same time for both types.

In [31]: numpy.min(numpy.array([1.0, numpy.nan, 0.5]))
Out[31]: nan
In [34]: numpy.min(numpy.array([0.0, -0.0]))
Out[34]: -0.0
In [35]: numpy.min(numpy.array([-0.0, 0.0]))
Out[35]: 0.0
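For reference, the comparison rule consistent with those three outputs can be sketched in pure Python. This is a hypothetical helper (`numpy_like_min` is my name, not NumPy’s actual code): NaN poisons the result, and ties, including signed zeros, keep the later element.

```python
import math

def numpy_like_min(a):
    # Illustrative sketch, not NumPy source: a scalar reduction that
    # reproduces the three observed outputs above.
    acc = a[0]
    for x in a[1:]:
        # NaN anywhere poisons the result.
        if math.isnan(acc) or math.isnan(x):
            return math.nan
        # Ties (including -0.0 vs 0.0, which compare equal) take the
        # later element, matching the signed-zero outputs above.
        if not (acc < x):
            acc = x
    return acc

print(numpy_like_min([1.0, math.nan, 0.5]))  # nan
print(numpy_like_min([0.0, -0.0]))           # -0.0
print(numpy_like_min([-0.0, 0.0]))           # 0.0
```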

I agree that I get the same results with “equivalent code”, but I only get the same performance when using @fastmath, so maybe NumPy uses it?
Code:

using PyCall

py"""
import numpy as np
import time

def numpy_min(a):
    return np.min(a)

tmean_f32 = 0
tmean_f64 = 0
a_f32 = np.float32(np.random.rand(512000))
a_f64 = np.float64(np.random.rand(512000))
for i in range(1000):
    t = time.time()
    numpy_min(a_f32)
    t = time.time() - t
    tmean_f32 += t
    t = time.time()
    numpy_min(a_f64)
    t = time.time() - t
    tmean_f64 += t
print("Python :")
print(tmean_f32/1000)
print(tmean_f64/1000)
"""

@fastmath function juju_min(a)
    return minimum(a)
end

function main()
    tmean_f32 = 0.0
    tmean_f64 = 0.0
    a_f32 = rand(Float32, 512000)
    a_f64 = rand(Float64, 512000)
    for i in 1:1000
        t = time_ns()
        juju_min(a_f32)
        t = time_ns() - t
        tmean_f32 += t
        t = time_ns()
        juju_min(a_f64)
        t = time_ns() - t
        tmean_f64 += t
    end
    println("Julia :")
    println(1e-9*tmean_f32/1000)
    println(1e-9*tmean_f64/1000)
end

main()

Results:

Python :
3.199028968811035e-05
7.575440406799317e-05
Julia :
3.69402e-5
7.34801e-5
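As an aside, the hand-rolled time.time()/time_ns() loop above averages warm-up and scheduler noise into the result. A sketch of a less noisy harness using the stdlib timeit module (on a plain Python list here, so the absolute numbers are not comparable to the NumPy or Julia ones):

```python
import random
import timeit

a = [random.random() for _ in range(512_000)]

# timeit.repeat runs the statement `number` times per trial; taking the
# minimum over trials suppresses scheduler and GC noise better than a mean.
per_call = min(timeit.repeat(lambda: min(a), number=5, repeat=3)) / 5
print(f"pure-Python min: {per_call * 1e6:.1f} µs per call")
```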

That is funny. Maybe it does some checks just to avoid being too unsafe precision-wise. Is there an equivalent switch to make NumPy ignore strict float precision? I seem to run into this a lot: all the maximum/minimum operations in my numerical methods are fastmath ones, so @profview keeps leading me back to this.

Here is what NumPy does: np.min calls _wrapreduction, which checks the argument’s type and dispatches the reduction:

def _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs):
    passkwargs = {k: v for k, v in kwargs.items()
                  if v is not np._NoValue}

    if type(obj) is not mu.ndarray:
        try:
            reduction = getattr(obj, method)
        except AttributeError:
            pass
        else:
            # This branch is needed for reductions like any which don't
            # support a dtype.
            if dtype is not None:
                return reduction(axis=axis, dtype=dtype, out=out, **passkwargs)
            else:
                return reduction(axis=axis, out=out, **passkwargs)

Then I guess it drops into C, because I couldn’t trace the source any further, and the C side may well do fastmath-style things.

julia> @be rand(Float64, 512000) minimum samples=100 evals=50
Benchmark: 100 samples with 50 evaluations
min 129.675 μs
median 132.371 μs
mean 134.741 μs
max 175.927 μs
julia> @be rand(Float64, 512000) findmin samples=100 evals=50
Benchmark: 46 samples with 50 evaluations
min 409.566 μs
median 414.868 μs
mean 420.615 μs
max 491.108 μs
julia> @be rand(Float64, 512000) argmin samples=100 evals=50
Benchmark: 9 samples with 50 evaluations
min 2.247 ms
median 2.256 ms
mean 2.284 ms
max 2.346 ms
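A plausible reason for the gap: findmin and argmin must carry the winning index through the loop, and that extra data dependence is exactly the kind of thing that tends to block SIMD. A pure-Python sketch of the bookkeeping (illustrative only, not Julia’s implementation):

```python
def findmin_sketch(a):
    # Tracks (value, index): the index update depends on each
    # iteration's comparison, which typically defeats vectorization.
    best, best_i = a[0], 0
    for i in range(1, len(a)):
        if a[i] < best:
            best, best_i = a[i], i
    return best, best_i

print(findmin_sketch([3.0, 1.0, 2.0]))  # (1.0, 1)
```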

I can confirm from my own experience that Chairmarks.jl is not as precise and consistent as BenchmarkTools.jl. Realizing that was a “sigh” moment for me too.

In my experiments, LoopVectorization.jl seems to help trigger SIMD instructions.

using BenchmarkTools, LoopVectorization, Random

function minimum_lv(arr)
    vmapreduce(identity, min, arr)
end

function minimum_lv_fastmath(arr)
    @fastmath vmapreduce(identity, min, arr)
end

arr = rand(Float64, 512000);
arr32 = Float32.(arr);

In [3]: import numpy
In [4]: a = numpy.float32(numpy.random.rand(512000))
In [5]: %timeit numpy.min(a)
85 μs ± 44.7 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)