Yes, llvm is fairly smart in general.

Also, my implementation can be made a bit faster by inlining `hor_min` and adding `@inbounds` to the calculation of `min_index`.


But does smart help here? What guarantee is there that `x` hasn’t been modified between the loads? Doesn’t the code *require* a new load, for correctness?

fixed

Be careful with that. LLVM will silently generate nonsense code for non-power-of-2 sizes. As far as I understand, there is no consensus in the llvm community on whether this is a bug or whether this idiom for pmovmskb only works coincidentally. So you need pretty comprehensive unit tests for that, in order to yell if an llvm update breaks this.

The relevant point of contention appears to be that the bitcast langref refers to the “in memory representation” of the datatypes, but i1 has no in-memory representation. Yes, this is very silly and annoying. I hate language-lawyering with a passion.
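For readers who haven't seen it, here is a minimal sketch (my own illustration, not code from the thread) of the idiom being discussed: compare two SIMD vectors and bitcast the `<8 x i1>` mask to an `i8`, which LLVM lowers to movemask-style instructions on x86. Note that 8 is a power of 2, which is exactly the case the warning above says is the only one you can (coincidentally?) rely on. The name `vless_mask` and the fixed width are illustrative choices.

```julia
using Base: llvmcall, VecElement

const V8F64 = NTuple{8,VecElement{Float64}}

# Compare lanes of a and b with `a < b` and pack the <8 x i1> result
# into the low 8 bits of an Int32 via the contested bitcast idiom.
@inline function vless_mask(a::V8F64, b::V8F64)
    llvmcall("""
        %m = fcmp olt <8 x double> %0, %1
        %r = bitcast <8 x i1> %m to i8
        %z = zext i8 %r to i32
        ret i32 %z
        """, Int32, Tuple{V8F64,V8F64}, a, b)
end

a = ntuple(i -> VecElement(Float64(i)), 8)  # lanes 1.0, 2.0, ..., 8.0
b = ntuple(_ -> VecElement(4.5), 8)
vless_mask(a, b)  # lanes 1-4 satisfy a < b, so bits 0-3 are set (value 15)
```

A unit test asserting the expected mask value, as suggested above, is exactly what would catch an llvm update silently changing this lowering.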


Thanks @jling for bringing this up. I have been very interested in the discussion.

Just to give a bit of context here: in this problem the metric `dij` is calculated for a set of clusters (usually 300-500 at the beginning), and each iteration we need to find its minimum value. Then two clusters are merged into one, and some of the `dij` values are updated. The next minimum is found, a merger happens, and so on, until the process finishes.

This is why the realistic test is to take a random array of length, say, 400. Find the minimum, then reduce to 399, 398, 397, …, 1. Adding some update to the array values each iteration makes it even more realistic (prevents unrealistic caching).

This is my wrapper to test any findmin implementation `f` on a vector `v`:

```
function run_descent(v::DenseVector{Float64}, f::T; perturb = 0) where T
    # Ensure we do something with the calculation to prevent the
    # compiler from optimizing everything away!
    sum = 0.0
    for n in length(v):-1:2
        val, _ = f(v, n)
        sum += val
        # If one wants to perturb the array do it like this, which
        # is a proxy for changing values as the algorithm progresses.
        for _ in 1:min(perturb, n)
            v[rand(1:n)] = abs(rand())
        end
    end
    sum
end
```
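For concreteness, here is a hypothetical minimal scalar implementation with the `f(v, n)` signature that `run_descent` expects (the name `scalar_findmin` is mine, not one of the benchmarked variants from the script):

```julia
# Hypothetical scalar findmin over the first n elements, matching the
# f(v, n) signature used by run_descent above: returns (value, index).
function scalar_findmin(v::AbstractVector{Float64}, n::Integer)
    best, idx = v[1], 1
    @inbounds for i in 2:n
        if v[i] < best
            best, idx = v[i], i
        end
    end
    best, idx
end

scalar_findmin([3.0, 1.0, 2.0], 3)  # → (1.0, 2)
```

It would then be exercised as, e.g., `run_descent(abs.(rand(400)), scalar_findmin; perturb = 2)`.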

I introduced the trick with `LoopVectorisation` when I realised that the out-of-the-box Julia `findmin` was really slow. However, now I am realising that even a basic naive loop is a lot faster than `findmin`, so a lot of the speed-up may have nothing to do with vectorisation.

I wrapped up all of the suggestions so far into a script and this is what I get on two machines:

```
Running the benchmark for an array of length 400
fast_findmin min: 25.333000000000002 μs; mean: 26.46452926525524 μs
basic_findmin min: 83.25 μs; mean: 85.91027592592592 μs
julia_findmin min: 360.791 μs; mean: 366.5392078431371 μs
naive_findmin min: 23.167 μs; mean: 36.22719819819824 μs
naive_findmin_reduce min: 24.875 μs; mean: 36.793294815987366 μs
fast_findmin_simd min: 42.291000000000004 μs; mean: 44.39575992348151 μs
```

```
Running the benchmark for an array of length 400
fast_findmin min: 16.721 μs; mean: 16.934860248447162 μs
basic_findmin min: 22.782 μs; mean: 23.006807827395896 μs
julia_findmin min: 82.84500000000001 μs; mean: 88.04729608404968 μs
naive_findmin min: 17.352000000000004 μs; mean: 33.093611011154984 μs
naive_findmin_reduce min: 18.884999999999998 μs; mean: 34.14370736589271 μs
fast_findmin_simd min: 16.54 μs; mean: 18.151618091451382 μs
```

The platform differences are revealing, particularly that `basic_findmin` is quite bad on the M2, but performs well on the x86_64. Julia’s `findmin` is particularly bad on the M2.

Nothing quite beats the `LoopVectorization` version `fast_findmin`, but `naive_findmin` is quite respectable on both platforms, though slower than `basic_findmin` on x86. `fast_findmin_simd` is good on x86, but not great on the M2 (which only has NEON, as opposed to AVX2).

`fast_findmin_simd` is also pretty complicated, so I would not like to have that in my package (though `FindFirstFunctions.jl` is terrifying!).

Microbenchmarks can be misleading, so my next test is in the real code paths of `JetReconstruction.jl`.


I’m quite surprised by this myself!

In my version the second argument `n` is ignored, which means that the whole vector is always searched. Have you changed this already? Otherwise you need something like the following:

```
function naive_findmin(w, n)
    u = view(w, 1:n)
    x = @fastmath foldl(min, u)
    i = findfirst(==(x), u)::Int
    x, i
end
```

It is currently like this:

```
function naive_findmin(dij::DenseVector{T}, n) where T
    x = @fastmath foldl(min, @view dij[1:n])
    i = findfirst(==(x), dij)::Int
    x, i
end
```

So, yes, in the index finder the value of `n` is ignored. But since `x` is the minimum over `dij[1:n]`, it must occur at some index in `1:n`, and `findfirst` will exit as soon as it finds that value, so it will never search beyond `n`.

I did a quick test with your version and AFAICS the benchmark is unchanged.

And evidently the compiler does a good job with this code!


Apologies if this was already brought up, but I just saw this, which is the most perplexing to me yet:

```
julia> x = randn(4096);
julia> @benchmark findmin($x)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max): 6.067 μs … 21.583 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.083 μs ┊ GC (median): 0.00%
Time (mean ± σ): 6.226 μs ± 843.417 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▁ ▁▂ ▁
██▇▆▄▄▅▄███▇▅▄▄▃▄▄▄▁▃▄▁▄▃▄▃▃▆▆▄▆▅▃▃▅▄▃▃▃▄▃▅▄▁▅▄▄▃▅▃▃▄▄▄▄▁▄▄ █
6.07 μs Histogram: log(frequency) by time 10.9 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark minimum($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.600 μs … 57.200 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.700 μs ┊ GC (median): 0.00%
Time (mean ± σ): 13.996 μs ± 2.498 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂▁ ▁ ▁
███▅▅▄▆█▆▇▆▆▆▆▅▄▄▃▃▄▃▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▄▅▅ █
13.6 μs Histogram: log(frequency) by time 27.8 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
```

These results are consistent for me for many tests.

This is v1.11.1

To me there’s a cross-over point:

```
julia> x = randn(1024);
julia> @be findmin($x)
Benchmark: 3029 samples with 34 evaluations
min 861.588 ns
median 875.147 ns
mean 890.109 ns
max 4.051 μs
julia> @be minimum($x)
Benchmark: 3103 samples with 26 evaluations
min 1.024 μs
median 1.112 μs
mean 1.132 μs
max 4.352 μs
julia> x = randn(10240);
julia> @be findmin($x)
Benchmark: 3546 samples with 3 evaluations
min 8.372 μs
median 8.553 μs
mean 8.551 μs
max 12.413 μs
julia> @be minimum($x)
Benchmark: 3188 samples with 8 evaluations
min 3.334 μs
median 3.450 μs
mean 3.451 μs
max 5.810 μs
```

Maybe you just need a larger `N`, larger than 4096.

For me `minimum` is slower in all cases, at least up to 10^6.

```
julia> @btime findmin(x) setup=(x=randn(2^20))
1.567 ms (0 allocations: 0 bytes)
(-4.8003337992633694, 112476)
julia> @btime minimum(x) setup=(x=randn(2^20))
3.518 ms (0 allocations: 0 bytes)
-5.193793445487401
```

Not for me, even up to 2^26 (= 67M):

```
julia> @btime findmin(x) setup=(x=randn(2^26))
173.805 ms (0 allocations: 0 bytes)
(-5.225567009788859, 4044609)
julia> @btime minimum(x) setup=(x=randn(2^26))
234.388 ms (0 allocations: 0 bytes)
-5.577759186585374
```

Oh, OK, sorry, I misread your last message. What CPU are you using?

12 × Intel(R) Core™ i7-9750H CPU @ 2.60GHz

Interesting. On both Zen2 and Zen4 I see:

```
#zen2 24 × AMD Ryzen 9 3900X 12-Core Processor
julia> @btime findmin(x) setup=(x=randn(2^26))
81.600 ms (0 allocations: 0 bytes)
(-5.711457299002498, 57053893)
julia> @btime minimum(x) setup=(x=randn(2^26))
26.146 ms (0 allocations: 0 bytes)
-5.982782024669893
#zen4 16 × AMD Ryzen 7 7840HS
julia> @btime findmin(x) setup=(x=randn(2^26))
55.342 ms (0 allocations: 0 bytes)
(-5.408897097757254, 24735986)
julia> @btime minimum(x) setup=(x=randn(2^26))
17.472 ms (0 allocations: 0 bytes)
-5.696407087349619
```

Can we split this thread into a new one, “`findmin()` faster than `minimum()` on some CPUs”?
