Yes, llvm is fairly smart in general. Also, my implementation can be made a bit faster by inlining hor_min and adding @inbounds to the calculation of min_index.
But does smart help here? What guarantee is there that x hasn't been modified between the loads? Doesn't the code require a new load, for correctness?
fixed
Be careful with that. LLVM will silently generate nonsense code for non-power-of-2 sizes. Afaiu there is no consensus in the llvm community whether this is a bug or whether this idiom for pmovmskb only works coincidentally. So you need pretty comprehensive unit tests for that, in order to yell if an llvm update breaks this.
The relevant point of contention appears to be that the bitcast langref refers to the “in memory representation” of the datatypes, but i1 has no in-memory representation. Yes, this is very silly and annoying. I hate language-lawyering with a passion.
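For anyone following along, this is roughly the idiom being discussed: a lane-wise comparison whose <N x i1> result is bitcast to an N-bit integer, pmovmskb-style. Below is a minimal sketch of my own using a power-of-2 width (i.e. the uncontroversial case); the name movemask_lt and the 8-lane width are assumptions for illustration, not code from any of the implementations above:

const V8F64 = NTuple{8,VecElement{Float64}}

# Hypothetical helper: compare two 8-lane Float64 vectors lane-wise and pack
# the <8 x i1> comparison result into the low bits of a UInt8 via the
# `bitcast <8 x i1> to i8` idiom discussed above.
@inline function movemask_lt(a::V8F64, b::V8F64)
    Base.llvmcall("""
        %cmp = fcmp olt <8 x double> %0, %1
        %mask = bitcast <8 x i1> %cmp to i8
        ret i8 %mask
        """, UInt8, Tuple{V8F64,V8F64}, a, b)
end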
Thanks @jling for bringing this up. I have been very interested in the discussion.
Just to give a bit of context here: in this problem the metric dij is calculated for a set of clusters (usually 300-500 at the beginning) and at each iteration we need to find its minimum value. Then two clusters are merged into one, and some of the dij values are updated. The next minimum is found, a merger happens, and so on, until the process finishes.
This is why the realistic test is to take a random array of length, say, 400: find the minimum, then reduce the search length to 399, 398, 397, …, 1. Adding some update to the array values each iteration makes it even more realistic (prevents unrealistic caching).
This is my wrapper to test any findmin implementation f on a vector v:
function run_descent(v::DenseVector{Float64}, f::T; perturb = 0) where T
    # Ensure we do something with the calculation to prevent the
    # compiler from optimizing everything away!
    sum = 0.0
    for n in length(v):-1:2
        val, _ = f(v, n)
        sum += val
        # If one wants to perturb the array do it like this, which
        # is a proxy for changing values as the algorithm progresses.
        for _ in 1:min(perturb, n)
            v[rand(1:n)] = abs(rand())
        end
    end
    sum
end
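For example, a hypothetical driver (not part of the original script) could look like this, benchmarking one of the implementations discussed below on a fresh copy of the data for each sample:

using BenchmarkTools

v = rand(400)
# evals = 1 so that every sample starts from an unperturbed copy of v
@benchmark run_descent(w, naive_findmin; perturb = 2) setup = (w = copy($v)) evals = 1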
I introduced the trick with LoopVectorisation when I realised that the out-of-the-box Julia findmin was really slow. However, now I am realising that even just a basic naive loop is a lot faster than findmin, so a lot of the speed-up may have nothing to do with vectorisation.
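For reference, here is a sketch of the kind of basic naive loop I mean (an assumption on my part; the actual basic_findmin in the script may differ in detail):

# Plain scalar loop over the first n entries, tracking the smallest value and
# its index. No LoopVectorization, no SIMD intrinsics.
function basic_findmin(dij::DenseVector{T}, n) where T
    best = dij[1]
    best_index = 1
    @inbounds for i in 2:n
        if dij[i] < best
            best = dij[i]
            best_index = i
        end
    end
    best, best_index
end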
I wrapped up all of the suggestions so far into a script and this is what I get on two machines:
Running the benchmark for an array of length 400 (Apple M2)
fast_findmin min: 25.333000000000002 μs; mean: 26.46452926525524 μs
basic_findmin min: 83.25 μs; mean: 85.91027592592592 μs
julia_findmin min: 360.791 μs; mean: 366.5392078431371 μs
naive_findmin min: 23.167 μs; mean: 36.22719819819824 μs
naive_findmin_reduce min: 24.875 μs; mean: 36.793294815987366 μs
fast_findmin_simd min: 42.291000000000004 μs; mean: 44.39575992348151 μs
Running the benchmark for an array of length 400 (x86_64)
fast_findmin min: 16.721 μs; mean: 16.934860248447162 μs
basic_findmin min: 22.782 μs; mean: 23.006807827395896 μs
julia_findmin min: 82.84500000000001 μs; mean: 88.04729608404968 μs
naive_findmin min: 17.352000000000004 μs; mean: 33.093611011154984 μs
naive_findmin_reduce min: 18.884999999999998 μs; mean: 34.14370736589271 μs
fast_findmin_simd min: 16.54 μs; mean: 18.151618091451382 μs
The platform differences are revealing, particularly that basic_findmin is quite bad on the M2, but performs well on the x86_64. Julia's findmin is particularly bad on M2.
Nothing is quite beating the LoopVectorization version fast_findmin, but naive_findmin is quite respectable on both platforms, though slower than basic_findmin on x86. fast_findmin_simd is good on x86, not great on M2 (which only has NEON, as opposed to AVX2).
fast_findmin_simd is also pretty complicated, so I would not like to have that in my package (though FindFirstFunctions.jl is terrifying!).
Microbenchmarks can be misleading, so my next test is in the real code paths of JetReconstruction.jl.
I’m quite surprised by this myself!
In my version the second argument n is ignored, which means that the whole vector is always searched. Have you changed this already? Otherwise you need something like the following:
function naive_findmin(w, n)
    u = view(w, 1:n)
    x = @fastmath foldl(min, u)
    i = findfirst(==(x), u)::Int
    x, i
end
It is currently like this:
function naive_findmin(dij::DenseVector{T}, n) where T
    x = @fastmath foldl(min, @view dij[1:n])
    i = findfirst(==(x), dij)::Int
    x, i
end
So, yes, in the index finder the value of n is ignored, but as x must occur within 1:n, findfirst will exit as soon as it finds the correct value, so it will never search beyond n.
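A quick REPL illustration with toy values (not from the benchmark) of why the search never goes past n:

julia> w = [3.0, 1.0, 2.0, 1.0, 0.5];

julia> n = 4; x = foldl(min, @view w[1:n])
1.0

julia> findfirst(==(x), w)  # first occurrence is at index 2, well within 1:n
2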
I did a quick test with your version and AFAICS the benchmark is unchanged.
And evidently the compiler does a good job with this code!
Apologies if this was already brought up, but I just saw this, which is the most perplexing to me yet:
julia> x = randn(4096);
julia> @benchmark findmin($x)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max): 6.067 μs … 21.583 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.083 μs ┊ GC (median): 0.00%
Time (mean ± σ): 6.226 μs ± 843.417 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▁ ▁▂ ▁
██▇▆▄▄▅▄███▇▅▄▄▃▄▄▄▁▃▄▁▄▃▄▃▃▆▆▄▆▅▃▃▅▄▃▃▃▄▃▅▄▁▅▄▄▃▅▃▃▄▄▄▄▁▄▄ █
6.07 μs Histogram: log(frequency) by time 10.9 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark minimum($x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.600 μs … 57.200 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.700 μs ┊ GC (median): 0.00%
Time (mean ± σ): 13.996 μs ± 2.498 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂▁ ▁ ▁
███▅▅▄▆█▆▇▆▆▆▆▅▄▄▃▃▄▃▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▄▅▅ █
13.6 μs Histogram: log(frequency) by time 27.8 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
These results are consistent for me across many tests. This is on Julia v1.11.1.
To me there’s a cross-over point:
julia> x = randn(1024);
julia> @be findmin($x)
Benchmark: 3029 samples with 34 evaluations
min 861.588 ns
median 875.147 ns
mean 890.109 ns
max 4.051 μs
julia> @be minimum($x)
Benchmark: 3103 samples with 26 evaluations
min 1.024 μs
median 1.112 μs
mean 1.132 μs
max 4.352 μs
julia> x = randn(10240);
julia> @be findmin($x)
Benchmark: 3546 samples with 3 evaluations
min 8.372 μs
median 8.553 μs
mean 8.551 μs
max 12.413 μs
julia> @be minimum($x)
Benchmark: 3188 samples with 8 evaluations
min 3.334 μs
median 3.450 μs
mean 3.451 μs
max 5.810 μs
Maybe you just need a larger N, larger than 4096.
For me minimum is slower in all cases, at least up to 10^6.
julia> @btime findmin(x) setup=(x=randn(2^20))
1.567 ms (0 allocations: 0 bytes)
(-4.8003337992633694, 112476)
julia> @btime minimum(x) setup=(x=randn(2^20))
3.518 ms (0 allocations: 0 bytes)
-5.193793445487401
Not for me. Even up to 2^26 (=67M):
julia> @btime findmin(x) setup=(x=randn(2^26))
173.805 ms (0 allocations: 0 bytes)
(-5.225567009788859, 4044609)
julia> @btime minimum(x) setup=(x=randn(2^26))
234.388 ms (0 allocations: 0 bytes)
-5.577759186585374
Oh ok, sorry, I misread your last message. What CPU are you using?
12 × Intel(R) Core™ i7-9750H CPU @ 2.60GHz
Interesting, on both Zen2 and Zen4 I see:
#zen2 24 × AMD Ryzen 9 3900X 12-Core Processor
julia> @btime findmin(x) setup=(x=randn(2^26))
81.600 ms (0 allocations: 0 bytes)
(-5.711457299002498, 57053893)
julia> @btime minimum(x) setup=(x=randn(2^26))
26.146 ms (0 allocations: 0 bytes)
-5.982782024669893
#zen4 16 × AMD Ryzen 7 7840HS
julia> @btime findmin(x) setup=(x=randn(2^26))
55.342 ms (0 allocations: 0 bytes)
(-5.408897097757254, 24735986)
julia> @btime minimum(x) setup=(x=randn(2^26))
17.472 ms (0 allocations: 0 bytes)
-5.696407087349619
Can we split this thread into another: findmin() faster than minimum() on some CPU?