Faster `findmin` without LoopVectorization.jl

@graeme-a-stewart has this faster alternative to `Base.findmin`:

julia> using LoopVectorization, Chairmarks

julia> function fast_findmin(dij, n)
           best = 1
           @inbounds dij_min = dij[1]
           @turbo for here in 2:n
               newmin = dij[here] < dij_min
               best = newmin ? here : best
               dij_min = newmin ? dij[here] : dij_min
           end
           dij_min, best
       end

julia> function basic_findmin(dij, n)
           best = 1
           @inbounds dij_min = dij[1]
           @inbounds @simd for here in 2:n
               dij_here = dij[here]
               newmin = dij_here < dij_min
               best = ifelse(newmin, here, best)
               dij_min = ifelse(newmin, dij_here, dij_min)
           end
           dij_min, best
       end

Benchmark result:

julia> function setup_bench()
           N = rand(200:500)
           dij = abs.(rand(N))
           first_n = rand(1:N)
           dij, first_n
       end

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 27793 samples with 1 evaluation
 min    60.000 ns
 median 90.000 ns
 mean   91.385 ns
 max    841.000 ns

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 25460 samples with 1 evaluation
 min    40.000 ns
 median 241.000 ns
 mean   258.936 ns
 max    10.930 μs

now wondering if this helps: efficient vector masks / reducing bitvector-mask-conversions · Issue #95 · eschnett/SIMD.jl · GitHub

I may be missing something, but the following naive approach seems to be faster than basic_findmin, at least for vectors that are not too long:

function naive_findmin(w, N...)   # trailing arguments (like `n`) are accepted and ignored
    x = @fastmath foldl(min, w)
    i = findfirst(==(x), w)::Int
    x, i
end

On my machine I get

julia> N = 300; v = rand(N);

julia> @b basic_findmin($v, $N)
322.393 ns

julia> @b naive_findmin($v)
133.571 ns

EDIT: The timing for naive_findmin depends on where the minimum is. However, it seems to always be faster than basic_findmin.

In this case we do want to look at the mean. You probably need to use the setup function so that the length and `n` vary between samples. I don’t see naive_findmin being faster than basic_findmin:

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 42416 samples with 1 evaluation
 min    20.000 ns
 median 40.000 ns
 mean   42.799 ns
 max    10.921 μs

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 65642 samples with 1 evaluation
 min    20.000 ns
 median 111.000 ns
 mean   121.999 ns
 max    9.778 μs

julia> @be setup_bench naive_findmin(_...) evals=1
Benchmark: 59873 samples with 1 evaluation
 min    30.000 ns
 median 111.000 ns
 mean   124.057 ns
 max    24.506 μs

On my machine naive_findmin still appears to be faster. I don’t really trust benchmark results with evals = 1. Instead I prefer a loop that calls the function multiple times. Here I do

function sample(f)
    s, i = 0.0, 0
    for _ in 1:1000
        N = rand(200:500)
        v = rand(N)
        t, j = f(v, N)   # make sure that `f` is called
        s += t
        i += j
    end
    s, i
end

I also allow a second parameter in naive_findmin (see the edit in my posting above). Then I get

julia> @b sample(basic_findmin)
1.029 ms (2790 allocs: 2.725 MiB)

julia> @b sample(naive_findmin)
843.207 μs (2806 allocs: 2.723 MiB)

julia> @b Returns((0.0, 0)) sample(_)   # no-op
563.151 μs (2794 allocs: 2.719 MiB)

If you subtract the last benchmark, naive_findmin seems to be roughly 1.7 times faster than basic_findmin.
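Spelling out that correction with the timings reported above (everything converted to ms):

```julia
# Subtract the no-op harness time from both measurements, then compare:
overhead = 0.563151                        # ms, the Returns((0.0, 0)) run
speedup = (1.029 - overhead) / (0.843207 - overhead)
println(speedup)                           # ≈ 1.66, i.e. "roughly 1.7"
```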

This reports the minimal time; can you try `@be`?

The medians suggest that naive_findmin is even faster than what I said before. Do you get something similar on your machine?

julia> @be sample(basic_findmin)
Benchmark: 57 samples with 1 evaluation
 min    1.035 ms (2800 allocs: 2.715 MiB)
 median 1.290 ms (2826 allocs: 2.762 MiB)
 mean   1.644 ms (2826.25 allocs: 2.764 MiB, 8.53% gc time)
 max    7.331 ms (2858 allocs: 2.812 MiB, 75.68% gc time)

julia> @be sample(naive_findmin)
Benchmark: 78 samples with 1 evaluation
 min    827.408 μs (2806 allocs: 2.704 MiB)
 median 991.615 μs (2830.50 allocs: 2.770 MiB)
 mean   1.234 ms (2828.86 allocs: 2.768 MiB, 14.51% gc time)
 max    4.987 ms (2849 allocs: 2.821 MiB, 75.35% gc time)

julia> @be Returns((0.0, 0)) sample(_)
Benchmark: 86 samples with 1 evaluation
 min    608.538 μs (2799 allocs: 2.724 MiB)
 median 893.738 μs (2830 allocs: 2.770 MiB)
 mean   1.094 ms (2830.34 allocs: 2.770 MiB, 19.76% gc time)
 max    3.811 ms (2861 allocs: 2.816 MiB, 76.31% gc time)

alright, so I think the more realistic use case is calling it repeatedly with a shrinking `n` on the same vector, and I see:

julia> function sample(f::T) where T
           s, i = 0.0, 0
           N = rand(200:500)
           v = rand(N)
           for n in N:-1:1
               t, j = f(v, n)   # make sure that `f` is called
               s += t
               i += j
           end
           s, i
       end
sample (generic function with 1 method)

julia> @b sample(naive_findmin)
2.334 μs (2 allocs: 1.625 KiB)

julia> @b sample(fast_findmin)
2.295 μs (2 allocs: 1.711 KiB)

julia> @b sample(basic_findmin)
10.299 μs (2 allocs: 1.625 KiB)

Since this iterates over the collection twice, it must necessarily be suboptimal.

Of course, that’s why it’s called “naive”. But it vectorizes, so there is a good chance that it’s faster than a method that slowly iterates once. One could iterate over batches of suitable size to see if the minimum changes and then iterate a second time (to find the position) only in case it matters.
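A sketch of that batched two-pass idea in plain Base (my own illustration, not code from the thread; the block size of 32 is an arbitrary choice): scan fixed-size blocks tracking only the block minimum, then rescan just the winning block to recover the position.

```julia
function batched_findmin(w::AbstractVector{Float64}, n::Integer = length(w))
    best_min, best_lo = Inf, 1
    lo = 1
    while lo <= n
        hi = min(lo + 31, n)
        m = Inf
        @inbounds @simd for i in lo:hi   # no index tracking, so this vectorizes well
            m = ifelse(w[i] < m, w[i], m)
        end
        if m < best_min                  # remember only which block won
            best_min, best_lo = m, lo
        end
        lo = hi + 1
    end
    # second pass, but only over the single winning block
    idx = best_lo
    @inbounds for i in best_lo:min(best_lo + 31, n)
        w[i] == best_min && (idx = i; break)
    end
    best_min, idx
end
```

For random data this returns the same `(value, index)` pair as `Base.findmin`, while the hot loop stays branch-free.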

If you want to avoid LoopVectorization.jl, you could always take FindFirstFunctions.jl’s approach. :wink:

Is that like copying/modifying the LLVM output of LV.jl?

Or was it entirely written by hand…

indeed, I started looking into SIMD.jl and currently have:

using SIMD

function sample(f::T) where T
    s, i = 0.0, 0
    N = 512
    v = rand(N)
    for n in N:-1:1
        t, j = f(v, n)   # make sure that `f` is called
        s += t
        i += j
    end
    s, i
end

function simd_findmin(dij::DenseVector{Float64}, n)
    lane = VecRange{8}(0)
    vmins = Vec((Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf))
    # NB: scans the whole vector in 8-wide batches; `n` is ignored and
    # length(dij) is assumed to be a multiple of 8
    @inbounds @simd for i in 1:8:length(dij)
        vidxs = lane + i
        vdij = dij[vidxs]
        vmins = vifelse(vdij < vmins, vdij, vmins)
    end
    vmin = minimum(vmins)
    idx_min = findfirst(==(vmin), dij)::Int
    vmin, idx_min
end

@be sample(simd_findmin)
Benchmark: 1244 samples with 1 evaluation
 min    14.064 μs (3 allocs: 4.062 KiB)
 median 72.215 μs (3 allocs: 4.062 KiB)
 mean   75.222 μs (3 allocs: 4.062 KiB, 0.08% gc time)
 max    3.863 ms (3 allocs: 4.062 KiB, 97.37% gc time)

@be sample(naive_findmin)
Benchmark: 1222 samples with 1 evaluation
 min    12.691 μs (3 allocs: 4.062 KiB)
 median 72.951 μs (3 allocs: 4.062 KiB)
 mean   76.087 μs (3 allocs: 4.062 KiB, 0.08% gc time)
 max    3.645 ms (3 allocs: 4.062 KiB, 98.08% gc time)

so it’s about as fast/slow as the naive version; now we just need to somehow fold the `findfirst(==(vmin), dij)` into the SIMD loop…

A question for SIMD.jl maintainer, why do I need @simd in this case? Does that indicate some bug in my code?

Taking my implementation from Renderer/src/RayTracingInOneWeekend.jl at master · Zentrik/Renderer · GitHub,

using SIMD

function hor_min(x::SIMD.Vec{8, F}) where F
    @fastmath a = shufflevector(x, Val((4, 5, 6, 7, :undef, :undef, :undef, :undef))) # high half
    @fastmath b = min(a, x)
    @fastmath a = shufflevector(b, Val((2, 3, :undef, :undef, :undef, :undef, :undef, :undef)))
    @fastmath b = min(a, b)
    @fastmath a = shufflevector(b, Val((1, :undef, :undef, :undef, :undef, :undef, :undef, :undef)))
    @fastmath b = min(a, b)
    return @inbounds b[1]
end

@generated function getBits(mask::SIMD.Vec{N, Bool}) where N #This reverses the bits
    s = """
    %mask = trunc <$N x i8> %0 to <$N x i1>
    %res = bitcast <$N x i1> %mask to i$N
    ret i$N %res
    """
    return :(
        $(Expr(:meta, :inline));
        Base.llvmcall($s, UInt8, Tuple{SIMD.LVec{N, Bool}}, mask.data)
    )
end

function my_findmin(x)
    # Assumes x's length is a multiple of 8
    laneIndices = SIMD.Vec{8, Int64}((1, 2, 3, 4, 5, 6, 7, 8))
    minvals = SIMD.Vec{8, Float64}(Inf)
    min_indices = SIMD.Vec{8, Int64}(0)

    lane = VecRange{8}(0)
    i = 1
    @inbounds @fastmath while i <= length(x)
        predicate = x[lane + i] < minvals
        minvals = vifelse(predicate, x[lane + i], minvals)
        min_indices = vifelse(predicate, laneIndices, min_indices)

        i += 8
        laneIndices += 8
    end

    min_value = hor_min(minvals)
    min_index = min_indices[trailing_zeros(getBits(min_value == minvals)) + 1]
    return min_value, min_index
end

I get

julia> using BenchmarkTools

julia> x = rand(512);

julia> @benchmark my_findmin($(Ref(x))[]) samples=10^5
BenchmarkTools.Trial: 43356 samples with 903 evaluations.
 Range (min … max):  122.924 ns … 472.468 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     125.875 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   126.235 ns ±   6.468 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▅▁▁▁▇▆▂▃▇▅▂▂▂▂▁▁                                            ▃
  ██████████████████████▇▇▇▇▆▇▆▆▆▆▆▆▆▆▆▆▆▆▆▅▆▆▆▅▆▆▆▆▆▆▆▆▆▆▆▇▆▆▆ █
  123 ns        Histogram: log(frequency) by time        151 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> function naive_findmin(w)
           x = @fastmath foldl(min, w)
           i = findfirst(==(x), w)::Int
           x, i
       end
naive_findmin (generic function with 2 methods)

julia> @benchmark naive_findmin($(Ref(x))[]) samples=10^5
BenchmarkTools.Trial: 58706 samples with 285 evaluations.
 Range (min … max):  280.425 ns … 936.267 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     288.126 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   293.907 ns ±  23.464 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ██▄▂▃▃▅▄▂▁▁▁▁               ▁                               ▂
  █▃██████████████▇█▇▇▇▇▇▇▇▇▇▇▇▇██▇▆▆▆▆▆▆▆▆▆▅▅▆▆▆▆▅▆▅▅▄▅▄▅▄▅▅▅▄ █
  280 ns        Histogram: log(frequency) by time        426 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I don’t think extending my code to support non multiple of 8 lengths should slow things down much.
EDIT: oops, screwed up calculating the indices, not as fast as it first appeared. Though if you have avx512 it will probably be 4x faster.

is this faster than the SIMD.minimum()?

I copied the output, and then modified it/cleaned it up.

I’d also suggest trying that to see the approach LV uses, which is still faster than the other implementations here on my 10980XE (which has AVX512):

julia> @be sample(my_findmin)
Benchmark: 4263 samples with 1 evaluation
 min    18.315 μs (1 allocs: 4.125 KiB)
 median 18.607 μs (1 allocs: 4.125 KiB)
 mean   19.118 μs (1 allocs: 4.125 KiB)
 max    245.306 μs (1 allocs: 4.125 KiB)

julia> @be sample(fast_findmin)
Benchmark: 3097 samples with 2 evaluations
 min    14.210 μs (1 allocs: 4.125 KiB)
 median 14.554 μs (1 allocs: 4.125 KiB)
 mean   14.754 μs (1 allocs: 4.125 KiB)
 max    24.572 μs (1 allocs: 4.125 KiB)

julia> @be sample(naive_findmin)
Benchmark: 213 samples with 1 evaluation
 min    250.449 μs (1 allocs: 4.125 KiB)
 median 436.646 μs (1 allocs: 4.125 KiB)
 mean   426.060 μs (1 allocs: 4.125 KiB)
 max    633.860 μs (1 allocs: 4.125 KiB)

julia> @be sample(simd_findmin)
Benchmark: 688 samples with 1 evaluation
 min    35.047 μs (1 allocs: 4.125 KiB)
 median 130.514 μs (1 allocs: 4.125 KiB)
 mean   132.323 μs (1 allocs: 4.125 KiB)
 max    288.786 μs (1 allocs: 4.125 KiB)

When I wrote it, it must have been. No clue if that’s still the case or even on your hardware, feel free to test it.

ok I think I found a way to avoid the hor_min and bitcast while still having similar performance, can people run this on their arch to check?

function fast_findmin(dij::DenseVector{T}, n) where T
    laneIndices = SIMD.Vec{8, Int}((1, 2, 3, 4, 5, 6, 7, 8))
    minvals = SIMD.Vec{8, T}(Inf)
    min_indices = SIMD.Vec{8, Int}(0)

    n_batches, remainder = divrem(n, 8)
    lane = VecRange{8}(0)
    i = 1
    @inbounds @fastmath for _ in 1:n_batches
        dijs = dij[lane + i]
        predicate = dijs < minvals
        minvals = vifelse(predicate, dijs, minvals)
        min_indices = vifelse(predicate, laneIndices, min_indices)

        i += 8
        laneIndices += 8
    end

    min_value = SIMD.minimum(minvals)
    min_index = @inbounds min_value == minvals[1] ? min_indices[1] :
                min_value == minvals[2] ? min_indices[2] :
                min_value == minvals[3] ? min_indices[3] :
                min_value == minvals[4] ? min_indices[4] :
                min_value == minvals[5] ? min_indices[5] :
                min_value == minvals[6] ? min_indices[6] :
                min_value == minvals[7] ? min_indices[7] : min_indices[8]

    # scalar loop over the final `remainder` elements
    @inbounds @fastmath for _ in 1:remainder
        xi = dij[i]
        pred = xi < min_value
        min_value = ifelse(pred, xi, min_value)
        min_index = ifelse(pred, i, min_index)
        i += 1
    end
    return min_value, min_index
end


julia> for _ = 1:100000
           N = rand(1:1000)
           x = rand(N)
           n = rand(1:N)
           @assert JetReconstruction.fast_findmin(x,n) == JetReconstruction.fast_findmin2(x,n)
       end

julia> @be sample(JetReconstruction.fast_findmin2)
Benchmark: 2789 samples with 2 evaluations
 min    12.649 μs (3 allocs: 4.062 KiB)
 median 13.861 μs (3 allocs: 4.062 KiB)
 mean   15.899 μs (3 allocs: 4.062 KiB, 0.14% gc time)
 max    1.519 ms (3 allocs: 4.062 KiB, 97.72% gc time)

julia> @be sample(JetReconstruction.fast_findmin)
Benchmark: 3050 samples with 2 evaluations
 min    12.133 μs (3 allocs: 4.062 KiB)
 median 12.589 μs (3 allocs: 4.062 KiB)
 mean   14.980 μs (3 allocs: 4.062 KiB, 0.13% gc time)
 max    1.941 ms (3 allocs: 4.062 KiB, 98.62% gc time)

Are you sure this doesn’t do two vloads?

Also, why switch to scalar at the end? Can’t you just end with loading the last 8 elements (after checking that n_batches is greater than 8, and the remainder is nonzero)? It’s my understanding that switching between SIMD and scalar has an extra cost, so it’s better to stay in SIMD.

not sure, let me fix it!

good point (I think the condition is n_batches > 0 and remainder != 0). But I think scalar is good enough for now, since avoiding it would complicate the function by quite a bit (you still need the scalar version for when n < 8, which happens).
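For what it's worth, here is a sketch of the overlapping-tail idea (my own, assuming SIMD.jl; index tracking is omitted for brevity and the name `tail_min` is illustrative): when n >= 8, the remainder is handled by one extra vector load of the last 8 elements, `dij[n-7:n]`. Revisiting elements already seen by the main loop is harmless for a minimum.

```julia
using SIMD

function tail_min(dij::DenseVector{Float64}, n)
    n < 8 && return minimum(view(dij, 1:n))  # tiny inputs: plain scalar path
    lane = VecRange{8}(0)
    vmins = Vec{8, Float64}(Inf)
    i = 1
    @inbounds while i + 7 <= n               # full 8-wide batches
        v = dij[lane + i]
        vmins = vifelse(v < vmins, v, vmins)
        i += 8
    end
    if i <= n                                # remainder: one overlapping load
        v = @inbounds dij[lane + (n - 7)]    # covers dij[n-7:n]
        vmins = vifelse(v < vmins, v, vmins)
    end
    minimum(vmins)                           # horizontal reduction
end
```

Extending this to also track indices would reuse the same overlapping load for the min_indices update, with the lane indices set to `n-7:n`.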