Faster `findmin` without LoopVectorization.jl

@graeme-a-stewart has this faster alternative to Base.findmin:

julia> using LoopVectorization, Chairmarks

julia> function fast_findmin(dij, n)
           best = 1
           @inbounds dij_min = dij[1]
           @turbo for here in 2:n
               newmin = dij[here] < dij_min
               best = newmin ? here : best
               dij_min = newmin ? dij[here] : dij_min
           end
           dij_min, best
       end

julia> function basic_findmin(dij, n)
           best = 1
           @inbounds dij_min = dij[1]
           @inbounds @simd for here in 2:n
               dij_here = dij[here]
               newmin = dij_here < dij_min
               best = ifelse(newmin, here, best)
               dij_min = ifelse(newmin, dij_here, dij_min)
           end
           dij_min, best
       end

Benchmark result:

julia> function setup_bench()
           N = rand(200:500)
           dij = abs.(rand(N))
           first_n = rand(1:N)
           dij, first_n
       end

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 27793 samples with 1 evaluation
 min    60.000 ns
 median 90.000 ns
 mean   91.385 ns
 max    841.000 ns

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 25460 samples with 1 evaluation
 min    40.000 ns
 median 241.000 ns
 mean   258.936 ns
 max    10.930 μs

now wondering if this helps: efficient vector masks / reducing bitvector-mask-conversions · Issue #95 · eschnett/SIMD.jl · GitHub

I may be missing something, but the following naive approach seems to be faster than basic_findmin, at least for vectors that are not too long:

function naive_findmin(w, N...)
    x = @fastmath foldl(min, w)
    i = findfirst(==(x), w)::Int
    x, i
end

On my machine I get

julia> N = 300; v = rand(N);

julia> @b basic_findmin($v, $N)
322.393 ns

julia> @b naive_findmin($v)
133.571 ns

EDIT: The timing for naive_findmin depends on where the minimum is. However, it seems to be always faster than basic_findmin.

In this case we do want to look at the mean. You probably need to use the setup function to vary the length and n. I don’t see it being faster than basic_findmin:

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 42416 samples with 1 evaluation
 min    20.000 ns
 median 40.000 ns
 mean   42.799 ns
 max    10.921 μs

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 65642 samples with 1 evaluation
 min    20.000 ns
 median 111.000 ns
 mean   121.999 ns
 max    9.778 μs

julia> @be setup_bench naive_findmin(_...) evals=1
Benchmark: 59873 samples with 1 evaluation
 min    30.000 ns
 median 111.000 ns
 mean   124.057 ns
 max    24.506 μs

On my machine naive_findmin still appears to be faster. I don’t really trust benchmark results with evals = 1. Instead I prefer a loop that calls the function multiple times. Here I do

function sample(f)
    s, i = 0.0, 0
    for _ in 1:1000
        N = rand(200:500)
        v = rand(N)
        t, j = f(v, N)   # make sure that `f` is called
        s += t
        i += j
    end
    s, i
end

I also allow a second parameter in naive_findmin (see the edit in my posting above). Then I get

julia> @b sample(basic_findmin)
1.029 ms (2790 allocs: 2.725 MiB)

julia> @b sample(naive_findmin)
843.207 μs (2806 allocs: 2.723 MiB)

julia> @b Returns((0.0, 0)) sample(_)   # no-op
563.151 μs (2794 allocs: 2.719 MiB)

If you subtract the last benchmark, naive_findmin seems to be roughly 1.7 times faster than basic_findmin.

This reports the minimal time; can you try @be?

The medians suggest that naive_findmin is even faster than what I said before. Do you get something similar on your machine?

julia> @be sample(basic_findmin)
Benchmark: 57 samples with 1 evaluation
 min    1.035 ms (2800 allocs: 2.715 MiB)
 median 1.290 ms (2826 allocs: 2.762 MiB)
 mean   1.644 ms (2826.25 allocs: 2.764 MiB, 8.53% gc time)
 max    7.331 ms (2858 allocs: 2.812 MiB, 75.68% gc time)

julia> @be sample(naive_findmin)
Benchmark: 78 samples with 1 evaluation
 min    827.408 μs (2806 allocs: 2.704 MiB)
 median 991.615 μs (2830.50 allocs: 2.770 MiB)
 mean   1.234 ms (2828.86 allocs: 2.768 MiB, 14.51% gc time)
 max    4.987 ms (2849 allocs: 2.821 MiB, 75.35% gc time)

julia> @be Returns((0.0, 0)) sample(_)
Benchmark: 86 samples with 1 evaluation
 min    608.538 μs (2799 allocs: 2.724 MiB)
 median 893.738 μs (2830 allocs: 2.770 MiB)
 mean   1.094 ms (2830.34 allocs: 2.770 MiB, 19.76% gc time)
 max    3.811 ms (2861 allocs: 2.816 MiB, 76.31% gc time)

alright, so I think the more realistic use case is the following, and I see:

julia> function sample(f::T) where T
           s, i = 0.0, 0
           N = rand(200:500)
           v = rand(N)
           for n in N:-1:1
               t, j = f(v, n)   # make sure that `f` is called
               s += t
               i += j
           end
           s, i
       end
sample (generic function with 1 method)

julia> @b sample(naive_findmin)
2.334 μs (2 allocs: 1.625 KiB)

julia> @b sample(fast_findmin)
2.295 μs (2 allocs: 1.711 KiB)

julia> @b sample(basic_findmin)
10.299 μs (2 allocs: 1.625 KiB)

Since this iterates over the collection twice, it must necessarily be suboptimal.

Of course, that’s why it’s called “naive”. But it vectorizes, so there is a good chance that it’s faster than a method that slowly iterates once. One could iterate over batches of suitable size to see if the minimum changes and then iterate a second time (to find the position) only in case it matters.
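That batch-then-locate idea can be sketched roughly like this (blocked_findmin and the block size are illustrative assumptions, not code from this thread): scan fixed-size blocks with a vectorizable foldl(min, …), remember only the block that lowered the running minimum, and run findfirst over that single block at the end.

```julia
# Sketch of the batched idea: a fast vectorizable min pass over each block,
# with the positional findfirst confined to the one block holding the minimum.
function blocked_findmin(w::AbstractVector{Float64}; blocksize::Int = 64)
    n = length(w)
    best = Inf
    best_start = 1
    i = 1
    while i <= n
        hi = min(i + blocksize - 1, n)
        m = @fastmath foldl(min, view(w, i:hi))   # vectorizes like naive_findmin
        if m < best                               # strict <: keeps the first block on ties
            best = m
            best_start = i
        end
        i = hi + 1
    end
    hi = min(best_start + blocksize - 1, n)
    k = findfirst(==(best), view(w, best_start:hi))::Int
    return best, best_start + k - 1
end
```

The second pass touches at most one block, so for a block size of, say, 64 the extra scan is bounded regardless of the vector length.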

If you want to avoid LoopVectorization.jl, you could always take FindFirstFunctions.jl’s approach. :wink:


Is that like copying/modifying the LLVM output of LV.jl?

Or was it entirely manually written…

indeed, I started looking into SIMD.jl and currently have:

using SIMD

function sample(f::T) where T
    s, i = 0.0, 0
    N = 512
    v = rand(N)
    for n in N:-1:1
        t, j = f(v, n)   # make sure that `f` is called
        s += t
        i += j
    end
    s, i
end

function simd_findmin(dij::DenseVector{Float64}, n)
    lane = VecRange{8}(0)
    vmins = Vec((Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf))
    @inbounds @simd for i in 1:8:length(dij)
        vidxs = lane + i
        vdij = dij[vidxs]
        vmins = vifelse(vdij < vmins, vdij, vmins)
    end
    vmin = minimum(vmins)
    idx_min = findfirst(==(vmin), dij)::Int
    vmin, idx_min
end

@be sample(simd_findmin)
Benchmark: 1244 samples with 1 evaluation
 min    14.064 μs (3 allocs: 4.062 KiB)
 median 72.215 μs (3 allocs: 4.062 KiB)
 mean   75.222 μs (3 allocs: 4.062 KiB, 0.08% gc time)
 max    3.863 ms (3 allocs: 4.062 KiB, 97.37% gc time)

@be sample(naive_findmin)
Benchmark: 1222 samples with 1 evaluation
 min    12.691 μs (3 allocs: 4.062 KiB)
 median 72.951 μs (3 allocs: 4.062 KiB)
 mean   76.087 μs (3 allocs: 4.062 KiB, 0.08% gc time)
 max    3.645 ms (3 allocs: 4.062 KiB, 98.08% gc time)

so it’s about as fast/slow as the naive version; now we just need to somehow fold the findfirst(==(vmin), dij) into the SIMD loop…

A question for the SIMD.jl maintainer: why do I need @simd in this case? Does that indicate some bug in my code?

Taking my implementation from Renderer/src/RayTracingInOneWeekend.jl at master · Zentrik/Renderer · GitHub,

using SIMD

function hor_min(x::SIMD.Vec{8, F}) where F
    @fastmath a = shufflevector(x, Val((4, 5, 6, 7, :undef, :undef, :undef, :undef))) # high half
    @fastmath b = min(a, x)
    @fastmath a = shufflevector(b, Val((2, 3, :undef, :undef, :undef, :undef, :undef, :undef)))
    @fastmath b = min(a, b)
    @fastmath a = shufflevector(b, Val((1, :undef, :undef, :undef, :undef, :undef, :undef, :undef)))
    @fastmath b = min(a, b)
    return @inbounds b[1]
end

@generated function getBits(mask::SIMD.Vec{N, Bool}) where N #This reverses the bits
    s = """
    %mask = trunc <$N x i8> %0 to <$N x i1>
    %res = bitcast <$N x i1> %mask to i$N
    ret i$N %res
    """
    return :(
        $(Expr(:meta, :inline));
        Base.llvmcall($s, UInt8, Tuple{SIMD.LVec{N, Bool}}, mask.data)
    )
end

function my_findmin(x)
    # Assumes x's length is a multiple of 8
    laneIndices = SIMD.Vec{8, Int64}((1, 2, 3, 4, 5, 6, 7, 8))
    minvals = SIMD.Vec{8, Float64}(Inf)
    min_indices = SIMD.Vec{8, Int64}(0)

    lane = VecRange{8}(0)
    i = 1
    @inbounds @fastmath while i <= length(x)
        predicate = x[lane + i] < minvals
        minvals = vifelse(predicate, x[lane + i], minvals)
        min_indices = vifelse(predicate, laneIndices, min_indices)

        i += 8
        laneIndices += 8
    end

    min_value = hor_min(minvals)
    min_index = min_indices[trailing_zeros(getBits(min_value == minvals)) + 1]
    return min_value, min_index
end

I get

julia> using BenchmarkTools

julia> x = rand(512);

julia> @benchmark my_findmin($(Ref(x))[]) samples=10^5
BenchmarkTools.Trial: 43356 samples with 903 evaluations.
 Range (min … max):  122.924 ns … 472.468 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     125.875 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   126.235 ns ±   6.468 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▅▁▁▁▇▆▂▃▇▅▂▂▂▂▁▁                                            ▃
  ██████████████████████▇▇▇▇▆▇▆▆▆▆▆▆▆▆▆▆▆▆▆▅▆▆▆▅▆▆▆▆▆▆▆▆▆▆▆▇▆▆▆ █
  123 ns        Histogram: log(frequency) by time        151 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> function naive_findmin(w)
           x = @fastmath foldl(min, w)
           i = findfirst(==(x), w)::Int
           x, i
       end
naive_findmin (generic function with 2 methods)

julia> @benchmark naive_findmin($(Ref(x))[]) samples=10^5
BenchmarkTools.Trial: 58706 samples with 285 evaluations.
 Range (min … max):  280.425 ns … 936.267 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     288.126 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   293.907 ns ±  23.464 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ██▄▂▃▃▅▄▂▁▁▁▁               ▁                               ▂
  █▃██████████████▇█▇▇▇▇▇▇▇▇▇▇▇▇██▇▆▆▆▆▆▆▆▆▆▅▅▆▆▆▆▅▆▅▅▄▅▄▅▄▅▅▅▄ █
  280 ns        Histogram: log(frequency) by time        426 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I don’t think extending my code to support non-multiple-of-8 lengths should slow things down much.
EDIT: oops, I screwed up calculating the indices, so it’s not as fast as it first appeared. Though if you have AVX512 it will probably be 4x faster.


is this faster than the SIMD.minimum()?

I copied the output, and then modified it/cleaned it up.

I’d also suggest trying that to see the approach LV uses, which is still faster than the other implementations here on my 10980XE (which has AVX512):

julia> @be sample(my_findmin)
Benchmark: 4263 samples with 1 evaluation
 min    18.315 μs (1 allocs: 4.125 KiB)
 median 18.607 μs (1 allocs: 4.125 KiB)
 mean   19.118 μs (1 allocs: 4.125 KiB)
 max    245.306 μs (1 allocs: 4.125 KiB)

julia> @be sample(fast_findmin)
Benchmark: 3097 samples with 2 evaluations
 min    14.210 μs (1 allocs: 4.125 KiB)
 median 14.554 μs (1 allocs: 4.125 KiB)
 mean   14.754 μs (1 allocs: 4.125 KiB)
 max    24.572 μs (1 allocs: 4.125 KiB)

julia> @be sample(naive_findmin)
Benchmark: 213 samples with 1 evaluation
 min    250.449 μs (1 allocs: 4.125 KiB)
 median 436.646 μs (1 allocs: 4.125 KiB)
 mean   426.060 μs (1 allocs: 4.125 KiB)
 max    633.860 μs (1 allocs: 4.125 KiB)

julia> @be sample(simd_findmin)
Benchmark: 688 samples with 1 evaluation
 min    35.047 μs (1 allocs: 4.125 KiB)
 median 130.514 μs (1 allocs: 4.125 KiB)
 mean   132.323 μs (1 allocs: 4.125 KiB)
 max    288.786 μs (1 allocs: 4.125 KiB)

When I wrote it, it must have been. No clue if that’s still the case or even on your hardware, feel free to test it.

ok, I think I found a way to not need the hor_min and bitcast while still having similar performance — can people run this on their arch to check?

function fast_findmin(dij::DenseVector{T}, n) where T
    laneIndices = SIMD.Vec{8, Int}((1, 2, 3, 4, 5, 6, 7, 8))
    minvals = SIMD.Vec{8, T}(Inf)
    min_indices = SIMD.Vec{8, Int}(0)

    n_batches, remainder = divrem(n, 8)
    lane = VecRange{8}(0)
    i = 1
    @inbounds @fastmath for _ in 1:n_batches
        dijs = dij[lane + i]
        predicate = dijs < minvals
        minvals = vifelse(predicate, dijs, minvals)
        min_indices = vifelse(predicate, laneIndices, min_indices)

        i += 8
        laneIndices += 8
    end

    min_value = SIMD.minimum(minvals)
    min_index = @inbounds min_value == minvals[1] ? min_indices[1] : min_value == minvals[2] ? min_indices[2] :
                min_value == minvals[3] ? min_indices[3] : min_value == minvals[4] ? min_indices[4] :
                min_value == minvals[5] ? min_indices[5] : min_value == minvals[6] ? min_indices[6] :
                min_value == minvals[7] ? min_indices[7] : min_indices[8]

    @inbounds @fastmath for _ in 1:remainder
        xi = dij[i]
        pred = xi < min_value
        min_value = ifelse(pred, xi, min_value)
        min_index = ifelse(pred, i, min_index)
        i += 1
    end
    return min_value, min_index
end


julia> for _ = 1:100000
           N = rand(1:1000)
           x = rand(N)
           n = rand(1:N)
           @assert JetReconstruction.fast_findmin(x,n) == JetReconstruction.fast_findmin2(x,n)
       end

julia> @be sample(JetReconstruction.fast_findmin2)
Benchmark: 2789 samples with 2 evaluations
 min    12.649 μs (3 allocs: 4.062 KiB)
 median 13.861 μs (3 allocs: 4.062 KiB)
 mean   15.899 μs (3 allocs: 4.062 KiB, 0.14% gc time)
 max    1.519 ms (3 allocs: 4.062 KiB, 97.72% gc time)

julia> @be sample(JetReconstruction.fast_findmin)
Benchmark: 3050 samples with 2 evaluations
 min    12.133 μs (3 allocs: 4.062 KiB)
 median 12.589 μs (3 allocs: 4.062 KiB)
 mean   14.980 μs (3 allocs: 4.062 KiB, 0.13% gc time)
 max    1.941 ms (3 allocs: 4.062 KiB, 98.62% gc time)

Are you sure this doesn’t do two vloads?

Also, why switch to scalar at the end? Can’t you just end with loading the last 8 elements (after checking that n_batches is greater than 8 and the remainder is nonzero)? It’s my understanding that switching between SIMD and scalar has an extra cost, so it’s better to stay in SIMD.
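One way to sketch that all-SIMD tail (simd_tail_findmin is a hypothetical name, not code from this thread; it assumes SIMD.jl and n ≥ 8): reuse the 8-wide loop body on a final window that ends exactly at n, overlapping already-processed elements. Revisited elements cannot win again because the comparison is strict.

```julia
using SIMD

# Sketch: same 8-wide loop body, but the leftover tail is handled with one
# overlapping vector load instead of a scalar loop. Requires n >= 8; elements
# the overlap revisits cannot update again because the comparison is strict (<).
function simd_tail_findmin(dij::Vector{Float64}, n)
    @assert n >= 8
    lane = VecRange{8}(0)
    laneIndices = Vec{8, Int}((1, 2, 3, 4, 5, 6, 7, 8))
    minvals = Vec{8, Float64}(Inf)
    min_indices = Vec{8, Int}(0)
    i = 1
    @inbounds while i + 7 <= n
        d = dij[lane + i]
        p = d < minvals
        minvals = vifelse(p, d, minvals)
        min_indices = vifelse(p, laneIndices, min_indices)
        i += 8
        laneIndices += 8
    end
    if i <= n                        # tail: one 8-wide window ending exactly at n
        j = n - 7
        d = @inbounds dij[lane + j]
        idxs = Vec{8, Int}((0, 1, 2, 3, 4, 5, 6, 7)) + j
        p = d < minvals
        minvals = vifelse(p, d, minvals)
        min_indices = vifelse(p, idxs, min_indices)
    end
    min_value = minimum(minvals)
    min_index = typemax(Int)         # smallest matching index = first occurrence
    @inbounds for l in 1:8
        if minvals[l] == min_value
            min_index = min(min_index, min_indices[l])
        end
    end
    return min_value, min_index
end
```

Note that revisiting up to seven elements is safe for a minimum (strict <) but would double-count in a reduction like a sum, and a scalar path is still needed for n < 8.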


not sure, let me fix it!

good point (I think the condition is n_batches > 0 and remainder != 0). But I think scalar is good enough for now, since it would complicate the function quite a bit (you still need a scalar version for when n < 8, which happens