Faster `findmin` without LoopVectorization.jl

@graeme-a-stewart has this faster alternative to Base.findmin:

julia> using LoopVectorization, Chairmarks

julia> function fast_findmin(dij, n)
           best = 1
           @inbounds dij_min = dij[1]
           @turbo for here in 2:n
               newmin = dij[here] < dij_min
               best = newmin ? here : best
               dij_min = newmin ? dij[here] : dij_min
           end
           dij_min, best
       end

julia> function basic_findmin(dij, n)
           best = 1
           @inbounds dij_min = dij[1]
           @inbounds @simd for here in 2:n
               dij_here = dij[here]
               newmin = dij_here < dij_min
               best = ifelse(newmin, here, best)
               dij_min = ifelse(newmin, dij_here, dij_min)
           end
           dij_min, best
       end
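As a quick sanity check (my addition, not part of the original post; the definition is repeated so the snippet runs standalone), the branchless loop should agree with Base.findmin on NaN-free input, including ties, where the strict `<` keeps the first occurrence:

```julia
# Repeats basic_findmin from above; the strict `<` plus ifelse keeps the
# loop branchless and returns the first occurrence on ties, matching
# Base.findmin for NaN-free input.
function basic_findmin(dij, n)
    best = 1
    @inbounds dij_min = dij[1]
    @inbounds @simd for here in 2:n
        dij_here = dij[here]
        newmin = dij_here < dij_min
        best = ifelse(newmin, here, best)
        dij_min = ifelse(newmin, dij_here, dij_min)
    end
    dij_min, best
end

v = [0.9, 0.2, 0.7, 0.2, 0.5]        # duplicate minimum at indices 2 and 4
@assert basic_findmin(v, length(v)) == findmin(v)   # both give (0.2, 2)
```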

Benchmark result:

julia> function setup_bench()
           N = rand(200:500)
           dij = abs.(rand(N))
           first_n = rand(1:N)
           dij, first_n
       end

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 27793 samples with 1 evaluation
 min    60.000 ns
 median 90.000 ns
 mean   91.385 ns
 max    841.000 ns

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 25460 samples with 1 evaluation
 min    40.000 ns
 median 241.000 ns
 mean   258.936 ns
 max    10.930 μs

now wondering if this helps: efficient vector masks / reducing bitvector-mask-conversions · Issue #95 · eschnett/SIMD.jl · GitHub

I may be missing something, but the following naive approach seems to be faster than basic_findmin, at least for vectors that are not too long:

function naive_findmin(w, N...)
    x = @fastmath foldl(min, w)
    i = findfirst(==(x), w)::Int
    x, i
end
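One caveat worth flagging (my note, not from the thread): the @fastmath reduction assumes the data contains no NaN, and the findfirst pass relies on the reduced minimum comparing `==` to an element, which holds because min returns one of its arguments exactly. A standalone check (repeating the definition so the snippet runs on its own):

```julia
# naive_findmin as above: one vectorizable reduction pass for the value,
# one findfirst pass for the index. Assumes NaN-free input because of
# @fastmath. `N...` absorbs an optional (ignored) length argument.
function naive_findmin(w, N...)
    x = @fastmath foldl(min, w)
    i = findfirst(==(x), w)::Int
    x, i
end

v = [0.4, 0.1, 0.3, 0.1]
@assert naive_findmin(v) == findmin(v)             # (0.1, 2): first occurrence
@assert naive_findmin(v, length(v)) == findmin(v)  # extra argument is ignored
```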

On my machine I get

julia> N = 300; v = rand(N);

julia> @b basic_findmin($v, $N)
322.393 ns

julia> @b naive_findmin($v)
133.571 ns

EDIT: The timing for naive_findmin depends on where the minimum is. However, it seems to always be faster than basic_findmin.

In this case we do want to look at the mean. You probably need to use the setup function to vary the length and `n`. I don’t see it being faster than basic_findmin:

julia> @be setup_bench fast_findmin(_...) evals=1
Benchmark: 42416 samples with 1 evaluation
 min    20.000 ns
 median 40.000 ns
 mean   42.799 ns
 max    10.921 μs

julia> @be setup_bench basic_findmin(_...) evals=1
Benchmark: 65642 samples with 1 evaluation
 min    20.000 ns
 median 111.000 ns
 mean   121.999 ns
 max    9.778 μs

julia> @be setup_bench naive_findmin(_...) evals=1
Benchmark: 59873 samples with 1 evaluation
 min    30.000 ns
 median 111.000 ns
 mean   124.057 ns
 max    24.506 μs

On my machine naive_findmin still appears to be faster. I don’t really trust benchmark results with evals=1; instead I prefer a loop that calls the function multiple times. Here I use

function sample(f)
    s, i = 0.0, 0
    for _ in 1:1000
        N = rand(200:500)
        v = rand(N)
        t, j = f(v, N)   # make sure that `f` is called
        s += t
        i += j
    end
    s, i
end

I also allow a second parameter in naive_findmin (see the edit in my posting above). Then I get

julia> @b sample(basic_findmin)
1.029 ms (2790 allocs: 2.725 MiB)

julia> @b sample(naive_findmin)
843.207 μs (2806 allocs: 2.723 MiB)

julia> @b Returns((0.0, 0)) sample(_)   # no-op
563.151 ΞΌs (2794 allocs: 2.719 MiB)

If you subtract the last benchmark, naive_findmin seems to be roughly 1.7 times faster than basic_findmin.

This reports the minimal time; can you try @be?

The medians suggest that naive_findmin is even faster than what I said before. Do you get something similar on your machine?

julia> @be sample(basic_findmin)
Benchmark: 57 samples with 1 evaluation
 min    1.035 ms (2800 allocs: 2.715 MiB)
 median 1.290 ms (2826 allocs: 2.762 MiB)
 mean   1.644 ms (2826.25 allocs: 2.764 MiB, 8.53% gc time)
 max    7.331 ms (2858 allocs: 2.812 MiB, 75.68% gc time)

julia> @be sample(naive_findmin)
Benchmark: 78 samples with 1 evaluation
 min    827.408 μs (2806 allocs: 2.704 MiB)
 median 991.615 μs (2830.50 allocs: 2.770 MiB)
 mean   1.234 ms (2828.86 allocs: 2.768 MiB, 14.51% gc time)
 max    4.987 ms (2849 allocs: 2.821 MiB, 75.35% gc time)

julia> @be Returns((0.0, 0)) sample(_)
Benchmark: 86 samples with 1 evaluation
 min    608.538 μs (2799 allocs: 2.724 MiB)
 median 893.738 μs (2830 allocs: 2.770 MiB)
 mean   1.094 ms (2830.34 allocs: 2.770 MiB, 19.76% gc time)
 max    3.811 ms (2861 allocs: 2.816 MiB, 76.31% gc time)

alright, so I think the more realistic use case is to scan the same vector repeatedly with a shrinking `n`, and I see:

julia> function sample(f::T) where T
           s, i = 0.0, 0
           N = rand(200:500)
           v = rand(N)
           for n in N:-1:1
               t, j = f(v, n)   # make sure that `f` is called
               s += t
               i += j
           end
           s, i
       end
sample (generic function with 1 method)

julia> @b sample(naive_findmin)
2.334 μs (2 allocs: 1.625 KiB)

julia> @b sample(fast_findmin)
2.295 μs (2 allocs: 1.711 KiB)

julia> @b sample(basic_findmin)
10.299 μs (2 allocs: 1.625 KiB)

Since this iterates over the collection twice, it must necessarily be suboptimal.

Of course, that’s why it’s called β€œnaive”. But it vectorizes, so there is a good chance that it’s faster than a method that iterates only once but slowly. One could iterate over batches of suitable size to see whether the minimum changes, and then iterate a second time (to find the position) only when it matters.
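That batching idea can be sketched roughly as follows (my own sketch, not from the thread; `block_findmin` and the block size of 64 are made up): a vectorizable first pass tracks the running minimum per fixed-size block and remembers the first block that attains it, and the second pass scans only from that block’s start for the index.

```julia
# Hypothetical two-pass block approach: a branchless first pass finds the
# minimum and the first block that attains it; the second pass only scans
# from that block's start. Assumes a nonempty vector.
function block_findmin(w::AbstractVector{Float64}, block::Int = 64)
    best = Inf
    best_start = 1
    n = length(w)
    i = 1
    @inbounds while i <= n
        stop = min(i + block - 1, n)
        m = w[i]
        @simd for j in (i + 1):stop
            m = ifelse(w[j] < m, w[j], m)   # branchless block minimum
        end
        if m < best         # strict <: the first block with the minimum wins
            best = m
            best_start = i
        end
        i += block
    end
    idx = findnext(==(best), w, best_start)::Int
    best, idx
end

@assert block_findmin([3.0, 1.0, 2.0, 1.0], 2) == (1.0, 2)
```

Whether this beats the single-pass variants would depend on block size and where the minimum tends to sit, so it would need benchmarking like the others.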

If you want to avoid LoopVectorization.jl, you could always take FindFirstFunctions.jl’s approach. :wink:


Is that like copying/modifying the LLVM output of LV.jl?

Or was it written entirely by hand…

indeed, I started looking into SIMD.jl and currently have:

using SIMD

function sample(f::T) where T
    s, i = 0.0, 0
    N = 512
    v = rand(N)
    for n in N:-1:1
        t, j = f(v, n)   # make sure that `f` is called
        s += t
        i += j
    end
    s, i
end

# Assumes length(dij) is a multiple of 8; `n` is currently ignored.
function simd_findmin(dij::DenseVector{Float64}, n)
    lane = VecRange{8}(0)
    vmins = Vec((Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf))
    @inbounds @simd for i in 1:8:length(dij)
        vidxs = lane + i
        vdij = dij[vidxs]
        vmins = vifelse(vdij < vmins, vdij, vmins)   # per-lane running minima
    end
    vmin = minimum(vmins)                 # horizontal reduction across lanes
    idx_min = findfirst(==(vmin), dij)::Int
    vmin, idx_min
end

julia> @be sample(simd_findmin)
Benchmark: 1244 samples with 1 evaluation
 min    14.064 μs (3 allocs: 4.062 KiB)
 median 72.215 μs (3 allocs: 4.062 KiB)
 mean   75.222 μs (3 allocs: 4.062 KiB, 0.08% gc time)
 max    3.863 ms (3 allocs: 4.062 KiB, 97.37% gc time)

julia> @be sample(naive_findmin)
Benchmark: 1222 samples with 1 evaluation
 min    12.691 μs (3 allocs: 4.062 KiB)
 median 72.951 μs (3 allocs: 4.062 KiB)
 mean   76.087 μs (3 allocs: 4.062 KiB, 0.08% gc time)
 max    3.645 ms (3 allocs: 4.062 KiB, 98.08% gc time)

so it’s about as fast/slow as the naive version; now we just need to somehow fold the findfirst(==(vmin), dij) into the SIMD loop…

A question for the SIMD.jl maintainer: why do I need @simd in this case? Does that indicate some bug in my code?


Taking my implementation from Renderer/src/RayTracingInOneWeekend.jl at master Β· Zentrik/Renderer Β· GitHub,

using SIMD

function hor_min(x::SIMD.Vec{8, F}) where F
    @fastmath a = shufflevector(x, Val((4, 5, 6, 7, :undef, :undef, :undef, :undef))) # high half
    @fastmath b = min(a, x)
    @fastmath a = shufflevector(b, Val((2, 3, :undef, :undef, :undef, :undef, :undef, :undef)))
    @fastmath b = min(a, b)
    @fastmath a = shufflevector(b, Val((1, :undef, :undef, :undef, :undef, :undef, :undef, :undef)))
    @fastmath b = min(a, b)
    return @inbounds b[1]
end

@generated function getBits(mask::SIMD.Vec{N, Bool}) where N #This reverses the bits
    s = """
    %mask = trunc <$N x i8> %0 to <$N x i1>
    %res = bitcast <$N x i1> %mask to i$N
    ret i$N %res
    """
    return :(
        $(Expr(:meta, :inline));
        Base.llvmcall($s, UInt8, Tuple{SIMD.LVec{N, Bool}}, mask.data)
    )
end

function my_findmin(x)
    # Assumes x's length is a multiple of 8
    laneIndices = SIMD.Vec{8, Int64}((1, 2, 3, 4, 5, 6, 7, 8))
    minvals = SIMD.Vec{8, Float64}(Inf)
    min_indices = SIMD.Vec{8, Int64}(0)

    lane = VecRange{8}(0)
    i = 1
    @inbounds @fastmath while i <= length(x)
        predicate = x[lane + i] < minvals
        minvals = vifelse(predicate, x[lane + i], minvals)
        min_indices = vifelse(predicate, laneIndices, min_indices)

        i += 8
        laneIndices += 8
    end

    min_value = hor_min(minvals)
    min_index = min_indices[trailing_zeros(getBits(min_value == minvals)) + 1]
    return min_value, min_index
end

I get

julia> using BenchmarkTools

julia> x = rand(512);

julia> @benchmark my_findmin($(Ref(x))[]) samples=10^5
BenchmarkTools.Trial: 43356 samples with 903 evaluations.
 Range (min … max):  122.924 ns … 472.468 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     125.875 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   126.235 ns ±   6.468 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  123 ns        Histogram: log(frequency) by time        151 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> function naive_findmin(w)
           x = @fastmath foldl(min, w)
           i = findfirst(==(x), w)::Int
           x, i
       end
naive_findmin (generic function with 2 methods)

julia> @benchmark naive_findmin($(Ref(x))[]) samples=10^5
BenchmarkTools.Trial: 58706 samples with 285 evaluations.
 Range (min … max):  280.425 ns … 936.267 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     288.126 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   293.907 ns ±  23.464 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  280 ns        Histogram: log(frequency) by time        426 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
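For readers unfamiliar with the shufflevector-based reduction in hor_min above, the halving idea can be illustrated with plain tuples (my scalar sketch; `hor_min_scalar` is made up and has none of the SIMD speed):

```julia
# Scalar illustration of hor_min's tree reduction: repeatedly fold the
# upper half onto the lower half with elementwise min, halving the
# problem each step (8 -> 4 -> 2 -> 1).
function hor_min_scalar(x::NTuple{8,Float64})
    a = ntuple(i -> min(x[i], x[i + 4]), 4)   # fold high half onto low half
    b = ntuple(i -> min(a[i], a[i + 2]), 2)
    min(b[1], b[2])
end

@assert hor_min_scalar((5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0)) == 1.0
```

The SIMD version does the same three folds, but each `min` operates on whole vectors, so the horizontal reduction costs only log2(8) = 3 steps.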

I don’t think extending my code to support lengths that aren’t a multiple of 8 should slow things down much.
EDIT: oops, I screwed up calculating the indices, so it’s not as fast as it first appeared. Though if you have AVX512 it will probably be ~4x faster.


is this faster than SIMD.jl’s `minimum`?

I copied the output and then modified/cleaned it up.

I’d also suggest trying that to see the approach LV uses, which is still faster than the other implementations here on my 10980XE (which has AVX512):

julia> @be sample(my_findmin)
Benchmark: 4263 samples with 1 evaluation
 min    18.315 μs (1 allocs: 4.125 KiB)
 median 18.607 μs (1 allocs: 4.125 KiB)
 mean   19.118 μs (1 allocs: 4.125 KiB)
 max    245.306 μs (1 allocs: 4.125 KiB)

julia> @be sample(fast_findmin)
Benchmark: 3097 samples with 2 evaluations
 min    14.210 μs (1 allocs: 4.125 KiB)
 median 14.554 μs (1 allocs: 4.125 KiB)
 mean   14.754 μs (1 allocs: 4.125 KiB)
 max    24.572 μs (1 allocs: 4.125 KiB)

julia> @be sample(naive_findmin)
Benchmark: 213 samples with 1 evaluation
 min    250.449 μs (1 allocs: 4.125 KiB)
 median 436.646 μs (1 allocs: 4.125 KiB)
 mean   426.060 μs (1 allocs: 4.125 KiB)
 max    633.860 μs (1 allocs: 4.125 KiB)

julia> @be sample(simd_findmin)
Benchmark: 688 samples with 1 evaluation
 min    35.047 μs (1 allocs: 4.125 KiB)
 median 130.514 μs (1 allocs: 4.125 KiB)
 mean   132.323 μs (1 allocs: 4.125 KiB)
 max    288.786 μs (1 allocs: 4.125 KiB)

When I wrote it, it must have been. No clue whether that’s still the case, or whether it holds on your hardware; feel free to test it.