Faster `min`,`max` for `Float64|32`

That is good news. Thanks for looking at this.

Of course, thanks! (I also checked there arenā€™t any in the version with Base.min either)

I was bothered because I had inconsistent results for smaller vectors, but it appears to have been a mistake on my part (polluted REPL or the like)

Judging by the look of your native code, this is on an AVX-512 machine?

LLVM likes to turn simple branches into selects if it lets it SIMD the loop.
I also have problems with inconsistent benchmark results sometimes.

I can run @benchmark and the timing is very consistent across 10_000 or so iterations, and then rerun the same line and again get very consistent results ā€“ but now maybe 10% higher or lower.
71x71 matrix multiplication with a simple @avx tripple nested for loop being the prime example where I see that. Iā€™m not sure branch prediction isnā€™t involved, but using https://github.com/JuliaPerf/LinuxPerf.jl, I donā€™t recall seeing any major differences other than simply average # insructions / ns ranging from around 2.6 to 2.9; IIRC branch prediction and cache hits looked similar.

Yes ā€“ I was replying to your request but forgot to actually say so. I edited my comment to clarify.

3 Likes

I incorporated your branch savings into the code at top. Good idea.

There doesnā€™t seem to be anything wrong with either min or max in isolation, but with minmax you force it do run both subtractions, while it seems like you can get away with changing either, e.g.:

function man(a::T, b::T) where {T<:FastFloat}
    a_b = a - b
    (signbit(a_b) | isnan(a)) ? b : a
end

doing away with one and the same subtraction, and fewer jumps/shorter assembly:

@code_native minmax(0.0, -0.0)
1 Like

I find using signbit(_) || isnan(_) benchmarks a little faster than
using signbit(_) | isnan(_). The native code is a bit tighter, as well.

I am making that adjustment. Let me know if your results differ.

Benchmarking the two approaches favors the original way. Curious.

I understand you want to reuse min and max, but this is likely faster:

function minmax(a::T, b::T) where {T<:FastFloat}
    a_b = a - b
    (signbit(a_b) || isnan(a)) ? (a, b) : (b, a)
end

This one has only one jump, and strangely more by changing to | from || (EDIT: I see it now, it has a conditional move, cmovnel, instead, probably a good thing, I also see my older code gives similar if not the same code):

function minmax(a::Float64, b::Float64)
    a_b = a - b
    ((reinterpret(UInt64, a_b) >> (64-1)) != 0) || isnan(a) ? (a, b) : (b, a)
end

Here Juliaā€™s compiler is too clever, and while Iā€™m trying to have only one jump (with | to disable short-circuiting) it adds another, and if I read the assembly correctly it does something like isnan(a) || ā€¦ :

function minmax(a::Float64, b::Float64)
    a_b = a - b
    ((reinterpret(UInt64, a_b) >> (64-1))) % Bool | (isnan(a)) ? (a, b) : (b, a)
end

if I run with lower optimization I get kind of what I wanted, one jump, but longer assembly:

julia -O1

Youā€™re right that itā€™s probably faster, but it loses some of the IEEE-754 compliance with NaN for max:

Test Code
const FastFloat = Union{Float32, Float64}

function minmax_(a::T, b::T) where {T<:FastFloat}
    a_b = a - b
    (signbit(a_b) || isnan(a)) ? (a, b) : (b, a)
end

max_(a, b) = minmax_(a, b)[2]

function repr_(x)
    if isnan(x)
        "NaN($(repr(reinterpret(UInt, x))))"
    else
        repr(x)
    end
end

# NaN with a payload
NaN1 =  reinterpret(Float64, 0x7ff800000000dead)

# Tests
for a in (-Inf, -1., -0., 0., 1., Inf, NaN)
    for b in (-Inf, -1., -0., 0., 1., Inf, NaN1)
        if max(a,b)!==max_(a,b)
            println("max($a, $b) -> ", "\t", repr_(max(a, b)), "\t", repr_(max_(a,b)))
        end
    end
end

Results:

#                       Base                    Proposed implementation
max(-Inf, NaN) ->       NaN(0x7ff800000000dead) -Inf
max(-1.0, NaN) ->       NaN(0x7ff800000000dead) -1.0
max(-0.0, NaN) ->       NaN(0x7ff800000000dead) -0.0
max(0.0, NaN) ->        NaN(0x7ff800000000dead) 0.0
max(1.0, NaN) ->        NaN(0x7ff800000000dead) 1.0
max(Inf, NaN) ->        NaN(0x7ff800000000dead) Inf
max(NaN, -Inf) ->       NaN(0x7ff8000000000000) -Inf
max(NaN, -1.0) ->       NaN(0x7ff8000000000000) -1.0
max(NaN, -0.0) ->       NaN(0x7ff8000000000000) -0.0
max(NaN, 0.0) ->        NaN(0x7ff8000000000000) 0.0
max(NaN, 1.0) ->        NaN(0x7ff8000000000000) 1.0
max(NaN, Inf) ->        NaN(0x7ff8000000000000) Inf

(This is because it canā€™t be assumed that signbit(b-a) is different from signbit(a-b): while this is valid for most values, it isnā€™t when a or b is NaN)

3 Likes

More exhaustive benchmarking tends to corroborate this: signbit || isnan seems to be significantly faster for AVX-2 architectures. Here is a typical plot of the time per element vs array size, for AVX-2 systems (this particular one has been obtained with an Intel Core i5-6200U CPU, but Iā€™ve also tested on Core i5-9300H and Xeon E-2176G, which yield similar curves)


Whereas there seems to be a slight advantage for the signbit | isnan implementation for AVX-512 systems (the curve below has been obtained for an Intel Xeon Platinum 8256 CPU, but Iā€™ve also had similar results on a Xeon Gold 6128)

Overall, the performance difference is more significant for AVX-2, so I think youā€™re right to keep the signbit || isnan variant by default.


In case anyone would like to reproduce, below is the code used to generate the data and plot the graphs (I must shamefully admit I used gnuplot instead of Plots.jl)

Benchmarking code
const FastFloat = Union{Float32, Float64}

function min_1(a::T, b::T) where {T<:FastFloat}
    a_b = a-b
    (signbit(a_b) || isnan(a)) ? a : b
end

function min_2(a::T, b::T) where {T<:FastFloat}
    a_b = a-b
    (signbit(a_b) | isnan(a)) ? a : b
end

function bench!(c, a, b, f)
    @simd for i in eachindex(c)
        @inbounds c[i] = f(a[i], b[i])
    end
end

using Random
using BenchmarkTools
using Printf
BenchmarkTools.DEFAULT_PARAMETERS.samples = 100000

a = Float64[]
b = similar(a)
c = similar(a)

n = 1
@printf("%-10s", "#elems")
@printf("  %14s", "Base.min")
@printf("  %14s", "signbit||isnan")
@printf("  %14s", "signbit|isnan")
print("\n")
while n <= 300
    N = n*64
    resize!(a, N)
    resize!(b, N)
    resize!(c, N)

    rand!(a); rand!(b);
    fill!(c, 0.); bench!(c, a, b, min);   c1=copy(c);
    fill!(c, 0.); bench!(c, a, b, min_1); c2=copy(c); @assert all(c1 .== c2)
    fill!(c, 0.); bench!(c, a, b, min_2); c3=copy(c); @assert all(c1 .== c3)

    @printf("%-10d", N)
    @printf("  %14.2e", @belapsed(bench!(c, a, b, min),   setup=(rand!(a); rand!(b))) / N)
    @printf("  %14.2e", @belapsed(bench!(c, a, b, min_1), setup=(rand!(a); rand!(b))) / N)
    @printf("  %14.2e", @belapsed(bench!(c, a, b, min_2), setup=(rand!(a); rand!(b))) / N)
    print("\n"); flush(stdout)

    global n = max(n+1, Int(floor(n*1.1)))
end
Gnuplot script
set terminal pdfcairo
set output "bench_min.pdf"

set logscale x
set yrange [0:*]

set xlabel "#elems"
set ylabel "s/elem"

plot "bench_min.dat" u 1:2 w l t "Base.min", \
     ""              u 1:3 w l t "signbit() || isnan()", \
     ""              u 1:4 w l t "signbit() | isnan()"
9 Likes

And for the sake of completeness, here is that last plot (Xeon Platinum 8256) updated to include Base.FastMath.min_fast:

min_fast is yet a bit faster, but it should be kept in mind that it

  • does not handle signed zeros
  • does not always propagate NaNs (interestingly, it seems to correctly propagate NaN payloads when it does propagate NaNs)

IEEE-754 test code
using Base.FastMath: min_fast

function repr_(x)
    if isnan(x)
        "NaN($(repr(reinterpret(UInt, x))))"
    else
        repr(x)
    end
end


# NaN with a payload
NaN1 =  reinterpret(Float64, 0x7ff800000000dead)

for a in (-Inf, -1., -0., 0., 1., Inf, NaN1)
    for b in (-Inf, -1., -0., 0., 1., Inf, NaN1)
        if min(a,b)!==min_fast(a,b)
            println("min($a, $b) -> ", "\t", repr_(min(a, b)), "\t", repr_(min_fast(a,b)))
        end
    end
end
#                       Base.min                Base.FastMath.min_fast
min(-0.0, 0.0) ->       -0.0                    0.0
min(NaN, -Inf) ->       NaN(0x7ff800000000dead) -Inf
min(NaN, -1.0) ->       NaN(0x7ff800000000dead) -1.0
min(NaN, -0.0) ->       NaN(0x7ff800000000dead) -0.0
min(NaN, 0.0) ->        NaN(0x7ff800000000dead) 0.0
min(NaN, 1.0) ->        NaN(0x7ff800000000dead) 1.0
min(NaN, Inf) ->        NaN(0x7ff800000000dead) Inf
4 Likes

Thanks for the helpful analysis.

IEEE754-2019 removed minNum maxNum, minNumMag, maxNumMag (present in IEEE754-2008) because of inconsistencies (non-associativity loomed large).; see The Removal/Demotion of MinNum and MaxNum Operations from IEEE 754ā„¢-2018.

IEEE754-2019 introduces minimum, maximum, minimumNumber, maximumNumber.

minimum and maximum behave as expected for nonNaNs (-0.0 preceeds 0.0) and return the NaN if only one arg is NaN. If both args are NaNs, it returns one of them (which one is implementation defined).

minimumNumber and maximumNumber behave as expected for nonNaNs (-0.0 preceeds 0.0) and return the nonNan number if only one arg is NaN. If both args are NaNs, it returns one of them (which one is implementation defined).

Both the 2008 and the 2019 Standards define the TotalOrder predicate which takes two args and returns true where arg1 preceeds arg2, false where arg2 preceeds arg1, and true where arg1 == arg2. The rules for this predicate can be used to specify another set of min, max, minmax functions.

Also (for TotalOrder) -0.0 preceeds 0.0, negatively signed NaNs preceed nonnegative NaNs and finite floats, nonnegatively signed NaNs follow negatively signed NaNs and finite floats; the ordering for same signed NaNs is implementation defined (although payload order is suggested in other docs ā€“ see the suggestions in the doc cited above).

other notes
  • I use functions equivalent to the Mag forms frequently
  • for NaNs with different payloads and signs: QNaNs.jl
3 Likes

Thanks for the clarification. I thought IEEE mandated the first (left) NaN to be propagated; not sure whyā€¦

The current proposed implementation does not commute with negatively signed NaNs. It does with nonnegative NaNs. On the other hand, Julia does not produce negative NaNs in the course of computation. One must introduce them to have them present. No current Julia package that I know of does this.

min(NaN, 1.0) === NaN
min(1.0, NaN) === NaN

negNaN = reinterpret(Float64, 0xfff8000000000000)
min(negNaN, 1.0) === negNaN
min(1.0, negNaN) === 1.0

Very interesting thread, thank you all! Was this implemented in a package or even a PR to Base? (Sorry if I missed such a link aboveā€¦)

1 Like

In fact negNaN (or -NaN) has been used in the test system. Weā€™d better not drop that.
At present, if the two inputs are NaNs with different signs:

  • Base.min always returns the one with positive sign
  • Base.max always returns the one with negetive sign
  • Base.minmax(x, y) always returns (x, x)

I just wonder which style is better?

If min and max also return the first of two NaNs (no matter their signs), I think the following code could handle -NaN and Float16 better:

FastFloat = Union{Float32, Float64}
_isless(x::T, y::T) where {T<:AbstractFloat} = (x < y) || (signbit(x) > signbit(y))
min(x::T, y::T) where {T<:AbstractFloat} = isnan(x) || ~isnan(y) && _isless(x, y) ? x : y
max(x::T, y::T) where {T<:AbstractFloat} = isnan(x) || ~isnan(y) && _isless(y, x) ? x : y
minmax(x::T, y::T) where {T<:AbstractFloat} = min(x, y), max(x, y)

_isless(x::T, y::T) where {T<:FastFloat} = signbit(x - y)
min(x::T, y::T) where {T<:FastFloat} = ifelse(isnan(x) | ~isnan(y) & _isless(x, y), x, y)
max(x::T, y::T) where {T<:FastFloat} = ifelse(isnan(x) | ~isnan(y) & _isless(y, x), x, y)

_isless(x::Float16, y::Float16) = _isless(widen(x), widen(y))