Faster `min`,`max` for `Float64|32`

JeffreySarnoff · August 24, 2020, 9:45pm

That is good news. Thanks for looking at this.

ffevotte · August 24, 2020, 9:49pm

Of course, thanks! (I also checked there aren’t any in the version with Base.min either)

I was bothered because I had inconsistent results for smaller vectors, but it appears to have been a mistake on my part (polluted REPL or the like)

Judging by the look of your native code, this is on an AVX-512 machine?

Elrod · August 24, 2020, 9:58pm

LLVM likes to turn simple branches into selects if it lets it SIMD the loop.
I also have problems with inconsistent benchmark results sometimes.

I can run @benchmark and the timing is very consistent across 10_000 or so iterations, and then rerun the same line and again get very consistent results – but now maybe 10% higher or lower.
71x71 matrix multiplication with a simple @avx tripple nested for loop being the prime example where I see that. I’m not sure branch prediction isn’t involved, but using https://github.com/JuliaPerf/LinuxPerf.jl, I don’t recall seeing any major differences other than simply average # insructions / ns ranging from around 2.6 to 2.9; IIRC branch prediction and cache hits looked similar.

Yes – I was replying to your request but forgot to actually say so. I edited my comment to clarify.

JeffreySarnoff · August 24, 2020, 9:59pm

I incorporated your branch savings into the code at top. Good idea.

Palli · August 24, 2020, 10:50pm

There doesn’t seem to be anything wrong with either min or max in isolation, but with minmax you force it do run both subtractions, while it seems like you can get away with changing either, e.g.:

function man(a::T, b::T) where {T<:FastFloat}
    a_b = a - b
    (signbit(a_b) | isnan(a)) ? b : a
end

doing away with one and the same subtraction, and fewer jumps/shorter assembly:

@code_native minmax(0.0, -0.0)

JeffreySarnoff · August 25, 2020, 12:48am

I find using signbit(_) || isnan(_) benchmarks a little faster than
using signbit(_) | isnan(_). The native code is a bit tighter, as well.

I am making that adjustment. Let me know if your results differ.

JeffreySarnoff · August 25, 2020, 12:55am

Benchmarking the two approaches favors the original way. Curious.

Palli · August 25, 2020, 10:14am

I understand you want to reuse min and max, but this is likely faster:

function minmax(a::T, b::T) where {T<:FastFloat}
    a_b = a - b
    (signbit(a_b) || isnan(a)) ? (a, b) : (b, a)
end

Palli · August 25, 2020, 11:04am

This one has only one jump, and strangely more by changing to | from || (EDIT: I see it now, it has a conditional move, cmovnel, instead, probably a good thing, I also see my older code gives similar if not the same code):

function minmax(a::Float64, b::Float64)
    a_b = a - b
    ((reinterpret(UInt64, a_b) >> (64-1)) != 0) || isnan(a) ? (a, b) : (b, a)
end

Palli · August 25, 2020, 11:24am

Here Julia’s compiler is too clever, and while I’m trying to have only one jump (with | to disable short-circuiting) it adds another, and if I read the assembly correctly it does something like isnan(a) || … :

function minmax(a::Float64, b::Float64)
    a_b = a - b
    ((reinterpret(UInt64, a_b) >> (64-1))) % Bool | (isnan(a)) ? (a, b) : (b, a)
end

if I run with lower optimization I get kind of what I wanted, one jump, but longer assembly:

julia -O1

ffevotte · August 26, 2020, 6:57am

You’re right that it’s probably faster, but it loses some of the IEEE-754 compliance with NaN for max:

Test Code

const FastFloat = Union{Float32, Float64}

function minmax_(a::T, b::T) where {T<:FastFloat}
    a_b = a - b
    (signbit(a_b) || isnan(a)) ? (a, b) : (b, a)
end

max_(a, b) = minmax_(a, b)[2]

function repr_(x)
    if isnan(x)
        "NaN($(repr(reinterpret(UInt, x))))"
    else
        repr(x)
    end
end

# NaN with a payload
NaN1 =  reinterpret(Float64, 0x7ff800000000dead)

# Tests
for a in (-Inf, -1., -0., 0., 1., Inf, NaN)
    for b in (-Inf, -1., -0., 0., 1., Inf, NaN1)
        if max(a,b)!==max_(a,b)
            println("max($a, $b) -> ", "\t", repr_(max(a, b)), "\t", repr_(max_(a,b)))
        end
    end
end

Results:

#                       Base                    Proposed implementation
max(-Inf, NaN) ->       NaN(0x7ff800000000dead) -Inf
max(-1.0, NaN) ->       NaN(0x7ff800000000dead) -1.0
max(-0.0, NaN) ->       NaN(0x7ff800000000dead) -0.0
max(0.0, NaN) ->        NaN(0x7ff800000000dead) 0.0
max(1.0, NaN) ->        NaN(0x7ff800000000dead) 1.0
max(Inf, NaN) ->        NaN(0x7ff800000000dead) Inf
max(NaN, -Inf) ->       NaN(0x7ff8000000000000) -Inf
max(NaN, -1.0) ->       NaN(0x7ff8000000000000) -1.0
max(NaN, -0.0) ->       NaN(0x7ff8000000000000) -0.0
max(NaN, 0.0) ->        NaN(0x7ff8000000000000) 0.0
max(NaN, 1.0) ->        NaN(0x7ff8000000000000) 1.0
max(NaN, Inf) ->        NaN(0x7ff8000000000000) Inf

(This is because it can’t be assumed that signbit(b-a) is different from signbit(a-b): while this is valid for most values, it isn’t when a or b is NaN)

ffevotte · August 26, 2020, 9:33am

More exhaustive benchmarking tends to corroborate this: signbit || isnan seems to be significantly faster for AVX-2 architectures. Here is a typical plot of the time per element vs array size, for AVX-2 systems (this particular one has been obtained with an Intel Core i5-6200U CPU, but I’ve also tested on Core i5-9300H and Xeon E-2176G, which yield similar curves)

Whereas there seems to be a slight advantage for the signbit | isnan implementation for AVX-512 systems (the curve below has been obtained for an Intel Xeon Platinum 8256 CPU, but I’ve also had similar results on a Xeon Gold 6128)

Overall, the performance difference is more significant for AVX-2, so I think you’re right to keep the signbit || isnan variant by default.

In case anyone would like to reproduce, below is the code used to generate the data and plot the graphs (I must shamefully admit I used gnuplot instead of Plots.jl)

Benchmarking code

const FastFloat = Union{Float32, Float64}

function min_1(a::T, b::T) where {T<:FastFloat}
    a_b = a-b
    (signbit(a_b) || isnan(a)) ? a : b
end

function min_2(a::T, b::T) where {T<:FastFloat}
    a_b = a-b
    (signbit(a_b) | isnan(a)) ? a : b
end

function bench!(c, a, b, f)
    @simd for i in eachindex(c)
        @inbounds c[i] = f(a[i], b[i])
    end
end

using Random
using BenchmarkTools
using Printf
BenchmarkTools.DEFAULT_PARAMETERS.samples = 100000

a = Float64[]
b = similar(a)
c = similar(a)

n = 1
@printf("%-10s", "#elems")
@printf("  %14s", "Base.min")
@printf("  %14s", "signbit||isnan")
@printf("  %14s", "signbit|isnan")
print("\n")
while n <= 300
    N = n*64
    resize!(a, N)
    resize!(b, N)
    resize!(c, N)

    rand!(a); rand!(b);
    fill!(c, 0.); bench!(c, a, b, min);   c1=copy(c);
    fill!(c, 0.); bench!(c, a, b, min_1); c2=copy(c); @assert all(c1 .== c2)
    fill!(c, 0.); bench!(c, a, b, min_2); c3=copy(c); @assert all(c1 .== c3)

    @printf("%-10d", N)
    @printf("  %14.2e", @belapsed(bench!(c, a, b, min),   setup=(rand!(a); rand!(b))) / N)
    @printf("  %14.2e", @belapsed(bench!(c, a, b, min_1), setup=(rand!(a); rand!(b))) / N)
    @printf("  %14.2e", @belapsed(bench!(c, a, b, min_2), setup=(rand!(a); rand!(b))) / N)
    print("\n"); flush(stdout)

    global n = max(n+1, Int(floor(n*1.1)))
end

Gnuplot script

set terminal pdfcairo
set output "bench_min.pdf"

set logscale x
set yrange [0:*]

set xlabel "#elems"
set ylabel "s/elem"

plot "bench_min.dat" u 1:2 w l t "Base.min", \
     ""              u 1:3 w l t "signbit() || isnan()", \
     ""              u 1:4 w l t "signbit() | isnan()"

ffevotte · August 26, 2020, 10:23am

And for the sake of completeness, here is that last plot (Xeon Platinum 8256) updated to include Base.FastMath.min_fast:

min_fast is yet a bit faster, but it should be kept in mind that it

does not handle signed zeros
does not always propagate NaNs (interestingly, it seems to correctly propagate NaN payloads when it does propagate NaNs)

IEEE-754 test code

using Base.FastMath: min_fast

function repr_(x)
    if isnan(x)
        "NaN($(repr(reinterpret(UInt, x))))"
    else
        repr(x)
    end
end


# NaN with a payload
NaN1 =  reinterpret(Float64, 0x7ff800000000dead)

for a in (-Inf, -1., -0., 0., 1., Inf, NaN1)
    for b in (-Inf, -1., -0., 0., 1., Inf, NaN1)
        if min(a,b)!==min_fast(a,b)
            println("min($a, $b) -> ", "\t", repr_(min(a, b)), "\t", repr_(min_fast(a,b)))
        end
    end
end

#                       Base.min                Base.FastMath.min_fast
min(-0.0, 0.0) ->       -0.0                    0.0
min(NaN, -Inf) ->       NaN(0x7ff800000000dead) -Inf
min(NaN, -1.0) ->       NaN(0x7ff800000000dead) -1.0
min(NaN, -0.0) ->       NaN(0x7ff800000000dead) -0.0
min(NaN, 0.0) ->        NaN(0x7ff800000000dead) 0.0
min(NaN, 1.0) ->        NaN(0x7ff800000000dead) 1.0
min(NaN, Inf) ->        NaN(0x7ff800000000dead) Inf

JeffreySarnoff · August 28, 2020, 7:01pm

Thanks for the helpful analysis.

JeffreySarnoff · August 28, 2020, 9:53pm

IEEE754-2019 removed minNum maxNum, minNumMag, maxNumMag (present in IEEE754-2008) because of inconsistencies (non-associativity loomed large).; see The Removal/Demotion of MinNum and MaxNum Operations from IEEE 754™-2018.

IEEE754-2019 introduces minimum, maximum, minimumNumber, maximumNumber.

minimum and maximum behave as expected for nonNaNs (-0.0 preceeds 0.0) and return the NaN if only one arg is NaN. If both args are NaNs, it returns one of them (which one is implementation defined).

minimumNumber and maximumNumber behave as expected for nonNaNs (-0.0 preceeds 0.0) and return the nonNan number if only one arg is NaN. If both args are NaNs, it returns one of them (which one is implementation defined).

Both the 2008 and the 2019 Standards define the TotalOrder predicate which takes two args and returns true where arg1 preceeds arg2, false where arg2 preceeds arg1, and true where arg1 == arg2. The rules for this predicate can be used to specify another set of min, max, minmax functions.

Also (for TotalOrder) -0.0 preceeds 0.0, negatively signed NaNs preceed nonnegative NaNs and finite floats, nonnegatively signed NaNs follow negatively signed NaNs and finite floats; the ordering for same signed NaNs is implementation defined (although payload order is suggested in other docs – see the suggestions in the doc cited above).

other notes

I use functions equivalent to the Mag forms frequently
for NaNs with different payloads and signs: QNaNs.jl

ffevotte · August 28, 2020, 10:15pm

Thanks for the clarification. I thought IEEE mandated the first (left) NaN to be propagated; not sure why…

JeffreySarnoff · August 28, 2020, 11:40pm

The current proposed implementation does not commute with negatively signed NaNs. It does with nonnegative NaNs. On the other hand, Julia does not produce negative NaNs in the course of computation. One must introduce them to have them present. No current Julia package that I know of does this.

min(NaN, 1.0) === NaN
min(1.0, NaN) === NaN

negNaN = reinterpret(Float64, 0xfff8000000000000)
min(negNaN, 1.0) === negNaN
min(1.0, negNaN) === 1.0

ranocha · August 11, 2021, 12:57pm

Very interesting thread, thank you all! Was this implemented in a package or even a PR to Base? (Sorry if I missed such a link above…)

N5N3 · August 17, 2021, 12:25pm

In fact negNaN (or -NaN) has been used in the test system. We’d better not drop that.
At present, if the two inputs are NaNs with different signs:

Base.min always returns the one with positive sign
Base.max always returns the one with negetive sign
Base.minmax(x, y) always returns (x, x)

I just wonder which style is better?

If min and max also return the first of two NaNs (no matter their signs), I think the following code could handle -NaN and Float16 better:

FastFloat = Union{Float32, Float64}
_isless(x::T, y::T) where {T<:AbstractFloat} = (x < y) || (signbit(x) > signbit(y))
min(x::T, y::T) where {T<:AbstractFloat} = isnan(x) || ~isnan(y) && _isless(x, y) ? x : y
max(x::T, y::T) where {T<:AbstractFloat} = isnan(x) || ~isnan(y) && _isless(y, x) ? x : y
minmax(x::T, y::T) where {T<:AbstractFloat} = min(x, y), max(x, y)

_isless(x::T, y::T) where {T<:FastFloat} = signbit(x - y)
min(x::T, y::T) where {T<:FastFloat} = ifelse(isnan(x) | ~isnan(y) & _isless(x, y), x, y)
max(x::T, y::T) where {T<:FastFloat} = ifelse(isnan(x) | ~isnan(y) & _isless(y, x), x, y)

_isless(x::Float16, y::Float16) = _isless(widen(x), widen(y))