Faster `min`,`max` for `Float64|32`

JeffreySarnoff · August 24, 2020, 5:46pm

These realizations have simpler native code structure and benchmark faster than the current versions.

These implementations may be IEEE754 compliant.

const FastFloat = Union{Float32, Float64}

function min(a::T, b::T) where {T<:FastFloat}
    a_b = a - b
    (signbit(a_b) || isnan(a)) ? a : b
end

function max(a::T, b::T) where {T<:FastFloat}
    b_a = b - a
    (signbit(b_a) || isnan(a)) ? a : b
end

minmax(a::T, b::T) where {T<:FastFloat} =
    (min(a, b), max(a, b))

simeonschaub · August 24, 2020, 5:56pm

The problem with that seems to be that it doesn’t respect signed zeros on the right hand side, since:

minmax(0.0, -0.0) === (0.0, 0.0)

Perhaps that might be acceptable in most situations, but it’s probably a breaking change.

JeffreySarnoff · August 24, 2020, 5:56pm

fair point

DNF · August 24, 2020, 5:58pm

I too was surprised to see, recently, that min and max are quite a bit slower than, say, ifelse(x<y, x, y). But, looking into the source, you can see it does some extra stuff to cover corner cases.

For optimum performance, it’s better to roll your own.

simeonschaub · August 24, 2020, 6:04pm

Perhaps it makes sense to have functions unsafe_(min|max)+, for when you don’t care about strict IEEE compliance. Maybe in its own package?

JeffreySarnoff · August 24, 2020, 6:13pm

This was written before the current implementation (see above).

I like that idea. When I am using min, max the sign of zero is a nonissue.
The most common request I get regarding -0.0 is to quash it.

These functions should be IEEE754 conformant otherwise.

OK. I will package these and provide the appropriate coverage and info.

oxinabox · August 24, 2020, 6:58pm

We have Base.FastMath.min_fast etc.
which have very short LLVM

fior jmin being the version posted at the top

julia> @code_llvm debuginfo=:none Base.FastMath.min_fast(-0.0, 0.0)

define double @julia_min_fast_512(double, double) {
top:
  %.inv = fcmp olt double %0, %1
  %2 = select i1 %.inv, double %0, double %1
  ret double %2
}

julia> @code_llvm debuginfo=:none jmin(-0.0, 0.0)

define double @julia_jmin_514(double, double) {
top:
  %2 = fsub double %0, %1
  %3 = bitcast double %2 to i64
  %4 = icmp sgt i64 %3, -1
  br i1 %4, label %L6, label %L5

L5:                                               ; preds = %top
  ret double %0

L6:                                               ; preds = %top
  %5 = fsub double %0, %0
  %6 = fadd double %5, %1
  ret double %6
}

My benchmarking has Base.FastMath.min_fast come in a tiny bit faster. On various distributions of random inputs.
But I am not fully convinced it isn’t noise. I only tried 2-3 different times with BenchmarkTools.

Note however that FastMath operations are allowed to disregard both signed zeros and NaNs and Infss.
and they do in this case.

I think your defintions handle NaN and Inf correctly?

ffevotte · August 24, 2020, 6:59pm

AFACT, the +(a-a) trick covers corner cases involving NaN, but brings other issues with Inf:

julia> min_(Inf, -Inf)
NaN

JeffreySarnoff · August 24, 2020, 7:03pm

umm that’s true …

@simeonschaub I may have fixed the -0.0 handling (see edited code) with little additional overhead in benchmarking.

JeffreySarnoff · August 24, 2020, 7:26pm

I may have IEEE compliance with the latest revision (above).
Given the very good reasoning of others in this thread, I say “may”.

ffevotte · August 24, 2020, 7:27pm

Here is a minimalistic series of tests checking IEEE-754 in some corner cases:

Code

const FastFloat = Union{Float32, Float64}

min_(a::T, b::T) where {T<:FastFloat} =
    signbit(a - b) ? a : ifelse(b === -0.0, -0.0, b + (a - a))

max_(a::T, b::T) where {T<:FastFloat} =
    signbit(b - a) ? a : b + (a - a)

minmax_(a::T, b::T) where {T<:FastFloat} =
    (min(a, b), max(a, b))


function repr_(x)
    if isnan(x)
        "NaN($(repr(reinterpret(UInt, x))))"
    else
        repr(x)
    end
end

# NaN with a payload
NaN1 =  reinterpret(Float64, 0x7ff800000000dead)

for a in (-Inf, -1., -0., 0., 1., Inf, NaN)
    for b in (-Inf, -1., -0., 0., 1., Inf, NaN1)
        if min(a,b)!==min_(a,b)
            println("min($a, $b) -> ", "\t", repr_(min(a, b)), "\t", repr_(min_(a,b)))
        end
        if max(a,b)!==max_(a,b)
            println("max($a, $b) -> ", "\t", repr_(max(a, b)), "\t", repr_(max_(a,b)))
        end
    end
end

Results:

#                     Base                     Proposed implementation
max(-Inf,  -1.0)  ->  -1.0                     NaN(0xfff8000000000000)
max(-Inf,  -0.0)  ->  -0.0                     NaN(0xfff8000000000000)
max(-Inf,  0.0)   ->  0.0                      NaN(0xfff8000000000000)
max(-Inf,  1.0)   ->  1.0                      NaN(0xfff8000000000000)
max(-Inf,  Inf)   ->  Inf                      NaN(0xfff8000000000000)
min(-Inf,  NaN)   ->  NaN(0x7ff800000000dead)  NaN(0xfff8000000000000)
max(-Inf,  NaN)   ->  NaN(0x7ff800000000dead)  NaN(0xfff8000000000000)
max(-1.0,  -0.0)  ->  -0.0                     0.0
max(-0.0,  -0.0)  ->  -0.0                     0.0
min(Inf,   -Inf)  ->  -Inf                     NaN(0xfff8000000000000)
min(Inf,   -1.0)  ->  -1.0                     NaN(0xfff8000000000000)
min(Inf,   0.0)   ->  0.0                      NaN(0xfff8000000000000)
min(Inf,   1.0)   ->  1.0                      NaN(0xfff8000000000000)
min(Inf,   NaN)   ->  NaN(0x7ff800000000dead)  NaN(0xfff8000000000000)
max(Inf,   NaN)   ->  NaN(0x7ff800000000dead)  NaN(0xfff8000000000000)
min(NaN,   -0.0)  ->  NaN(0x7ff8000000000000)  -0.0
min(NaN,   NaN)   ->  NaN(0x7ff800000000dead)  NaN(0x7ff8000000000000)
max(NaN,   NaN)   ->  NaN(0x7ff800000000dead)  NaN(0x7ff8000000000000)

There are issues with

NaN payloads (not sure anybody uses them anyway)
signed 0 (for max)
infinities

JeffreySarnoff · August 24, 2020, 7:29pm

Signed zero for max now works.
Infinities now work.

ffevotte · August 24, 2020, 7:30pm

Looks much better, thanks!

Updated code

const FastFloat = Union{Float32, Float64}

function min_(a::T, b::T) where {T<:FastFloat}
    a_b = a-b
    signbit(a_b) ? a  : ifelse(isnan(b), a, b)
end

function max_(a::T, b::T) where {T<:FastFloat}
    b_a = b-a
    signbit(b_a) ? a : ifelse(isnan(a), a, b)
end

function repr_(x)
    if isnan(x)
        "NaN($(repr(reinterpret(UInt, x))))"
    else
        repr(x)
    end
end


# NaN with a payload
NaN1 =  reinterpret(Float64, 0x7ff800000000dead)

for a in (-Inf, -1., -0., 0., 1., Inf, NaN)
    for b in (-Inf, -1., -0., 0., 1., Inf, NaN1)
        if min(a,b)!==min_(a,b)
            println("min($a, $b) -> ", "\t", repr_(min(a, b)), "\t", repr_(min_(a,b)))
        end
        if max(a,b)!==max_(a,b)
            println("max($a, $b) -> ", "\t", repr_(max(a, b)), "\t", repr_(max_(a,b)))
        end
    end
end

Results:

#                     Base                     Proposed implementation
min(-Inf,  NaN)   ->  NaN(0x7ff800000000dead)  -Inf
min(-1.0,  NaN)   ->  NaN(0x7ff800000000dead)  -1.0
min(-0.0,  NaN)   ->  NaN(0x7ff800000000dead)  -0.0
min(0.0,   NaN)   ->  NaN(0x7ff800000000dead)  0.0
min(1.0,   NaN)   ->  NaN(0x7ff800000000dead)  1.0
min(Inf,   NaN)   ->  NaN(0x7ff800000000dead)  Inf
min(NaN,   -Inf)  ->  NaN(0x7ff8000000000000)  -Inf
min(NaN,   -1.0)  ->  NaN(0x7ff8000000000000)  -1.0
min(NaN,   -0.0)  ->  NaN(0x7ff8000000000000)  -0.0
min(NaN,   0.0)   ->  NaN(0x7ff8000000000000)  0.0
min(NaN,   1.0)   ->  NaN(0x7ff8000000000000)  1.0
min(NaN,   Inf)   ->  NaN(0x7ff8000000000000)  Inf
min(NaN,   NaN)   ->  NaN(0x7ff800000000dead)  NaN(0x7ff8000000000000)
max(NaN,   NaN)   ->  NaN(0x7ff800000000dead)  NaN(0x7ff8000000000000)

ffevotte · August 24, 2020, 7:41pm

Ah, there was a typo in your latest min implementation; it should read:

const FastFloat = Union{Float32, Float64}

function min_(a::T, b::T) where {T<:FastFloat}
    a_b = a-b
    signbit(a_b) ? a : ifelse(isnan(a), a, b)
end

Complete code

const FastFloat = Union{Float32, Float64}

function min_(a::T, b::T) where {T<:FastFloat}
a_b = a-b
signbit(a_b) ? a : ifelse(isnan(a), a, b)
end

function max_(a::T, b::T) where {T<:FastFloat}
b_a = b-a
signbit(b_a) ? a : ifelse(isnan(a), a, b)
end

function repr_(x)
if isnan(x)
“NaN($(repr(reinterpret(UInt, x))))”
else
repr(x)
end
end

NaN with a payload

NaN1 = reinterpret(Float64, 0x7ff800000000dead)

for a in (-Inf, -1., -0., 0., 1., Inf, NaN)
for b in (-Inf, -1., -0., 0., 1., Inf, NaN1)
if min(a,b)!==min_(a,b)
println("min($a, $b) → ", “\t”, repr_(min(a, b)), “\t”, repr_(min_(a,b)))
end
if max(a,b)!==max_(a,b)
println("max($a, $b) → ", “\t”, repr_(max(a, b)), “\t”, repr_(max_(a,b)))
end
end
end

With that, only NaN payloads cause issues:

#                       Base                    Proposed
min(NaN, NaN) ->        NaN(0x7ff800000000dead) NaN(0x7ff8000000000000)
max(NaN, NaN) ->        NaN(0x7ff800000000dead) NaN(0x7ff8000000000000)

JeffreySarnoff · August 24, 2020, 7:42pm

thank you – fixing

simeonschaub · August 24, 2020, 7:45pm

Is that still faster? That would be awesome! Different NaN payloads really seem like a no-brainer.

JeffreySarnoff · August 24, 2020, 7:56pm

Yes. These are still faster.
(quick benchmarking with @btime fn.(a,b) setup=(a=avec; b=bvec;)

with vectors of length 10_000

min, max are ~1.1x faster.
minmax is a great deal faster ~4.5x.

with vectors of length 64

min, max are ~1.15x faster.
minmax is ~1.5x faster.

I expect these implementations to work better than the current ones when combined with SIMD or LoopVectorization or …

@ffevotte If you have a way to test that, I am interested in the results.

ffevotte · August 24, 2020, 9:29pm

On the first round of testing, this is what gets me the best results so far (note that I slightly changed the implementation of min, fusing both tests to get an extra bit of performance):

const FastFloat = Union{Float32, Float64}

function min_(a::T, b::T) where {T<:FastFloat}
    a_b = a-b
    (signbit(a_b) | isnan(a)) ? a : b  # <-- only one branch here
end

function bench!(c, a, b, f)
    @simd for i in eachindex(c)
        @inbounds c[i] = f(a[i], b[i])
    end
end

using Random

a = zeros(10240);
b = similar(a);
c = similar(a);

rand!(a); rand!(b);
fill!(c, 0.); bench!(c, a, b, min);  c1=copy(c);
fill!(c, 0.); bench!(c, a, b, min_); c2=copy(c);
@assert all(c1 .== c2)

using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.samples = 100_000

Yielding, for Base.min:

julia> @benchmark bench!(c, a, b, min)   setup=(rand!(a); rand!(b))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.626 μs (0.00% GC)
  median time:      5.794 μs (0.00% GC)
  mean time:        5.914 μs (0.00% GC)
  maximum time:     57.391 μs (0.00% GC)
  --------------
  samples:          83420
  evals/sample:     6

vs. the proposed implementation:

julia> @benchmark bench!(c, a, b, min_)  setup=(rand!(a); rand!(b))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.130 μs (0.00% GC)
  median time:      4.247 μs (0.00% GC)
  mean time:        4.294 μs (0.00% GC)
  maximum time:     15.504 μs (0.00% GC)
  --------------
  samples:          91969
  evals/sample:     7

I’ve made initial attempts at manually vectorized implementations of the loop, but haven’t managed to get good results yet. I haven’t tried LoopVectorization either yet.

EDIT: this is on an AVX2 machine. It would be interesting to test this on an AVX512 system as well

Also, I suspect the benchmarks might not be entirely reliable because of branch prediction issues.

JeffreySarnoff · August 24, 2020, 9:35pm

nice!

Elrod · August 24, 2020, 9:39pm

There aren’t any branches other than whether or not we’re getting another iteration of the loop

L144:
        vmovupd zmm2, zmmword ptr [rcx + 8*rax]
        vmovupd zmm3, zmmword ptr [rcx + 8*rax + 64]
        vmovupd zmm4, zmmword ptr [rcx + 8*rax + 128]
        vmovupd zmm5, zmmword ptr [rcx + 8*rax + 192]
        vmovupd zmm6, zmmword ptr [rdx + 8*rax]
        vmovupd zmm7, zmmword ptr [rdx + 8*rax + 64]
        vmovupd zmm8, zmmword ptr [rdx + 8*rax + 128]
        vmovupd zmm9, zmmword ptr [rdx + 8*rax + 192]
        vsubpd  zmm10, zmm2, zmm6
        vsubpd  zmm11, zmm3, zmm7
        vsubpd  zmm12, zmm4, zmm8
        vsubpd  zmm13, zmm5, zmm9
        vpcmpgtq        k1, zmm10, zmm1
        vpcmpgtq        k2, zmm11, zmm1
        vpcmpgtq        k3, zmm12, zmm1
        vpcmpgtq        k4, zmm13, zmm1
        vcmpordpd       k1 {k1}, zmm2, zmm0
        vcmpordpd       k2 {k2}, zmm3, zmm0
        vcmpordpd       k3 {k3}, zmm4, zmm0
        vcmpordpd       k4 {k4}, zmm5, zmm0
        vmovapd zmm2 {k1}, zmm6
        vmovapd zmm3 {k2}, zmm7
        vmovapd zmm4 {k3}, zmm8
        vmovapd zmm5 {k4}, zmm9
        vmovupd zmmword ptr [rsi + 8*rax], zmm2
        vmovupd zmmword ptr [rsi + 8*rax + 64], zmm3
        vmovupd zmmword ptr [rsi + 8*rax + 128], zmm4
        vmovupd zmmword ptr [rsi + 8*rax + 192], zmm5
        add     rax, 32
        cmp     rdi, rax
        jne     L144

Benchmarks with an AVX512 machine:

julia> @benchmark bench!(c, a, b, min)   setup=(rand!(a); rand!(b))
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.140 μs (0.00% GC)
  median time:      2.183 μs (0.00% GC)
  mean time:        2.186 μs (0.00% GC)
  maximum time:     7.222 μs (0.00% GC)
  --------------
  samples:          100000
  evals/sample:     9

julia> @benchmark bench!(c, a, b, min_)   setup=(rand!(a); rand!(b))
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.783 μs (0.00% GC)
  median time:      1.799 μs (0.00% GC)
  mean time:        1.802 μs (0.00% GC)
  maximum time:     2.930 μs (0.00% GC)
  --------------
  samples:          100000
  evals/sample:     10

With smaller arrays, where we aren’t starved on memory, we see a much larger difference. Using a length of 1024 instead:

julia> @benchmark bench!($c, $a, $b, min)   setup=(rand!($a); rand!($b))
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     175.345 ns (0.00% GC)
  median time:      176.199 ns (0.00% GC)
  mean time:        176.495 ns (0.00% GC)
  maximum time:     216.782 ns (0.00% GC)
  --------------
  samples:          38994
  evals/sample:     719

julia> @benchmark bench!($c, $a, $b, min_)   setup=(rand!($a); rand!($b))
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     86.452 ns (0.00% GC)
  median time:      89.048 ns (0.00% GC)
  mean time:        89.098 ns (0.00% GC)
  maximum time:     130.425 ns (0.00% GC)
  --------------
  samples:          57633
  evals/sample:     958

Topic		Replies	Views
min(NaN,5) = NaN (why no @fastmath?) General Usage nan	47	1913	February 2, 2023
`minimum()` 3x - 6x slower than `numpy.min()` Performance question	16	531	November 2, 2024
C implementation of function being ~4 times faster even absence of allocs Performance ccall	28	1397	March 5, 2025
@fastmath for library functions? (2x slowdown of binary search because of NaN checks in Base) Performance fast-math	14	760	January 30, 2023
Why isn't minmax branchless? Internals & Design	22	4576	July 20, 2021

Faster `min`,`max` for `Float64|32`

This was written before the current implementation (see above).

NaN with a payload

Related topics