Float64 comparison operator performance

I expect the performance of a < b to be the same as Base.lt(Base.Forward, a, b) if all type information is known at compile time, but I’m observing a 2.5x difference with Float64.

using BenchmarkTools
using Random

Random.seed!(0)
xs = rand(10^6)

function f1(a::Vector)
    n = length(a)
    c = 0
    for i = 2:n
        # equivalent to: if Base.lt_float(a[i-1], a[i])
        if a[i-1] < a[i]
            c += 1
        end
    end
    return c
end

function f2(a::Vector{Float64})
    n = length(a)
    c = 0
    for i = 2:n
        # equivalent to: if Base.fpislt(a[i-1], a[i])
        if Base.lt(Base.Forward, a[i-1], a[i])
            c += 1
        end
    end
    return c
end

julia> @benchmark f1($xs)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     908.224 μs (0.00% GC)
  median time:      937.959 μs (0.00% GC)
  mean time:        951.230 μs (0.00% GC)
  maximum time:     1.760 ms (0.00% GC)
  --------------
  samples:          5229
  evals/sample:     1

julia> @benchmark f2($xs)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.653 ms (0.00% GC)
  median time:      2.806 ms (0.00% GC)
  mean time:        2.831 ms (0.00% GC)
  maximum time:     4.744 ms (0.00% GC)
  --------------
  samples:          1763
  evals/sample:     1

The root of the problem seems to be that < compiles down to the faster Base.lt_float intrinsic, while Base.lt (via isless) compiles down to the slower Base.fpislt:
https://github.com/JuliaLang/julia/blob/9a1dbc038587c6072ac99699e416cbc8908054ed/base/float.jl#L458-L465

julia> @code_native 1.0 < 2.0
    .text
; ┌ @ float.jl:452 within `<'
    vucomisd    %xmm0, %xmm1
    seta    %al
    retq
    nopl    (%rax,%rax)
; └

julia> @code_native isless(1.0, 2.0)
    .text
; ┌ @ float.jl:459 within `isless'
    vmovq   %xmm0, %rax
    vmovq   %xmm1, %rcx
    testq   %rax, %rax
    sets    %dl
    setns   %sil
    cmpq    %rcx, %rax
    seta    %al
    setl    %cl
    andb    %dl, %al
    andb    %sil, %cl
    orb %al, %cl
    vucomisd    %xmm1, %xmm0
    setnp   %dl
    andb    %cl, %dl
    vucomisd    %xmm1, %xmm1
    setp    %cl
    vucomisd    %xmm0, %xmm0
    setnp   %al
    andb    %cl, %al
    orb %dl, %al
    retq
; └

What is the purpose of these two variants?

Also, it would be nice to look at the source of non-generic functions directly.

julia> @code_native Base.lt_float(1.0, 2.0)
ERROR: ArgumentError: argument is not a generic function

julia> @code_native Base.fpislt(1.0, 2.0)
ERROR: ArgumentError: argument is not a generic function
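
A workaround: these are compiler intrinsics rather than generic functions, but wrapping one in an ordinary function makes the reflection macros usable (the wrapper names below are arbitrary):

lt_float_wrapper(a, b) = Base.lt_float(a, b)
fpislt_wrapper(a, b) = Base.fpislt(a, b)

@code_native lt_float_wrapper(1.0, 2.0)   # now prints the native code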

Base.lt (which falls back to isless) defines a total order, but < does not, since any comparison involving NaN returns false under the IEEE floating-point standard:

julia> Base.lt(Base.Forward, NaN, 1.0)
false

julia> Base.lt(Base.Forward, 1.0, NaN)
true

julia> NaN < 1.0
false

julia> 1.0 < NaN
false
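
Handling the extra cases (NaN ordered after everything, -0.0 before 0.0) is what makes isless compile to so many more instructions than <. Roughly, the semantics it must implement look like this in plain Julia (a sketch, not Base’s actual code, which uses the fpislt intrinsic):

function total_order_lt(x::Float64, y::Float64)
    isnan(x) && return false             # NaN is ordered after everything
    isnan(y) && return true              # every non-NaN is less than NaN
    if x == y                            # equal values: only signed zeros differ
        return signbit(x) & !signbit(y)  # -0.0 sorts before 0.0
    end
    return x < y                         # ordinary IEEE comparison otherwise
end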

Is it possible to get the speed of < together with the flexible ordering interface offered by Base.lt and Base.Ordering?
I’m trying to make DataStructures behave more like sort, where users can drop in their own orderings, but I don’t want to introduce a performance regression for the common case, which does best with the existing hardcoded <.
I’m thinking of something like the following, but I also want to make sure it stays compatible with non-float types.

Base.lt(CustomForwardFaster, a, b)

Is this a situation that would benefit from @fastmath?

Edit:
@fastmath does not improve benchmarking results.
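
For reference, a sketch of one way @fastmath can be applied here; it rewrites the comparison into Base.FastMath.lt_fast, which for Float64 should reduce to the same hardware comparison as <, so no speedup is expected:

function f1_fastmath(a::Vector)
    n = length(a)
    c = 0
    for i = 2:n
        if @fastmath a[i-1] < a[i]   # lowered to Base.FastMath.lt_fast
            c += 1
        end
    end
    return c
end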

Edit 2:

Here’s my workaround: it disregards the special NaN cases for floats but still allows drop-in replacement by other Base.Ordering types:

# Assumes NaN-free data: with NaNs present, < is not a total order.
struct FasterForward <: Base.Ordering end
Base.lt(o::FasterForward, a, b) = a < b

function f3(a::Vector{Float64})
    n = length(a)
    c = 0
    for i = 2:n
        if Base.lt(FasterForward(), a[i-1], a[i])
            c += 1
        end
    end
    return c
end

julia> @benchmark f3($xs)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     908.760 μs (0.00% GC)
  median time:      921.672 μs (0.00% GC)
  mean time:        939.581 μs (0.00% GC)
  maximum time:     1.666 ms (0.00% GC)
  --------------
  samples:          5293
  evals/sample:     1

Why do you need to go through those constructs? Most APIs which need a comparison operator allow you to provide one.
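
For example, sort! accepts a plain comparison function via its lt keyword:

sort!(v, lt = (a, b) -> a < b)   # for some vector v; same comparison as FasterForward()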

That’s a good question.

My understanding is that passing an ordering type is more flexible than passing a comparison operator or function.

For example, let’s say you have a custom struct, and you want to include some additional information about how to perform comparisons on these structs. These comparison rules might also change while the program is running.

You can easily include all of these comparison rules as data in a Base.Ordering struct. Here’s a minimal sketch of that process (the struct and field names are hypothetical):
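
struct Item
    id::Int
end

# The comparison rules live inside the ordering as plain data,
# so they can be updated while the program runs.
struct PriorityOrdering <: Base.Ordering
    priority::Dict{Int,Float64}
end

Base.lt(o::PriorityOrdering, a::Item, b::Item) = o.priority[a.id] < o.priority[b.id]

ord = PriorityOrdering(Dict(1 => 0.3, 2 => 0.7))
items = [Item(2), Item(1)]
sort!(items, order = ord)   # Item(1) sorts first
ord.priority[1] = 0.9       # the rules change at runtime
sort!(items, order = ord)   # now Item(2) sorts first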

The alternative is to convert this comparison rule data into a new comparison function each time the rules change. I can speculate on the downsides of this latter approach, but I haven’t tried it out yet. I assume Base.Ordering was provided as an alternative.
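
For contrast, a sketch of that closure-based alternative, reusing the hypothetical names from above:

make_lt(priority) = (a, b) -> priority[a.id] < priority[b.id]
sort!(items, lt = make_lt(ord.priority))   # rebuild the closure whenever the rules change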

Maybe another reader could offer more concrete info about the origins and benefits of Base.Ordering.

As another aside, it seems that using FasterForward (which achieved 2x performance in the earlier example) is slightly slower when sorting 10^6 values, but still faster for 10^3 values.

I’m not sure what could explain this result.

using BenchmarkTools
using Random
Random.seed!(0)
xs = rand(10^6)
struct FasterForward <: Base.Ordering end
Base.lt(o::FasterForward, a, b) = a < b

julia> @benchmark sort!(v, order=ord) setup=(v = copy(xs); ord = Base.Forward)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     69.552 ms (0.00% GC)
  median time:      69.888 ms (0.00% GC)
  mean time:        70.117 ms (0.00% GC)
  maximum time:     74.023 ms (0.00% GC)
  --------------
  samples:          68
  evals/sample:     1

julia> @benchmark sort!(v, order=ord) setup=(v = copy(xs); ord = FasterForward())
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     76.176 ms (0.00% GC)
  median time:      76.956 ms (0.00% GC)
  mean time:        78.676 ms (0.00% GC)
  maximum time:     99.118 ms (0.00% GC)
  --------------
  samples:          61
  evals/sample:     1


julia> xs = rand(10^3);

julia> @benchmark sort!(v, order=ord) setup=(v = copy(xs); ord = Base.Forward)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.262 μs (0.00% GC)
  median time:      7.548 μs (0.00% GC)
  mean time:        7.672 μs (0.00% GC)
  maximum time:     12.796 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark sort!(v, order=ord) setup=(v = copy(xs); ord = FasterForward())
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.389 μs (0.00% GC)
  median time:      6.638 μs (0.00% GC)
  mean time:        6.809 μs (0.00% GC)
  maximum time:     14.075 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

AFAIK Base.Ordering is part of an internal API, see

https://github.com/JuliaLang/julia/pull/22388#issuecomment-362380962

I agree that it is confusing.

I’d like to know best practices on whether packages should rely on Ordering internally and in external APIs.

I read through that linked issue, and it seems that Ordering is here to stay.

DataFrames.jl uses Ordering:
https://github.com/JuliaData/DataFrames.jl/blob/118a30007be0db10c6a5b1c939cd29526cf61f63/src/abstractdataframe/sort.jl#L73-L78

DataStructures.jl has an open issue requesting a conversion to Ordering:
https://github.com/JuliaCollections/DataStructures.jl/issues/243

My understanding (from reading various comments) is that

  1. Base.Ordering was necessary as a compiler optimization (besides other useful purposes it served), but this is no longer true

  2. but whether to remove it altogether and what (if anything) should replace it is not decided, because of other priorities.

So I would not rely on it for now.

Also, since it is neither exported nor mentioned in the manual, I don’t think it is part of the official API. Technically it could be removed at any time (but of course deprecating it first would be more elegant).