Performance of custom lt function in sort!

I am interested in writing my own lt function within a sort for better performance in a specific use case (in this case, I only care if the numbers are a set length apart from one another). However, it seems that substituting anything other than isless for the lt function decreases performance.

Take this simple example:

using BenchmarkTools
@benchmark sort!(X, lt = isless) setup = (X = randn(1_000_000))

This benchmarks as

BenchmarkTools.Trial: 65 samples with 1 evaluation.
 Range (min … max):  73.737 ms …  75.956 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     74.748 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   74.770 ms ± 469.959 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▄▁▁▁    ▄  ██ ▄ ▄▁▁▁▁▁▄      ▁                  
  ▆▁▆▁▆▆▁▁▁▁▁▁▆▁████▆▁▆▆█▆▆██▁█▆███████▁▆▆▆▆▆█▆▁▆▁▆▁▁▁▁▆▆▁▆▁▁▆ ▁
  73.7 ms         Histogram: frequency by time         75.8 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

However, a seemingly equivalent substitution can be made:

new_lt(x::AbstractFloat, y::AbstractFloat) = x < y
@benchmark sort!(X, lt = new_lt) setup = (X = randn(1_000_000))

This should do the same thing, I think; if anything, it should bypass the isnan-related checks. However, performance decreases, and the above code benchmarks as follows:

BenchmarkTools.Trial: 56 samples with 1 evaluation.
 Range (min … max):  86.292 ms …  90.038 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     87.803 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   87.871 ms ± 885.583 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁   ▁       ▁▁▁       ▁ ▄   ▁    ▄  ▁            ▁█           
  █▁▁▁█▁▁▆▆▆▁▁███▆▆▆▆▆▆▁█▆█▁▁▆█▆▁▆▁█▆▆█▆▆▆▁▆▁▁▆▆▆▁▁██▁▆▆▆▁▁▁▆▆ ▁
  86.3 ms         Histogram: frequency by time         89.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
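To rule out the two comparators simply producing different results, here is a quick sanity check (using the new_lt defined above): on NaN-free data both orderings agree, so the timing gap must come from how the sort machinery treats the comparator rather than from the output.

```julia
# Sanity check: on NaN-free data, new_lt and isless induce the same order,
# so the timing difference is not explained by different results.
new_lt(x::AbstractFloat, y::AbstractFloat) = x < y

X = randn(1_000)
@assert sort(X, lt = new_lt) == sort(X, lt = isless)
```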

I’m curious what the difference is here, and whether there’s any way around it.

What’s more surprising is that < for Float64 is just Base.lt_float, yet:

julia> @benchmark sort!(X, lt = Base.lt_float) setup = (X = randn(1_000_000))
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.125 s …   1.264 s  ┊ GC (min … max): 1.53% … 9.27%
 Time  (median):     1.131 s              ┊ GC (median):    1.52%
 Time  (mean ± σ):   1.163 s ± 59.183 ms  ┊ GC (mean ± σ):  3.20% ± 3.47%

  ███               █                                     █  
  ███▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.12 s         Histogram: frequency by time        1.26 s <

 Memory estimate: 1.02 GiB, allocs estimate: 68303322.
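My guess (an assumption, not something I’ve verified in the compiler) is that the huge allocation count comes from Base.lt_float being an intrinsic: passed directly as the lt argument, its arguments appear to get boxed. Wrapping it in an ordinary function gives the compiler something it can specialize on:

```julia
# Base.lt_float is a compiler intrinsic; passing it directly as `lt` seems
# to force boxing (note the 1.02 GiB of allocations above). A plain wrapper
# function restores normal specialization.
wrapped_lt(x::Float64, y::Float64) = Base.lt_float(x, y)

@assert wrapped_lt(1.0, 2.0)
@assert !wrapped_lt(2.0, 1.0)
# @benchmark sort!(X, lt = wrapped_lt) setup = (X = randn(1_000_000))
```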

Beating Base this way is still possible if your “metric” is special, but it’s hard to do if you’re just comparing by numeric value, since that path is (I assume) already optimized.

The default sort order is heavily optimized in multiple ways. I don’t recall exactly, but it’s possible that changing away from the default comparison switches the algorithm from a quicksort to a merge sort. The default sort also uses integer operations and handles NaNs in a special, clever way.
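The “integer operations” part can be sketched like this (my own illustration of the standard IEEE-754 bit trick, not the actual Base code):

```julia
# IEEE-754 doubles can be order-compared as unsigned integers after a bit
# transform: set the sign bit for non-negative values, flip all bits for
# negative ones. (NaNs excluded; Base handles those separately.)
flip(x::Float64) = (u = reinterpret(UInt64, x);
                    signbit(x) ? ~u : u ⊻ 0x8000000000000000)

int_lt(x::Float64, y::Float64) = flip(x) < flip(y)

# Agrees with the usual ordering, including across zero:
@assert int_lt(-2.0, -1.0) && int_lt(-1.0, 1.0) && !int_lt(2.0, 1.0)
```

This transform lets the sort use fast integer comparisons (and even a radix sort) instead of floating-point comparisons, which is part of why a custom lt can’t easily compete.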