Performance of custom lt function in sort!

I am interested in writing my own lt function within a sort for better performance in a specific use case (in this case, I only care if the numbers are a set length apart from one another). However, it seems that any substitution of the lt function besides isless decreases performance.

Take this simple example:

@benchmark sort!(X, lt = isless) setup = (X = randn(1_000_000))

This benchmarks as

BenchmarkTools.Trial: 65 samples with 1 evaluation.
 Range (min … max):  73.737 ms …  75.956 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     74.748 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   74.770 ms ± 469.959 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▄▁▁▁    ▄  ██ ▄ ▄▁▁▁▁▁▄      ▁                  
  ▆▁▆▁▆▆▁▁▁▁▁▁▆▁████▆▁▆▆█▆▆██▁█▆███████▁▆▆▆▆▆█▆▁▆▁▆▁▁▁▁▆▆▁▆▁▁▆ ▁
  73.7 ms         Histogram: frequency by time         75.8 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

However, the simple substitution can be made as:

new_lt(x::AbstractFloat, y::AbstractFloat) = x < y
@benchmark sort!(X, lt = new_lt) setup = (X = randn(1_000_000))

This should do the same thing, I think. If anything, it should bypass checks related to isnan. However, the performance decreases, and the above code benchmarks as follows:

BenchmarkTools.Trial: 56 samples with 1 evaluation.
 Range (min … max):  86.292 ms …  90.038 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     87.803 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   87.871 ms ± 885.583 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁   ▁       ▁▁▁       ▁ ▄   ▁    ▄  ▁            ▁█           
  █▁▁▁█▁▁▆▆▆▁▁███▆▆▆▆▆▆▁█▆█▁▁▆█▆▁▆▁█▆▆█▆▆▆▁▆▁▁▆▆▆▁▁██▁▆▆▆▁▁▁▆▆ ▁
  86.3 ms         Histogram: frequency by time         89.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

I’m curious what the difference is here, and if there’s any way around it?

what’s more surprising is that < for Float64 is just Base.lt_float yet:

julia> @benchmark sort!(X, lt = Base.lt_float) setup = (X = randn(1_000_000))
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.125 s …   1.264 s  ┊ GC (min … max): 1.53% … 9.27%
 Time  (median):     1.131 s              ┊ GC (median):    1.52%
 Time  (mean ± σ):   1.163 s ± 59.183 ms  ┊ GC (mean ± σ):  3.20% ± 3.47%

  ███               █                                     █  
  ███▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.12 s         Histogram: frequency by time        1.26 s <

 Memory estimate: 1.02 GiB, allocs estimate: 68303322.

this is still possible if your “metric” is special, but it’s hard to beat what’s in Base if you’re also comparing by number size, since it’s optizmied (I assume) already.

2 Likes

The default sort order is heavily optimized in multiple ways. I don’t exactly recall but it’s possible that changing from the default sort uses a merge sort instead of a quick sort. The default sort also uses integer operations and handles NaNs in a special, clever way.

4 Likes