Contradictory nested function performance and allocation

Recently when working on sorting data with Hilbert curves, I noticed something very peculiar when benchmarking.

function hilbert_sort(m::AbstractMatrix{<:Integer}, b::Integer; rev::Bool = false)
    return m[:, hilbert_sortperm(m, b; rev)]
end

function hilbert_sort!(m::AbstractMatrix{<:Integer}, b::Integer; rev::Bool = false)
    return Base.permutecols!!(m, hilbert_sortperm(m, b; rev))
end

function hilbert_sortperm(m::AbstractMatrix{<:Integer}, b::Integer; rev::Bool = false)
    h = mapslices(__with_bit(b), m; dims = 1) |> vec

    return sortperm(h; rev)
end

function hilbert_sortperm!(m::AbstractMatrix{<:Integer}, b::Integer; rev::Bool = false)
    h = eachcol(m) .|> __with_bit(b)

    return sortperm(h; rev)
end

Where __with_bit is a function factory that returns an expensive-ish computation function (Skilling’s method returning a distance) that mutates. Naturally, I would expect the performance of hilbert_sortperm! > hilbert_sortperm > hilbert_sort! > hilbert_sort. However, the result is the last I would expect:

Large m benchmark
julia> m = rand(Bool, (10, 1000))
10Γ—1000 Matrix{Bool}:
[...]

julia> @benchmark hilbert_sort($(copy(m)), $1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  56.600 ΞΌs …  16.335 ms  β”Š GC (min … max): 0.00% … 99.36%
 Time  (median):     62.400 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   68.468 ΞΌs Β± 163.403 ΞΌs  β”Š GC (mean Β± Οƒ):  2.37% Β±  0.99%

   β–‡β–„β–β–‡β–ˆβ–ƒβ–
  β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–…β–„β–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  56.6 ΞΌs         Histogram: frequency by time          116 ΞΌs <

 Memory estimate: 33.89 KiB, allocs estimate: 23.

julia> @benchmark hilbert_sort!($(copy(m)), $1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  26.300 ΞΌs … 926.400 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     29.000 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   30.798 ΞΌs Β±  11.887 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ƒβ–‡β–ˆβ–‡β–‡β–‡β–…β–„β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–β–                                            β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–‡β–†β–†β–‡β–†β–…β–„β–„β–ƒβ–„β–„β–β–ƒβ–…β–ƒβ–β–β–ƒβ–ƒβ–„β–β–†β–‡β–†β–‡β–‡β–‡β–‡β–‡β–†β–†β–‡β–†β–…β–…β–†β–†β–† β–ˆ
  26.3 ΞΌs       Histogram: log(frequency) by time      68.2 ΞΌs <

 Memory estimate: 16.11 KiB, allocs estimate: 17.

julia> @benchmark hilbert_sortperm($(copy(m)), $1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  52.600 ΞΌs …  15.688 ms  β”Š GC (min … max): 0.00% … 99.41%
 Time  (median):     56.900 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   61.121 ΞΌs Β± 156.932 ΞΌs  β”Š GC (mean Β± Οƒ):  2.55% Β±  0.99%

   β–…β–ˆβ–ƒβ–†β–ƒβ–
  β–ƒβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–…β–„β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  52.6 ΞΌs         Histogram: frequency by time          103 ΞΌs <

 Memory estimate: 23.99 KiB, allocs estimate: 20.

julia> @benchmark hilbert_sortperm!($(copy(m)), $1)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  72.700 ΞΌs …  11.514 ms  β”Š GC (min … max): 0.00% … 99.06%
 Time  (median):     76.700 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   80.416 ΞΌs Β± 115.160 ΞΌs  β”Š GC (mean Β± Οƒ):  1.42% Β±  0.99%

    β–ƒβ–ˆβ–…β–β–
  β–β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  72.7 ΞΌs         Histogram: frequency by time          119 ΞΌs <

 Memory estimate: 23.65 KiB, allocs estimate: 9.

While the performance and allocation of hilbert_sort! beats hilbert_sort as expected, the rest is puzzling.

Although hilbert_sortperm! allocates 9 times instead of 20 times as in hilbert_sortperm, the memory allocation is roughly the same, and the performance is drastically worse. When I tested with smaller m, this result is then reversed:

Small m benchmark
julia> m = rand(Bool, (10, 10))
10Γ—10 Matrix{Bool}:
[...]

julia> @benchmark hilbert_sortperm($(copy(m)), $1)
BenchmarkTools.Trial: 10000 samples with 159 evaluations per sample.
 Range (min … max):  664.780 ns … 303.486 ΞΌs  β”Š GC (min … max): 0.00% … 99.56%
 Time  (median):     782.390 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   857.347 ns Β±   4.029 ΞΌs  β”Š GC (mean Β± Οƒ):  7.89% Β±  1.72%

   β–ˆ
  β–ˆβ–ˆβ–†β–„β–…β–…β–ƒβ–‚β–β–β–ƒβ–ƒβ–ƒβ–„β–ƒβ–ƒβ–ƒβ–„β–„β–…β–†β–‡β–…β–„β–„β–ƒβ–ƒβ–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  665 ns           Histogram: frequency by time         1.21 ΞΌs <

 Memory estimate: 640 bytes, allocs estimate: 15.

julia> @benchmark hilbert_sortperm!($(copy(m)), $1)
BenchmarkTools.Trial: 10000 samples with 258 evaluations per sample.
 Range (min … max):  296.899 ns … 164.754 ΞΌs  β”Š GC (min … max): 0.00% … 99.65%
 Time  (median):     366.279 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   398.245 ns Β±   1.924 ΞΌs  β”Š GC (mean Β± Οƒ):  6.62% Β±  1.41%

  β–‚β–„β–…β–†β–…β–‚β–‚β–‚   β–…β–ˆβ–‡β–†β–†β–…β–…β–„β–„β–„β–ƒβ–‚β–‚β–‚β–β–                                   β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–…β–…β–…β–…β–…β–‚β–„β–„β–„β–…β–„β–„β–„β–„β–†β–„β–„β–†β–ˆβ–†β–„β–…β–…β–…β–„β–ƒβ–„β–„β–… β–ˆ
  297 ns        Histogram: log(frequency) by time        621 ns <

 Memory estimate: 288 bytes, allocs estimate: 4.

Apart from that, hilbert_sort!, which calls hilbert_sortperm, somehow beats the latter in both performance and allocation. I tried using JET to detect type instabilities but no errors. I also tried using a simplified scenario:

function some_sort!(m::AbstractMatrix{<:Integer})
    return Base.permutecols!!(m, some_sortperm(m))
end

function some_sortperm(m::AbstractMatrix{<:Integer})
    return mapslices(example!, m; dims = 1) |> vec |> sortperm
end

function example!(x::AbstractVector{<:Integer})
    for _ in 1:10 # expensive-ish
        x .⊻= x[end] 
    end
    return sum(x)
end

But cannot replicate the peculiar result:

Simplified benchmark
julia> m = rand(Bool, (10, 1000))
10Γ—1000 Matrix{Bool}:
[...]

julia> @benchmark some_sort!($(copy(m)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  66.900 ΞΌs … 417.600 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     68.400 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   70.269 ΞΌs Β±  10.250 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‡β–ˆβ–‡β–†β–†β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–β–β–                                             β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–…β–…β–…β–„β–…β–…β–…β–„β–„β–ƒβ–„β–…β–„β–„β–β–„β–β–„β–β–„β–„β–β–β–…β–†β–‡β–ˆβ–ˆβ–†β–†β–„β–…β–…β–„β–„β–…β–…β–„ β–ˆ
  66.9 ΞΌs       Histogram: log(frequency) by time       105 ΞΌs <

 Memory estimate: 16.25 KiB, allocs estimate: 19.

julia> @benchmark some_sortperm($(copy(m)))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  64.100 ΞΌs … 922.100 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     68.000 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   70.893 ΞΌs Β±  18.556 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–…β–‡β–ˆβ–‡β–‡β–‡β–‡β–†β–„β–ƒβ–‚β–‚β–‚β–β–                                              β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–‡β–…β–…β–…β–…β–„β–…β–…β–ƒβ–„β–„β–…β–„β–ƒβ–‡β–†β–ˆβ–‡β–‡β–†β–‡β–†β–‡β–†β–‡β–†β–†β–†β–‡β–‡β–†β–†β–†β–†β–…β–…β–†β–†β–†β–†β–†β–… β–ˆ
  64.1 ΞΌs       Histogram: log(frequency) by time       120 ΞΌs <

 Memory estimate: 16.25 KiB, allocs estimate: 19.

This really left me puzzled. I am still new to Julia so any help would be greatly appreciated!

This stood out to me, and turns out it’s not a public function to begin with. It appears to mutate both inputs and return the 1st, so your benchmark is mutating the input copy(m) every call. That usually makes for very unrepresentative benchmarks, especially when it converges on a result and do a lot less work in the long run. Sorting tends to do that. The setup option of @benchmark can reset the input per sample, but each sample may have many calls, so you’ll want to also manually set the evals=1 option (which happens to match your benchmark anyway). This may not be the reason, but fixing a mutating benchmark is a start.

It’s not unusual for different iteration mechanisms e.g. map vs broadcast to have different performance, but I’d check if the indirection of .|> is making a difference. That’s not syntactic sugar for piping, it’s an actual higher-order function being broadcasted. Make a variant with the more conventional h = (__with_bit(b)).(eachcol(m)) and benchmark both hilbert_sortperm! and the variant back to back.

2 Likes

Thank you very much! Indeed, I seem to have misunderstood the usage of $ interpolation and though it re-evaluates the expression everytime it’s ran. Now I am able to get this:

New benchmark
julia> const m = rand(Bool, (10, 1000))
10Γ—1000 Matrix{Bool}:
[...]  

julia> @benchmark hilbert_sort(n, $1) setup=(n = copy($m)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  59.000 ΞΌs …  6.247 ms  β”Š GC (min … max): 0.00% … 98.24%
 Time  (median):     65.500 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   74.022 ΞΌs Β± 97.924 ΞΌs  β”Š GC (mean Β± Οƒ):  6.20% Β±  6.54%

  β–„β–…β–…β–†β–ˆβ–ˆβ–‡β–†β–†β–…β–…β–„β–„β–ƒβ–‚β–β–           ▁▁▂▂▂▁▂▂▂▁▁                     β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–‡β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–‡β–‡β–…β–…β–†β–…β–ƒβ–…β–„β–‚β–…β–ƒβ–„β–„β–ƒβ–„ β–ˆ
  59 ΞΌs        Histogram: log(frequency) by time       123 ΞΌs <

 Memory estimate: 33.89 KiB, allocs estimate: 23.

julia> @benchmark hilbert_sort!(n, $1) setup=(n = copy($m)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  64.200 ΞΌs …  5.999 ms  β”Š GC (min … max): 0.00% … 98.12%
 Time  (median):     70.200 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   75.020 ΞΌs Β± 80.601 ΞΌs  β”Š GC (mean Β± Οƒ):  1.69% Β±  1.68%

   β–‚β–„β–‡β–ˆβ–ˆβ–‡β–‡β–†β–†β–†β–…β–…β–…β–…β–„β–ƒβ–ƒβ–‚β–β–                 ▁▁▁▁▁ ▁▁▁▁▁▁▁         β–ƒ
  β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–…β–…β–…β–†β–„β–„β–†β–„β–…β–†β–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–†β–†β–† β–ˆ
  64.2 ΞΌs      Histogram: log(frequency) by time       112 ΞΌs <

 Memory estimate: 23.99 KiB, allocs estimate: 20.

julia> @benchmark hilbert_sortperm(n, $1) setup=(n = copy($m)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  55.600 ΞΌs …  4.894 ms  β”Š GC (min … max): 0.00% … 98.30%
 Time  (median):     61.100 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   67.047 ΞΌs Β± 85.246 ΞΌs  β”Š GC (mean Β± Οƒ):  2.65% Β±  2.17%

  β–‚β–ƒβ–ƒβ–†β–ˆβ–‡β–†β–†β–…β–…β–…β–„β–„β–„β–‚β–ƒβ–‚β–                        ▁▁▁▂▂▁            β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–†β–†β–…β–‚β–‚β–…β–†β–‡β–‡β–†β–‡β–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–†β–‡β–†β–…β–„β–… β–ˆ
  55.6 ΞΌs      Histogram: log(frequency) by time       107 ΞΌs <

 Memory estimate: 23.99 KiB, allocs estimate: 20.

julia> @benchmark hilbert_sortperm!(n, $1) setup=(n = copy($m)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  53.900 ΞΌs …   8.961 ms  β”Š GC (min … max): 0.00% … 98.91%
 Time  (median):     61.000 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   65.985 ΞΌs Β± 108.955 ΞΌs  β”Š GC (mean Β± Οƒ):  3.71% Β±  3.16%

      β–„β–†β–ˆβ–‡β–ƒβ–…β–‚β–‚
  β–β–„β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–†β–…β–„β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–‚β–‚β–‚β–β–‚β–‚β–β–β–‚β–β–β–β–β–β– β–ƒ
  53.9 ΞΌs         Histogram: frequency by time         99.7 ΞΌs <

 Memory estimate: 23.65 KiB, allocs estimate: 9.

Which is exactly what I expect. Thank you for helping!