Specialization tips to shrink 45x performance gap when comparing structs

milesf · September 27, 2019, 10:58pm

I’m seeking tips on how to improve performance of struct comparisons where the specific comparison rule is selected at runtime.

I have a struct with n fields, and I’d like to sort by each field in either order.

Performance with a hardcoded rule is 45x faster than my existing dynamic strategy (benchmarks below), so I’m thinking it could be worthwhile to use precompiled rules for all dynamic possibilities. There are n! * 2^n possible ordering rules, so this might become unwieldy. The example below uses n=3 (48 rules). My application requires n=10 (3715891200 or 3628800 rules)[1].

I considered using the type system to generate these specialized rules, but this might expand to n^n * 2^n rules, and seems cautioned against.

Generated functions and macros might be another path to success, but I’d like to follow the “don’t do metaprogramming” advice.

Maybe there is some middle-ground with a smaller performance penalty.

Two other other options I’m investigating are:

Give the struct a FieldVector view and create a “comparison score” by multiplying with a dynamic array of weights of unique powers-of-2 and taking the sum.
Create a dynamic array of comparison functions and loop through those. This requires n or 2n comparison functions.

I’ll post benchmark results of the above once they’re available, but figured it would be wise to ask for advice from more experienced community members in the meantime.

Here are benchmark results for the two techniques with a wide performance gap:

# ========= Struct Definition =========

using Random
using BenchmarkTools

struct MyStruct
    a::Int64
    b::Int64
    c::Int64
end

function Random.rand(rng::AbstractRNG, ::Random.SamplerType{MyStruct})
    MyStruct(
        rand(rng, Int),
        rand(rng, Int),
        rand(rng, Int),
    )
end

xs = rand(MyStruct, 1000)

# ============ Static Rule ============

function static_lt(a::MyStruct, b::MyStruct)
    a.a > b.a && return true
    a.a < b.a && return false

    a.c > b.c && return true
    a.c < b.c && return false
    
    a.b < b.b && return true
    a.b > b.b && return false

    return false
end

julia> @benchmark sort!(a, lt=static_lt) setup=(a = copy($xs))
BenchmarkTools.Trial: 
  memory estimate:  11.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     37.459 μs (0.00% GC)
  median time:      38.542 μs (0.00% GC)
  mean time:        38.913 μs (0.00% GC)
  maximum time:     67.398 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

# ========= Dynamic Rule ============

struct FieldOrder
    sym::Symbol
    ord::Base.Ordering
end

struct StructOrdering <: Base.Ordering
    field_orderings::AbstractVector{FieldOrder}
end

function Base.lt(struct_ordering::StructOrdering, a, b)
    for field_ordering in struct_ordering.field_orderings
        sym = field_ordering.sym
        ord = field_ordering.ord
        va = getproperty(a, sym)
        vb = getproperty(b, sym)
        Base.lt(ord, va, vb) && return true
        Base.lt(ord, vb, va) && return false
    end
    false # a and b are equal
end

dynamic_order = StructOrdering([
    FieldOrder(:a, Base.Reverse),
    FieldOrder(:c, Base.Reverse),
    FieldOrder(:b, Base.Forward),
    ])

julia> @benchmark sort!(a, order=$dynamic_order) setup=(a = copy($xs))
BenchmarkTools.Trial: 
  memory estimate:  1.71 MiB
  allocs estimate:  80248
  --------------
  minimum time:     1.686 ms (0.00% GC)
  median time:      1.707 ms (0.00% GC)
  mean time:        1.899 ms (8.83% GC)
  maximum time:     50.929 ms (96.21% GC)
  --------------
  samples:          2622
  evals/sample:     1

# =========== Equivalent Dynamic Rule ===========

# This dynamic technique could also be passed to sort's `lt`
function dynamic_lt(struct_ordering::StructOrdering)
    function (a::MyStruct, b::MyStruct)
        Base.lt(struct_ordering, a, b)
    end
end

julia> @benchmark sort!(a, lt=$(dynamic_lt)($dynamic_order)) setup=(a = copy($xs))
BenchmarkTools.Trial: 
  memory estimate:  1.71 MiB
  allocs estimate:  80250
  --------------
  minimum time:     1.688 ms (0.00% GC)
  median time:      1.714 ms (0.00% GC)
  mean time:        1.901 ms (8.85% GC)
  maximum time:     51.383 ms (96.25% GC)
  --------------
  samples:          2618
  evals/sample:     1

[1] My use case uses a fixed order for each field. e.g. field a always descending, field b always ascending, which eliminates the 2^n term to reduce the total possibilities to n!.

c42f · September 28, 2019, 5:35am

I’d say this is an example where some metaprogramming is warranted Obviously not to generate all the possibilities up front, but to encode the ordering in a type which the compiler can then use to create a specialized comparison function on demand. If it’s done on demand, you’re hardly likely to generate all n! possibilites (it seems unlikely to explore all those options as part of the dynamic execution of the program).

I’d probably try encoding the ordering of fields as a tuple of symbols in a type parameter of some type you define. Then implement the comparison either as a recursive inline function which reduces that tuple on each self-call (allowing the compiler to unroll the recursion fully), or as a @generated function if all else fails.

Elrod · September 28, 2019, 4:16pm

I’m sure I’m too quick to fall back to the @generated approach, but it’s easy.
General setup and static_lt

julia> using Random

julia> using BenchmarkTools

julia> struct MyStruct
           a::Int64
           b::Int64
           c::Int64
       end

julia> function Random.rand(rng::AbstractRNG, ::Random.SamplerType{MyStruct})
           MyStruct(
               rand(rng, Int),
               rand(rng, Int),
               rand(rng, Int),
           )
       end

julia> xs = rand(MyStruct, 1000);

julia> function static_lt(a::MyStruct, b::MyStruct)
           a.a > b.a && return true
           a.a < b.a && return false
       
           a.c > b.c && return true
           a.c < b.c && return false
           
           a.b < b.b && return true
           a.b > b.b && return false
       
           return false
       end
static_lt (generic function with 1 method)

Generated:

julia> struct FieldOrder{S,O}
           ord::O
       end

julia> FieldOrder{S}(f::F) where {S,F} = FieldOrder{S,F}(f)

julia> struct StructOrdering{T} <: Base.Ordering
           field_orderings::T
       end

julia> @generated function Base.lt(struct_ordering::StructOrdering{T}, a, b) where {T}
           N = length(T.parameters); #@show T.parameters[3].parameters
           q = quote fo = struct_ordering.field_orderings end
           for n in 1:N
               va = Symbol(:va_,n); vb = Symbol(:vb_,n); ord = Symbol(:ord_,n)
               push!( q.args, :($va = getproperty(a, $(QuoteNode(T.parameters[n].parameters[1])))))
               push!( q.args, :($vb = getproperty(b, $(QuoteNode(T.parameters[n].parameters[1])))))
               push!( q.args, :($ord = fo[$n].ord))
               push!( q.args, :(Base.lt($ord, $va, $vb) && return  true) )
               push!( q.args, :(Base.lt($ord, $vb, $va) && return false) )
           end    
           push!(q.args, false) # a and b are equal
           q
       end
       
julia> dynamic_order = StructOrdering((
           FieldOrder{:a}(Base.Reverse),
           FieldOrder{:c}(Base.Reverse),
           FieldOrder{:b}(Base.Forward)
           ));

Benchmarks the same (because I tried to just create the static_lt code):

julia> @benchmark sort!(a, order=$dynamic_order) setup=(a = copy($xs))
BenchmarkTools.Trial: 
  memory estimate:  11.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     15.143 μs (0.00% GC)
  median time:      16.379 μs (0.00% GC)
  mean time:        16.625 μs (0.56% GC)
  maximum time:     265.629 μs (90.63% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark sort!(a, lt=static_lt) setup=(a = copy($xs))
BenchmarkTools.Trial: 
  memory estimate:  11.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     15.163 μs (0.00% GC)
  median time:      16.337 μs (0.00% GC)
  mean time:        16.650 μs (0.56% GC)
  maximum time:     263.680 μs (90.19% GC)
  --------------
  samples:          10000
  evals/sample:     4

milesf · September 29, 2019, 6:15pm

Thanks @Elrod. Very happy to see this work without any noticeable performance penalty.

@c42f I also appreciate your suggestion. I got one version of this implemented, but it’s not as clean as I suspect it could be. Do you have any suggestions on how I can clean-up my code to accomplish following in a more elegant way, or is this problem best solved with generated functions?

Remove code duplication
Handle an empty StructOrdering
Exactly match performance of the static and generated strategies. Benchmarks consistently show a very slight (6%) performance penalty.

## ----------- Setup -----------
using Random
using BenchmarkTools

struct MyStruct
    a::Int64
    b::Int64
    c::Int64
end

function Random.rand(rng::AbstractRNG, ::Random.SamplerType{MyStruct})
    MyStruct(
        rand(rng, Int),
        rand(rng, Int),
        rand(rng, Int),
    )
end

xs = rand(MyStruct, 1000)

## ---------- Static comparison for refrence ---------
function static_lt(a::MyStruct, b::MyStruct)
    a.a > b.a && return true
    a.a < b.a && return false

    a.c > b.c && return true
    a.c < b.c && return false

    a.b < b.b && return true
    a.b > b.b && return false

    return false
end

## ---------- Reusable piece from Elrod's solution ----------
struct FieldOrder{S,O}
    ord::O
end

FieldOrder{S}(f::F) where {S,F} = FieldOrder{S,F}(f)

struct StructOrdering{T} <: Base.Ordering
    field_orderings::T
end

dynamic_order = StructOrdering((
    FieldOrder{:a}(Base.Reverse),
    FieldOrder{:c}(Base.Reverse),
    FieldOrder{:b}(Base.Forward)
));

## ---------- Attempting Chris Foster's suggestion ---------
@inline function recurse_lt(a, b, fo::FieldOrder{S,O}, rem::StructOrdering{T}) where {S,O,T}
    ord = fo.ord
    va = getproperty(a, S)
    vb = getproperty(b, S)
    Base.lt(ord, va, vb) && return true
    Base.lt(ord, vb, va) && return false
    return recurse_lt(a, b, rem.field_orderings[1], StructOrdering(Base.tail(rem.field_orderings)))
end

# Contains the base case along with duplicated code
@inline function recurse_lt(a, b, fo::FieldOrder{S,O}) where {S,O}
    ord = fo.ord
    va = getproperty(a, S)
    vb = getproperty(b, S)
    Base.lt(ord, va, vb) && return true
    Base.lt(ord, vb, va) && return false
    return false
end

function Base.lt(struct_ordering::StructOrdering{T}, a, b) where {T}
    fos = struct_ordering.field_orderings
    # todo, need to account for empty order
    return recurse_lt(a, b, fos[1], StructOrdering(Base.tail(fos)))
end

## ----------- Benchmarks -------------
julia> @benchmark sort!(a, order=dynamic_order) setup=(a = copy($xs))
BenchmarkTools.Trial: 
  memory estimate:  11.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     37.997 μs (0.00% GC)
  median time:      39.267 μs (0.00% GC)
  mean time:        39.958 μs (0.00% GC)
  maximum time:     104.144 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark sort!(a, lt=static_lt) setup=(a = copy($xs))
BenchmarkTools.Trial: 
  memory estimate:  11.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     35.685 μs (0.00% GC)
  median time:      36.807 μs (0.00% GC)
  mean time:        37.369 μs (0.00% GC)
  maximum time:     96.137 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

c42f · September 30, 2019, 4:25am

Here’s slightly modified version of your recursive code but without the code duplication. This follows the general pattern I’d use for this kind of thing:

struct FieldOrder{S,O}
    ord::O
end

FieldOrder{S}(f::F) where {S,F} = FieldOrder{S,F}(f)

struct StructLt{T} 
    field_orderings::T
end

@inline recurse_lt(a, b) = false # leq
@inline function recurse_lt(a, b, fo::FieldOrder{S}, rem...) where {S}
    va = getproperty(a, S)
    vb = getproperty(b, S)
    ord = fo.ord
    Base.lt(ord, va, vb) && return true
    Base.lt(ord, vb, va) && return false
    return recurse_lt(a, b, rem...)
end

function (ordering::StructLt)(a, b)
    return recurse_lt(a, b, ordering.field_orderings...)
end

dynamic_lt = StructLt((
    FieldOrder{:a}(Base.Reverse),
    FieldOrder{:c}(Base.Reverse),
    FieldOrder{:b}(Base.Forward)
));

xs = rand(MyStruct, 1000)

@btime sort!(a, lt=dynamic_lt) setup=(a = copy($xs));

Unfortunately that this is quite a lot slower than static_lt (~50% or so). If you look at the @code_typed, you’ll see that the recursion is correctly inlined and unrolled so it appears we’ve told type inference enough. However the control flow graph is quite a lot different from static_lt and this is true even for sorting based on a single field (note the presence of the φ node).

julia> dlt = StructLt((FieldOrder{:a}(Base.Forward),));

julia> @code_typed optimize=true dlt(xs[1], xs[2])
CodeInfo(
1 ─ %1  = Base.getfield(a, :a)::Int64
│   %2  = Base.getfield(b, :a)::Int64
│   %3  = Base.slt_int(%1, %2)::Bool
└──       goto #3 if not %3
2 ─       goto #6
3 ─ %6  = Base.slt_int(%2, %1)::Bool
└──       goto #5 if not %6
4 ─       goto #6
5 ─       goto #6
6 ┄ %10 = φ (#2 => true, #4 => false, #5 => false)::Bool
└──       return %10
) => Bool

Comparing to a direct call of recurse_lt shows that it’s probably something to do with unpacking ordering.field_orderings inside the StructOrdering functor call, because the following looks more like you might expect it to:

julia> @code_typed optimize=true recurse_lt(xs[1], xs[2], dlt.field_orderings...)
CodeInfo(
1 ─ %1 = $(Expr(:static_parameter, 1))::Core.Compiler.Const(:a, false)
│   %2 = Base.getfield(a, %1)::Int64
│   %3 = $(Expr(:static_parameter, 1))::Core.Compiler.Const(:a, false)
│   %4 = Base.getfield(b, %3)::Int64
│   %5 = Base.slt_int(%2, %4)::Bool
└──      goto #3 if not %5
2 ─      return true
3 ─ %8 = Base.slt_int(%4, %2)::Bool
└──      goto #5 if not %8
4 ─      return false
5 ─      return false
) => Bool

(note no phi node here; instead a direct return is used).

For more fields it seems that the LLVM optimization passes aren’t able to unscramble this control flow and efficiency suffers somewhat; there might be a workaround but I’m not sure what it is.

In these situations it can be easier to just use a @generated function and not worry too much about it

milesf · September 30, 2019, 5:04am

Thanks for the analysis. This is a helpful example for discovering new techniques. Learned the following:

Did not expect that rem... could be untyped.
Did not know that a struct instance can be called as a function too.
Still need to do some more reading on the implications of phi nodes. I assume the answer can be found here.

c42f · September 30, 2019, 6:01am

If you’re interested in how the compiler currently works and what you could do to convince it to generate better code, then reading up on the SSA form IR could be interesting (see also the devdocs here: Julia SSA-form IR · The Julia Language). Do keep in mind that this is an implementation detail of the compiler so it will probably require quite a deep dive into the actual code to understand what’s going on. It depends on your appetite for reading and puzzling over a lot of code as to whether that’s worthwhile

c42f · September 30, 2019, 6:24am

Yes, types in function signatures are for controlling dispatch not specialization. If there’s only one method for a function you can just leave all the types unspecified. The compiler will still figure out the specific type for you at runtime and generate specialized code for that.

Topic		Replies	Views
Unexpected performance outcome. Does accessing struct members cause allocation? General Usage performance	18	284	October 16, 2024
Efficient reflection on structs Performance	54	2184	February 17, 2021
How is it possible that parallelized code causes fewer allocations? Performance parallel , sort , memory-allocation	19	383	December 30, 2024
Some notes on performance of callable structs Performance	17	1012	December 30, 2023
On the performance of function calls that depends on a variable Performance metaprogramming	12	953	February 24, 2021

Specialization tips to shrink 45x performance gap when comparing structs

Related topics