Custom optimized hash tables with UInt64 keys, lower level than Dict for ultra speed

In Metatheory.jl we have e-graphs, which are basically made of some Dicts. We are encountering some performance bottlenecks because of the implementation of Dict, and we realized that optimized data structures could go a lot faster.

For example, we have:

```julia
const Id = UInt64

struct EClass{D}
  # ... (fields elided)
end

struct EGraph
  # ... (contains the `classes` dict, among other fields)
end
```

It should always be that egraph.classes[x].id == x.

Since this data structure is basically accessed everywhere, hashing of UInt64 is becoming a bottleneck!!

There are only 3 places where we perform operations other than `getindex` on `classes`:

  • Insertion of a new e-class. The id is fresh (incremented by 1) and does not replace any existing element in the dict.
  • Merging of 2 e-classes. This is why we are not using a vector: the length of the `classes` dict might be a lot less than the largest id inserted. Given 2 ids, `id_1` and `id_2`, we remove the element for key `id_2` from the `classes` dict and mutate the value at `id_1`.
  • Update of an e-class. The value for an id is replaced by a new one, since `EClass` is immutable (it could be made mutable).
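In plain Julia, the three operations above amount to something like this (using a toy `Dict{Id,Vector{Id}}` in place of the real `EClass` values; `newclass!` is just an illustrative helper, not Metatheory's API):

```julia
const Id = UInt64

# Toy payload standing in for the real EClass value.
classes = Dict{Id,Vector{Id}}()
next_id = Id(0)

# 1. Insertion of a new e-class: the id is fresh (a counter), never reused.
newclass!(v) = (global next_id += 1; classes[next_id] = v; next_id)
id_1 = newclass!(Id[Id(10)])
id_2 = newclass!(Id[Id(20)])

# 2. Merging two e-classes: fold id_2's data into id_1, then remove id_2.
append!(classes[id_1], classes[id_2])
delete!(classes, id_2)

# 3. Update: the value for an id is replaced wholesale (EClass is immutable).
classes[id_1] = Id[Id(42)]
```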

We are trying to reduce the time required by `getindex(g.classes, id::UInt64)`. Since we're already accessing the dict via a UInt64, is there really any need for hashing?
The entirety of the algorithm does a LOT of accesses to the classes Dict (and other Dicts) and we’re trying to kill the time (and allocations) required by accessing this Dict.

It can be as low level as possible. +1 if anyone knows something that avoids the garbage collector, since stuff should be freed when removed from the Dict.

I’ve implemented the trick described in Dictionary with custom hash function - #6 by Oscar_Smith:

```julia
# There's no need to compute a hash for dictionaries whose keys are UInt64:
# wrap them in an immutable struct that overrides `hash`.
# TODO: this is rather hacky. We need a more performant dict implementation.
# Trick from:
struct IdKey
  val::Id
end
Base.hash(a::IdKey, h::UInt) = xor(a.val, h)
Base.:(==)(a::IdKey, b::IdKey) = a.val == b.val
```
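For completeness, a runnable version of the trick with a toy value type (`Int` standing in for the actual e-class payload):

```julia
const Id = UInt64

struct IdKey
  val::Id
end
# The key is already a (cached) hash: just mix it with the seed.
Base.hash(a::IdKey, h::UInt) = xor(a.val, h)
Base.:(==)(a::IdKey, b::IdKey) = a.val == b.val

classes = Dict{IdKey,Int}()
classes[IdKey(Id(42))] = 1
classes[IdKey(Id(42))]   # lookup uses the xor hash above, no generic UInt64 hashing
```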

Since that discussion links to [WIP/RFC] broaden the scope of `hash` by rfourquet · Pull Request #37964 · JuliaLang/julia · GitHub - a POC of custom hash types in dictionaries, and introducing an extra struct is rather tricky, are you aware of any other approach that solves this issue? Basically all of our hashes are already cached before dictionary accesses.

We just need the fastest possible UInt64 → UInt64 dictionary ever.
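For reference, a minimal sketch of what such a specialized table can look like: open addressing with linear probing, power-of-two capacity, and the key used directly as its own hash. Key 0 is reserved as the empty sentinel and deletion is omitted; this is illustrative, not production code.

```julia
# Minimal open-addressing UInt64 -> UInt64 map: linear probing,
# power-of-two capacity, the key itself is the hash.
# Assumes key 0 is never used (reserved as the "empty" sentinel).
mutable struct FastU64Map
  keys::Vector{UInt64}
  vals::Vector{UInt64}
  count::Int
end
FastU64Map(cap::Int = 16) = FastU64Map(zeros(UInt64, cap), zeros(UInt64, cap), 0)

@inline slot(k::UInt64, mask::UInt64) = (k & mask) + 1

function Base.setindex!(m::FastU64Map, v::UInt64, k::UInt64)
  @assert k != 0 "key 0 is reserved"
  if 2 * (m.count + 1) > length(m.keys)   # keep load factor <= 0.5
    grow!(m)
  end
  mask = UInt64(length(m.keys) - 1)
  i = slot(k, mask)
  @inbounds while m.keys[i] != 0 && m.keys[i] != k
    i = (i % length(m.keys)) + 1          # wrap around
  end
  @inbounds begin
    m.keys[i] == 0 && (m.count += 1)
    m.keys[i] = k
    m.vals[i] = v
  end
  m
end

function Base.getindex(m::FastU64Map, k::UInt64)
  mask = UInt64(length(m.keys) - 1)
  i = slot(k, mask)
  @inbounds while true
    m.keys[i] == k && return m.vals[i]
    m.keys[i] == 0 && throw(KeyError(k))
    i = (i % length(m.keys)) + 1
  end
end

function grow!(m::FastU64Map)
  old_keys, old_vals = m.keys, m.vals
  m.keys = zeros(UInt64, 2 * length(old_keys))
  m.vals = zeros(UInt64, 2 * length(old_keys))
  m.count = 0
  for i in eachindex(old_keys)
    old_keys[i] != 0 && (m[old_keys[i]] = old_vals[i])
  end
  m
end
```

Deletion (needed when merging e-classes) would require tombstones or backward-shift deletion, which this sketch omits.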


A MWE with profiler flamegraph would be nice.

Why UInt64? Why not a narrower type?

Have you considered using a concrete type instead of a Union? Maybe SumTypes.jl?

TBH that’s an issue with the design of your structs. Obviously Id should have been a type parameter instead of a global name. Better fix it sooner rather than later.

I haven’t done any benchmarking, but maybe something like the data structure used in slotmap - Rust would have less overhead?
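For concreteness, a minimal Julia sketch of the slotmap idea (generational indices over plain vectors; the names are illustrative, not the Rust crate's API):

```julia
# Slotmap-style container: keys index a slots vector directly,
# a free list recycles removed slots, generations detect stale keys.
struct SlotKey
  index::UInt32
  generation::UInt32
end

mutable struct SlotMap{T}
  values::Vector{T}
  generations::Vector{UInt32}
  free::Vector{UInt32}
end
SlotMap{T}() where {T} = SlotMap{T}(T[], UInt32[], UInt32[])

function Base.insert!(m::SlotMap{T}, v::T) where {T}
  if isempty(m.free)
    push!(m.values, v)
    push!(m.generations, UInt32(0))
    return SlotKey(UInt32(length(m.values)), UInt32(0))
  else
    i = pop!(m.free)                      # reuse a freed slot
    m.values[i] = v
    return SlotKey(i, m.generations[i])
  end
end

function Base.getindex(m::SlotMap, k::SlotKey)
  m.generations[k.index] == k.generation || throw(KeyError(k))
  m.values[k.index]
end

function Base.delete!(m::SlotMap, k::SlotKey)
  m.generations[k.index] == k.generation || throw(KeyError(k))
  m.generations[k.index] += UInt32(1)     # invalidate outstanding keys
  push!(m.free, k.index)
  m
end
```

Lookup is one generation check plus one vector access, with no hashing; deleted slots are recycled through the free list, and stale keys are caught by the generation counter.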

You can easily define a custom hash function that only does xor: Dictionary with custom hash function - #4 by stevengj


> A MWE with profiler flamegraph would be nice.

On branch 3.0 Release by 0x0f0f0f · Pull Request #185 · JuliaSymbolics/Metatheory.jl · GitHub

```julia
using Metatheory, BenchmarkTools

t = @theory a b begin
  a + b --> b + a
  a * b --> b * a
  a + 0 --> a
  a * 0 --> 0
  a * 1 --> a
end

p = SaturationParams(; timer = false)

function simpl(ex)
  g = EGraph(ex)
  saturate!(g, t, p)
  extract!(g, astsize)
end

ex = :(0 + (1 * foo) * 0 + (a * 0) + a)

@profview simpl(ex)
```

Separating out the troublesome parts of the algorithm doesn’t really make sense as of now because it’s a quite complex system that involves compiled pattern matching and I’m benchmarking all the components together.

> Why UInt64? Why not a narrower type?

It’s a trick to pack everything efficiently in Vector{UInt64}.

```julia
const Id = UInt64
const VecExpr = Vector{Id}
```

An e-node is a `VecExpr` where:
* Position 1 stores the hash of the `VecExpr`.
* Position 2 stores the bit flags (`isexpr` or `iscall`).
* Position 3 stores the signature.
* Position 4 stores the hash of the `head` (if `isexpr`) or the node value in the e-graph constants.
* The rest of the positions store the e-class ids of the children nodes.

This allowed for an extreme speedup compared to using structs (see this file). This is also because all of these values need to be pushed to and popped from a buffer hundreds of thousands (if not millions) of times. For now this restricts the library to systems where hashes are UInt64, but that's not much of an issue, as everything is vectorized as bytes.
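As a toy illustration of that layout (the flag and signature values below are schematic placeholders, not Metatheory's actual encoding):

```julia
const Id = UInt64
const VecExpr = Vector{Id}

# Schematic e-node for `f(c1, c2)`: [hash, flags, signature, head-hash, child ids...]
head_hash = hash(:f)
children = Id[Id(3), Id(7)]
n = VecExpr(undef, 4 + length(children))
n[2] = Id(0x01)                     # bit flags (e.g. an isexpr bit) -- illustrative
n[3] = Id(0)                        # signature -- illustrative placeholder
n[4] = Id(head_hash)                # hash of the head symbol
n[5:end] = children                 # e-class ids of the children
n[1] = Id(hash(@view(n[2:end])))    # position 1 caches the hash of the rest
```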

> TBH that’s an issue with the design of your structs. Obviously Id should have been a type parameter instead of a global name. Better fix it sooner rather than later.

I see what you mean, but considering how `VecExpr`s are used all around, and that the substantial speedup came from packing things in contiguous segments of memory, I’m not really sure that is doable. It should be as low level, as close to bytes and memory, as possible.

Well, Metatheory.jl is basically an extended rewrite of GitHub - egraphs-good/egg: egg is a flexible, high-performance e-graph library from Rust in Julia, integrating with all our nice language features and a Symbolics.jl / SymbolicUtils.jl-compatible rewrite rule language and term interface (TermInterface.jl) 🙂

See the preprint for version 3.0

We do have automatic benchmarks that run against the Rust implementation 3.0 Release by 0x0f0f0f · Pull Request #185 · JuliaSymbolics/Metatheory.jl · GitHub

You can see that we’re pretty close to Rust in performance, but some improvements can still be achieved.

Benchmark Results

| benchmark | egg-sym | egg-cust | MT@572e2f625bf… | MT@master | egg-sym/MT@572… | egg-cust/MT@57… | MT@master/MT@5… |
|---|---|---|---|---|---|---|---|
| egraph_addexpr | 1.46 ms | — | 5.18 ms | 16 ms | 0.281 | — | 3.08 |
| basic_maths_simpl2 | 14.1 ms | 5.11 ms | 22.2 ms | 1.07e+03 ms | 0.637 | 0.23 | 48.3 |
| prop_logic_freges_theorem | 2.67 ms | 1.56 ms | 2.5 ms | 38.8 ms | 1.07 | 0.624 | 15.5 |
| calc_logic_demorgan | 62.3 μs | 34.4 μs | 75.7 μs | 474 μs | 0.824 | 0.455 | 6.26 |
| calc_logic_freges_theorem | 6.28 ms | 2.87 ms | — | — | — | — | — |
| basic_maths_simpl1 | 6.78 ms | 2.81 ms | 5.65 ms | 57.5 ms | 1.2 | 0.497 | 10.2 |
| egraph_constructor | 0.0882 μs | — | 0.739 μs | 0.599 μs | 0.119 | — | 0.81 |
| prop_logic_prove1 | 37.4 ms | 14.3 ms | 73.8 ms | 8.88e+03 ms | 0.508 | 0.193 | 120 |
| prop_logic_demorgan | 82.1 μs | 44.9 μs | 100 μs | 821 μs | 0.821 | 0.449 | 8.21 |
| while_superinterpreter_while_10 | — | — | 17.1 ms | 98 ms | — | — | 5.74 |
| prop_logic_rewrite | — | — | 110 μs | 123 μs | — | — | 1.12 |
| time_to_load | — | — | 70.9 ms | 101 ms | — | — | 1.43 |

Benchmarks are here Metatheory.jl/benchmark/benchmarks.jl at ale/3.0 · JuliaSymbolics/Metatheory.jl · GitHub