Hashing for big structs is slow - any alternative?

I have three structs, one mutable and two immutable, defined as:

mutable struct Node
    label::NodeEdge
    comp::Vector{UInt8}       
    comp_weights::Vector{UInt8} 
    cc::UInt8                   
    fps::Vector{ForbiddenPair}
    comp_assign::Vector{UInt8}  
end

struct ForbiddenPair
    comp₁::UInt8
    comp₂::UInt8
end

struct NodeEdge
    edge₁::UInt8
    edge₂::UInt8
end

In my profiling, the bottleneck in multiple parts of the program seems to be the hash() function for this struct, which I define as

Base.hash(n::Node, h::UInt) = hash(n.label, hash(n.comp, hash(n.cc, hash(n.fps, hash(n.comp_weights, hash(n.comp_assign, hash(:Node, h)))))))
Base.hash(fp::ForbiddenPair, h::UInt) = hash(fp.comp₁, hash(fp.comp₂, hash(:ForbiddenPair, h)))

How can I speed hashing up, or is that an inevitable slowup for a struct of this size? None of the fields can be dropped.

Any other performance pointers or suggestions would also be very appreciated!! :slight_smile:

Is your label unique?
If yes, then only use label as input to the hash function.

My guess on a quick REPL try is, that hashing a vector depends on the elements. Therefore the hashing time grows with the number of elements.

julia> x = Vector{UInt8}(undef, Int(1e7));

julia> @time hash(x)
  0.013269 seconds (1 allocation: 16 bytes)
0x30af4dd68158ac9b

julia> x = Vector{UInt8}(undef, Int(1e8));

julia> @time hash(x)
  0.075920 seconds (1 allocation: 16 bytes)
0xc9ff726914688351

julia> x = Vector{UInt8}(undef, Int(1e9));

julia> @time hash(x)
  0.701304 seconds (1 allocation: 16 bytes)
0xb16f17b4bd4d9b7b

julia> x = Vector{UInt8}(undef, Int(1e10));

julia> @time hash(x)
  7.024885 seconds (1 allocation: 16 bytes)
0xf74daa50de3a895a
1 Like

Can any of the fields that are Vector{UInt8} be sampled very roughly? eg:

hash2(vec::Vector{UInt8}) =
  hash(v[end], v[1] % UInt64))

hash3(vec::Vector{UInt8}) =
  hash(v[end], hash(v[end>>1], v[1] % UInt64))

hash4(vec::Vector{UInt8}) =
  hash(v[end], hash(v[end-1], hash(v[2], v[1] % UInt64)))

hash4(vec::Vector{UInt8}) =
  hash(v[end], hash(v[end>>1], hash(v[end>>2], v[1] % UInt64)))
1 Like

Just to make explicit the idea behind the previous answers: you are free to do anything with hash as long as the invariant “a == b implies hash(a) == hash(b)” is maintained. So you can define hash to return a constant number, this is very fast to compute but will result in collisions (this is inefficient) when objects are stored in a Set. So the idea is to find a tradeoff such that hash is reasonably fast while limiting the number of collisions (i.e. we want hash(a) != hash(b) when a != b as much as possible).

3 Likes

What are you calling hash for? What are you using the result for?

I use hash(Node) as keys to my Dict, which is of the form Dict{UInt64, Int64}. My dictionary gets gigantic and I was hitting RAM issues so I resorted to using the hash() of the Nodes as keys instead, which works for my use case. However, this means that everytime I add to the dictionary I have to hash() my new Node, and likewise for everytime I want to look up a value.

Maybe you can use an IdDict instead?

1 Like

Thank you for the excellent suggestion – but my understanding of IdDict is that it uses the === operator i.e that you want to have dict by with unique keys by object identity instead of value equality, which is not the case for me. My Node objects are different but I want uniqueness to be defined by value.

Maybe caching the hash in the object itself and invalidate/update when the stored values change?

3 Likes

Don’t we have basically three options?

  1. You pay the price of completely hashing the full list

  2. You use the label as hash? (I mean, what’s the point of having a label if it’s not unique?)

  3. What @tisztamo suggested.

Thank you so much for everyone’s help! What ended up working for me was @tisztamo 's suggestion - storing the hash of the Node struct as a field inside Node itself, and updating it when the values changed.

I am also using a Set to keep my Nodes and this solution entailed changing the hashindex function in Dict (because Sets are implemented as Dicts under the hood) from this:

hashindex(key, sz) = (((hash(key)%Int) & (sz-1)) + 1)::Int

to this:

import Base: Dict
hashindex(node::Node, sz) = (((node.hash%Int) & (sz-1)) + 1)::Int

that is to say, I had to rewrite this function to check for the hash field instead of recomputing it.

This sped me up tons. Thanks!!

4 Likes

How do you detect when elements of the array fields change so you can update the cached hash value?

1 Like

Yeah, cache invalidation is hard…

https://www.karlton.org/2017/12/naming-things-hard/

I think there is no simple and general solution for this, at least if you want to allow manipulating the content from the “outside”, or you have a deeply nested structure.

But in this concrete case it seems possible to create a custom array type with overloaded setindex!, that either notifies the container to invalidate the hash-cache, or calculates the diff of the hash and updates the cache - which one is better depends on updating patterns, I think.

1 Like