Hashing for big structs is slow - any alternative?

bsuwal · January 23, 2021, 12:42am

I have three structs, one mutable and two immutable, defined as:

struct ForbiddenPair
    comp₁::UInt8
    comp₂::UInt8
end

struct NodeEdge
    edge₁::UInt8
    edge₂::UInt8
end

# Node is defined last because its fields refer to the two types above
mutable struct Node
    label::NodeEdge
    comp::Vector{UInt8}
    comp_weights::Vector{UInt8}
    cc::UInt8
    fps::Vector{ForbiddenPair}
    comp_assign::Vector{UInt8}
end

In my profiling, the bottleneck in multiple parts of the program seems to be the hash() function for this struct, which I define as

Base.hash(n::Node, h::UInt) = hash(n.label, hash(n.comp, hash(n.cc, hash(n.fps, hash(n.comp_weights, hash(n.comp_assign, hash(:Node, h)))))))
Base.hash(fp::ForbiddenPair, h::UInt) = hash(fp.comp₁, hash(fp.comp₂, hash(:ForbiddenPair, h)))

How can I speed hashing up, or is that an inevitable slowdown for a struct of this size? None of the fields can be dropped.

Any other performance pointers or suggestions would also be very appreciated!!

roflmaostc · January 23, 2021, 8:27am

Is your label unique?
If yes, then only use label as input to the hash function.

My guess, from a quick try in the REPL, is that hashing a vector depends on its elements, so the hashing time grows with the number of elements.

julia> x = Vector{UInt8}(undef, Int(1e7));

julia> @time hash(x)
  0.013269 seconds (1 allocation: 16 bytes)
0x30af4dd68158ac9b

julia> x = Vector{UInt8}(undef, Int(1e8));

julia> @time hash(x)
  0.075920 seconds (1 allocation: 16 bytes)
0xc9ff726914688351

julia> x = Vector{UInt8}(undef, Int(1e9));

julia> @time hash(x)
  0.701304 seconds (1 allocation: 16 bytes)
0xb16f17b4bd4d9b7b

julia> x = Vector{UInt8}(undef, Int(1e10));

julia> @time hash(x)
  7.024885 seconds (1 allocation: 16 bytes)
0xf74daa50de3a895a

JeffreySarnoff · January 23, 2021, 10:33am

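Can any of the fields that are Vector{UInt8} be sampled very roughly? e.g.:

# first and last element
hash2(v::Vector{UInt8}) =
  hash(v[end], v[1] % UInt64)

# first, middle and last element
hash3(v::Vector{UInt8}) =
  hash(v[end], hash(v[end>>1], v[1] % UInt64))

# first two and last two elements
hash4(v::Vector{UInt8}) =
  hash(v[end], hash(v[end-1], hash(v[2], v[1] % UInt64)))

# alternative hash4 (replaces the one above): positions spread across the vector, assumes length(v) >= 4
hash4(v::Vector{UInt8}) =
  hash(v[end], hash(v[end>>1], hash(v[end>>2], v[1] % UInt64)))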

rfourquet · January 23, 2021, 11:21am

Just to make explicit the idea behind the previous answers: you are free to do anything with hash as long as the invariant “isequal(a, b) implies hash(a) == hash(b)” is maintained. So you could define hash to return a constant number: that is very fast to compute, but every object then collides when stored in a Set, which makes lookups inefficient. The idea is to find a tradeoff where hash is reasonably fast while collisions stay rare (i.e. we want hash(a) != hash(b) when a != b as often as possible).
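
A minimal sketch of that tradeoff, using a hypothetical Blob type that is not from this thread. The hash only looks at the length and the two end bytes, so it is cheap and still respects the invariant, but values that differ only in the middle collide and have to be told apart by ==:

```julia
# Hypothetical type for illustration only (not from the thread).
struct Blob
    data::Vector{UInt8}
end

# Value equality: two Blobs are equal when their contents are equal.
Base.:(==)(a::Blob, b::Blob) = a.data == b.data

# Cheap hash: mixes only the length and the first and last bytes.
# Equal Blobs agree on all three, so equal values get equal hashes,
# but unequal Blobs that agree on those three values collide.
function Base.hash(b::Blob, h::UInt)
    h = hash(:Blob, h)
    h = hash(length(b.data), h)
    isempty(b.data) && return h
    return hash(b.data[end], hash(b.data[1], h))
end

a = Blob(UInt8[1, 2, 3])
b = Blob(UInt8[1, 2, 3])   # equal to a
c = Blob(UInt8[1, 9, 3])   # differs from a only in the middle

s = Set([a, b, c])         # a and b merge; c collides with a but is kept apart by ==
```

The Set stays correct despite the collision; collisions only cost extra == comparisons, which is why a constant hash works but is slow.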

kristoffer.carlsson · January 23, 2021, 4:21pm

What are you calling hash for? What are you using the result for?

bsuwal · January 23, 2021, 4:35pm

I use hash(Node) as keys to my Dict, which is of the form Dict{UInt64, Int64}. My dictionary gets gigantic and I was hitting RAM issues, so I resorted to using the hash() of the Nodes as keys instead, which works for my use case. However, this means that every time I add to the dictionary I have to hash() my new Node, and likewise every time I want to look up a value.

kristoffer.carlsson · January 23, 2021, 4:37pm

Maybe you can use an IdDict instead?

bsuwal · January 23, 2021, 5:08pm

Thank you for the excellent suggestion, but my understanding of IdDict is that it uses the === operator, i.e. keys are unique by object identity rather than by value equality, which is not what I need. My Node objects are distinct objects, but I want uniqueness to be defined by value.

tisztamo · January 23, 2021, 7:39pm

Maybe cache the hash in the object itself and invalidate/update it when the stored values change?
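
A rough sketch of what this could look like, assuming an extra hash field is added to Node. The names full_hash and rehash! are made up for illustration, and the caller is responsible for calling rehash! after every mutation:

```julia
struct NodeEdge
    edge₁::UInt8
    edge₂::UInt8
end

struct ForbiddenPair
    comp₁::UInt8
    comp₂::UInt8
end

mutable struct Node
    label::NodeEdge
    comp::Vector{UInt8}
    comp_weights::Vector{UInt8}
    cc::UInt8
    fps::Vector{ForbiddenPair}
    comp_assign::Vector{UInt8}
    hash::UInt   # assumed extra field: cached hash of the fields above
end

# Expensive part: hash every field (hypothetical helper name).
full_hash(n::Node) =
    hash(n.label, hash(n.comp, hash(n.cc, hash(n.fps,
        hash(n.comp_weights, hash(n.comp_assign, hash(:Node, zero(UInt))))))))

# Recompute and store the cached hash (hypothetical helper name).
rehash!(n::Node) = (n.hash = full_hash(n); n)

# Convenience constructor that fills in the cache.
Node(label, comp, comp_weights, cc, fps, comp_assign) =
    rehash!(Node(label, comp, comp_weights, cc, fps, comp_assign, zero(UInt)))

# Hashing a Node now only mixes one integer.
Base.hash(n::Node, h::UInt) = hash(n.hash, h)

# Value equality, so that equal contents count as the same key.
Base.:(==)(a::Node, b::Node) =
    a.label == b.label && a.comp == b.comp && a.comp_weights == b.comp_weights &&
    a.cc == b.cc && a.fps == b.fps && a.comp_assign == b.comp_assign

n1 = Node(NodeEdge(1, 2), UInt8[1, 2], UInt8[3, 4], 0x02, ForbiddenPair[], UInt8[1, 1])
n2 = Node(NodeEdge(1, 2), UInt8[1, 2], UInt8[3, 4], 0x02, ForbiddenPair[], UInt8[1, 1])

d = Dict(n1 => 1)
found = d[n2]        # looked up by value, using the cached hash

n2.comp[1] = 0x09    # mutate a node that is NOT stored as a key ...
rehash!(n2)          # ... and refresh its cache by hand
```

Because Base.hash returns the cached value, Dict and Set pick it up without further changes. A Node that is already stored as a key should not be mutated at all, cached hash or not.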

roflmaostc · January 23, 2021, 8:51pm

Don’t we have basically three options?

1. You pay the price of completely hashing the full list.
2. You use the label as hash? (I mean, what’s the point of having a label if it’s not unique?)
3. What @tisztamo suggested.

bsuwal · January 25, 2021, 5:40am

Thank you so much for everyone’s help! What ended up working for me was @tisztamo's suggestion: storing the hash of the Node struct as a field inside Node itself, and updating it when the values change.

I am also using a Set to keep my Nodes and this solution entailed changing the hashindex function in Dict (because Sets are implemented as Dicts under the hood) from this:

hashindex(key, sz) = (((hash(key)%Int) & (sz-1)) + 1)::Int

to this:

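# hashindex is an internal (unexported) Base function; importing it by name lets this definition add a method to it
import Base: hashindex
hashindex(node::Node, sz) = (((node.hash%Int) & (sz-1)) + 1)::Int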

that is to say, I had to rewrite this function to check for the hash field instead of recomputing it.

This sped me up tons. Thanks!!

Jeff_Emanuel · January 25, 2021, 4:54pm

How do you detect when elements of the array fields change so you can update the cached hash value?

tisztamo · January 25, 2021, 7:36pm

Yeah, cache invalidation is hard…

https://www.karlton.org/2017/12/naming-things-hard/

I think there is no simple and general solution for this, at least if you want to allow manipulating the content from the “outside”, or you have a deeply nested structure.

But in this concrete case it seems possible to create a custom array type with overloaded setindex!, that either notifies the container to invalidate the hash-cache, or calculates the diff of the hash and updates the cache - which one is better depends on updating patterns, I think.
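
A sketch of the invalidation variant, assuming a hypothetical wrapper type HashedBytes (the name and fields are made up). It only intercepts element writes through setindex!; resizing operations such as push! are not supported here, and mixing HashedBytes with plain Vectors as keys in the same Dict would break the hash invariant, because the two hash differently while comparing equal:

```julia
# Hypothetical wrapper: a byte vector that remembers its own hash.
mutable struct HashedBytes <: AbstractVector{UInt8}
    data::Vector{UInt8}
    cached::UInt    # last computed hash of data
    valid::Bool     # false when data changed since cached was computed
end

# Copy the input so nobody can modify the bytes behind the wrapper's back.
HashedBytes(data::Vector{UInt8}) = HashedBytes(copy(data), zero(UInt), false)

# Minimal read interface for an AbstractVector.
Base.size(v::HashedBytes) = size(v.data)
Base.getindex(v::HashedBytes, i::Int) = v.data[i]
Base.IndexStyle(::Type{HashedBytes}) = IndexLinear()

# Every element write marks the cached hash as stale.
function Base.setindex!(v::HashedBytes, x, i::Int)
    v.data[i] = x
    v.valid = false
    return v
end

# Recompute only when stale, then mix the cached value into h.
function Base.hash(v::HashedBytes, h::UInt)
    if !v.valid
        v.cached = hash(v.data, zero(UInt))
        v.valid = true
    end
    return hash(v.cached, h)
end

v = HashedBytes(UInt8[1, 2, 3])
h1 = hash(v)                 # computes and caches
was_valid = v.valid
v[2] = 0x09                  # goes through setindex!, cache is now stale
stale = !v.valid
h2 = hash(v)                 # recomputed on demand
```

Using such a type for the Vector{UInt8} fields of Node would let the Node-level hash combine already-cached values. The diff-updating variant would need a hash that can be adjusted per element, which this sketch does not attempt.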
