Hash Collision in Memoization

Using the Caching.jl and LRUCache.jl packages and reading Tom Kwong’s book on the memoization pattern, I noticed an edge case that never seems to be considered: no matter how wide the hash, two different inputs can produce the same hash value.

Granted, a collision is almost impossible with a 64-bit hash, but it could still be an issue for functions expensive enough to benefit from caching despite the overhead of this pattern.

I know that a Dict in Julia is implemented as a hash map, and it can handle collisions itself; however, if the input to the Dict is already a hash, two inputs with identical hashes would map to the same value.

Is there another way to safely serialize the information, instead of hashing it, that would guarantee a unique key for the dictionary?

Otherwise, the other option is to store all the (input, output) tuples for every hash and check whether

  1. the hashed key exists
  2. the input key matches one of the keys stored in that entry

If they match, just retrieve the value; otherwise, append a new (input, output) pair to the entry:

hkey = hash(key)
if haskey(dict, hkey)
    entries = dict[hkey]                      # Vector of (input, output) pairs
    i = findfirst(e -> e[1] == key, entries)  # compare the actual inputs
    i !== nothing && return entries[i][2]     # true hit: return the cached output
    push!(entries, (key, val))                # collision: chain the new pair
else
    dict[hkey] = [(key, val)]
end

I understand that this is an uncommon occurrence, but if it did happen, it could silently corrupt an entire simulation that relied on the pattern for efficiency. In my opinion, chaining the outputs could be a more robust solution, even if it requires additional memory.
Sometimes an entry holding multiple values would be evicted from the dictionary by the LRU policy. But in that case, the values stored under that entry were almost never used, so redoing the computation wouldn’t be much of a loss.

Another option would be an efficient data structure in Julia that addresses this problem automatically.

You’re overthinking this. Both of these packages use Dicts to store the mapping, and Dicts already handle the possibility of collisions. If you want to serialize this data, you can simply serialize the Dict, which will preserve the correct behavior.

Caching.jl seems to use a Dict keyed on the hash, which is problematic, as called out in the OP: https://github.com/zgornel/Caching.jl/blob/16997d2626b42b71766cf342d9e11545243b6260/src/cache.jl#L39

And caches that support spill-to-disk can have issues with this as well.

But LRUCache.jl looks ok to me.

If collisions are an issue, the hash function (the Julia one) should be replaced/updated. In practical use cases, it should not be a problem.
Adding the input to the key sounds like a bad idea, as inputs can potentially be large. Most probably, there are safer ways of handling key creation, like mixing several hashes together before adding the entry to the Dict.
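For what it’s worth, Julia’s two-argument hash already supports that kind of chaining; a minimal sketch (the argument values are made up for illustration):

```julia
# Julia's two-argument hash mixes a new value into a running hash,
# which is exactly the "concatenate several hashes" idea.
args = (42, "x", [1.0, 2.0])
h = foldl((acc, a) -> hash(a, acc), args; init = zero(UInt))

# The combined value is deterministic for equal inputs:
h2 = foldl((acc, a) -> hash(a, acc), (42, "x", [1.0, 2.0]); init = zero(UInt))
# h == h2, but, like any fixed-width hash, it can still collide in principle.
```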

I thought that the Dict only hashed the input internally to resolve collisions (i.e. it computes a small hash for the key but stores both the key and the value). However, both of these packages first transform the input into a hash and then use that hash as the dictionary key.

This means there is the possibility that two different inputs map to the same key:

([1,2,3])   -> hashing -> KEY
([5,7,2,3]) -> hashing -> KEY

d[KEY]
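To make the failure mode concrete, here is a minimal sketch (with a made-up hash value) of what happens when the Dict is keyed on the hash itself rather than on the input:

```julia
# Sketch: if the Dict key is the hash, a collision silently overwrites.
d = Dict{UInt,String}()
h = UInt(0x1234)               # pretend hash([1,2,3]) == hash([5,7,2,3]) == h
d[h] = "output for [1,2,3]"
d[h] = "output for [5,7,2,3]"  # clobbers the first entry, no error raised
length(d)                      # 1: only the later value survives
```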

The reason both packages hash the input is to allow a variable number of arguments, but in doing so they create another possible (though rarer) problem.

I checked both packages’ source code and couldn’t find any check for this eventuality (I also checked the source of Base, and saw that Dict is implemented as a hash map that stores the actual keys).

How does LRUCache.jl deal with it? I tried to look into the code but I didn’t see anything. Can you help me? Thanks

Check out get here and _unsafe_getindex here

I have not personally looked into the package, so I can’t say more about its hash-collision properties…


It uses Dict{K, V}, where K is the key type (as opposed to Dict{UInt, V}, where UInt is the type of the hash output, as in Caching.jl), and relies on the Dict to handle collisions. The rest of the code is about locking (for thread safety) and tracking the most recently used entries for cache eviction. In particular, searching the source for hash reveals no results, which is a good sign: the package isn’t handling the hashing (and hash collisions) itself but defers to a data structure that already handles them correctly (Dict).
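A small demonstration of that deferral (my own sketch, using a wrapper type whose hash is deliberately constant so every lookup collides):

```julia
# A type whose hash always collides, to show Dict resolving collisions
# by comparing the stored keys, not just their hashes.
struct BadHash
    x::Int
end
Base.hash(::BadHash, h::UInt) = hash(0, h)        # every value hashes the same
Base.:(==)(a::BadHash, b::BadHash) = a.x == b.x   # isequal falls back to ==

d = Dict{BadHash,String}()
d[BadHash(1)] = "one"
d[BadHash(2)] = "two"
# After the hash lookup, Dict compares the stored keys with isequal,
# so both entries survive despite identical hashes:
d[BadHash(1)]   # "one"
d[BadHash(2)]   # "two"
```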

BTW, the unsafe_ stuff there hopefully isn’t too scary. It refers to operations that must take place within the lock for thread safety, to keep the components of the data structure in sync so that updates are observed “atomically”. It doesn’t have much to do with hashing or collisions.


Oh yeah that was clear.

So the only limitation is that it cannot accept a variable number of arguments, but it would be OK if the input is a

Tuple{Int,Int,Float64,Vector{Float64}}

Does it work even if the Vector can change size from input to input?

The use case for me is computing expensive basis functions at fixed points. Since they are created through recursion, saving them reduces the computation (and I can reuse the partial results if the degree of the polynomial changes, or if I refine the mesh, since the support is local and most of the functions stay the same).

Yes, LRUCache is very general and should be able to contain anything (just like Dict), after all, it is really just a Dict with a lock for thread safety, plus some extra handling so it doesn’t grow too big. You can customize how it measures size (by) and the maximum size it will grow to (maxsize).

If you don’t need to be concerned about it growing too large (perhaps it can’t grow too large by design in your usage and you can reset it every function call), and you don’t need thread safety, then a regular Dict will work just as well.
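A sketch of that plain-Dict variant, keyed on the full input tuple so the Vector’s length and contents are part of the key (the names and the expensive placeholder are mine):

```julia
# Memoize on the full input tuple; Dict hashes AND stores the key,
# so colliding hashes are resolved by comparing the actual tuples.
const CACHE = Dict{Tuple{Int,Vector{Float64}},Float64}()

expensive(n, xs) = sum(xs)^n           # placeholder for the real computation

memoized(n::Int, xs::Vector{Float64}) =
    get!(() -> expensive(n, xs), CACHE, (n, xs))

memoized(2, [1.0, 2.0])         # computed: 9.0
memoized(2, [1.0, 2.0])         # cache hit, same key
memoized(2, [1.0, 2.0, 3.0])    # different-length Vector, distinct entry
```

One caveat: since the Vector is stored as part of the key, mutating it after caching would corrupt the Dict, so copy the input if it may be mutated later.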

For recursion, I would definitely make sure the size is set large enough so you don’t recompute many elements.
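As an illustration of that point, here is a memoized recursion in the same spirit (Bernstein polynomials as a stand-in for the actual basis functions, and a plain Dict rather than an LRU so nothing is evicted mid-recursion):

```julia
# Memoized recursive basis evaluation, keyed on (index, degree, point).
# B(i, n, t) = (1 - t) * B(i, n-1, t) + t * B(i-1, n-1, t)
const BCACHE = Dict{Tuple{Int,Int,Float64},Float64}()

function bernstein(i::Int, n::Int, t::Float64)
    (i < 0 || i > n) && return 0.0
    n == 0 && return 1.0
    get!(BCACHE, (i, n, t)) do   # compute and store only on a cache miss
        (1 - t) * bernstein(i, n - 1, t) + t * bernstein(i - 1, n - 1, t)
    end
end

# Raising the degree from n to n+1 reuses all the degree-<=n partial results.
bernstein(1, 3, 0.5)   # 0.375
```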
