I have Julia objects and want to hash them such that:
If two objects are equal, then their hashes are equal. The converse also holds with overwhelming probability.
The hash should not change across sessions. E.g. hashing an object, saving it to JLD, rebooting, loading it, and hashing again yields the same answer.
Ideally this would work for objects of arbitrary type, but I guess that’s an endless rabbit hole. So I am happy if it works for bitstypes, Strings, Symbols and immutables built from these.
What is a reasonable way to achieve this? Is there maybe some package? The hash function in base seems to be inconsistent across sessions.
The hash function in Base frequently computes a hash from the address of the object, for performance reasons. As you have seen, that makes the hash value useless across sessions or processes.
I don’t believe there currently is any such function. It might best be added to Base, so that it can use the current hash methods when possible.
(I don’t think it would be that difficult, maybe you could submit a PR to do so)
Address-based hashing is used for things like symbols, which are interned: you cannot have two symbols with the same string contents at different addresses. It avoids an expensive O(N) operation to calculate a hash value.
Note: this causes us fairly large hassles, because you can’t precompile Dicts reliably. If the values being hashed use a hash based on the pointer (ObjectId), possibly for something nested several levels deep in a structure, the Dict will not be recreated correctly when the module is reloaded (you have to put the code in __init__()).
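As a rough sketch of the __init__() workaround mentioned above: the table is declared empty at the top level and only filled when the module is loaded, so its keys are hashed in the running session. The module name `Registry` and the table `HANDLERS` are hypothetical, chosen only for illustration.

```julia
module Registry  # hypothetical module, for illustration only

# Declared empty, so no hashed entries are stored with the precompiled module.
const HANDLERS = Dict{Symbol,Function}()

function __init__()
    # Runs each time the module is loaded, so the keys are hashed
    # in the current session rather than at precompile time.
    HANDLERS[:upper] = uppercase
    HANDLERS[:lower] = lowercase
    return nothing
end

end # module

println(Registry.HANDLERS[:upper]("palli"))
```

The cost of this pattern is that the table is rebuilt on every load instead of being cached.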
Whether that is a known limitation of the implementation, bug, design flaw, or whatever, I’ll leave up to other people!
It wouldn’t? [in general, I’m not aware that you’re required to provide hash for your objects to make that happen.]
“implies” below (as always) doesn’t work from right-to-left:
help?> hash
Compute an integer hash code such that isequal(x,y) implies hash(x)==hash(y). The optional second argument h is a hash code to be mixed with the result.
julia> @edit hash("Palli")
hash(x::Any) = hash(x, zero(UInt))
..
## hashing general objects ##
hash(x::ANY, h::UInt) = 3*object_id(x) - h
[It seems strings are not interned.]
julia> object_id("Palli")
0x2d22622f8fa6f389
julia> object_id("Palli")
0x7d05cdd0cbf4b7f7
[Still, the two hashes below are equal, because of the special-case hash method for strings shown further down.]
julia> hash("Palli")
0xd4199fe90d0f820b
julia> hash("Palli")
0xd4199fe90d0f820b
function hash(s::Union{String,SubString{String}}, h::UInt)
    h += memhash_seed
    # note: use pointer(s) here (see #6058).
    ccall(memhash, UInt, (Ptr{UInt8}, Csize_t, UInt32), pointer(s), sizeof(s), h % UInt32) + h
end
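To make the difference between the generic fallback and a content-based method concrete, here is a small sketch. It assumes a current Julia version, where `object_id` is spelled `objectid`; the types `Box` and `ValueBox` are hypothetical.

```julia
# Hypothetical types for illustration.
mutable struct Box        # no custom hash: uses the generic, identity-based fallback
    v::Int
end

mutable struct ValueBox   # opts in to content-based equality and hashing
    v::Int
end
Base.:(==)(a::ValueBox, b::ValueBox) = a.v == b.v
Base.hash(a::ValueBox, h::UInt) = hash(a.v, h)

a, b = Box(1), Box(1)
c, d = ValueBox(1), ValueBox(1)

println(objectid(a) == objectid(b))                  # two distinct objects
println(hash(c) == hash(d))                          # same contents, same hash
println(hash("Palli") == hash(string("Pal", "li")))  # strings hash by content
```

Two `Box` values with the same contents are still different objects, so a Dict keyed on them would typically treat them as different keys, while `ValueBox` behaves like the string case.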
Note: I think this could be solved by having the serialization / deserialization functions not save the hash table, just the pairs, and be smart enough to rehash the Dicts upon deserialization.
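At the user level, the same idea can be approximated today by storing only the pairs and rebuilding the Dict on load. This is a minimal sketch using the Serialization standard library; `save_pairs` and `load_dict` are hypothetical helper names, and this is not how Base's own serialization or precompilation currently works.

```julia
using Serialization  # standard library

# Hypothetical helpers: store only the key/value pairs, not the Dict itself.
function save_pairs(io::IO, d::Dict)
    serialize(io, collect(d))   # a Vector of Pairs, without the table layout
    return io
end

# The Dict constructor hashes every key again, in the loading session.
load_dict(io::IO) = Dict(deserialize(io))

d = Dict(:a => 1, :b => 2)
buf = IOBuffer()
save_pairs(buf, d)
seekstart(buf)
restored = load_dict(buf)
println(restored == d)
```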
I agree this is easy for stuff built from String, Symbol, bitstype… but the general case is very hard, and it is not even clear what the correct behaviour is. For example, think of a function depending on global variables.
For objects like that, I think it would have to use reflection to build a hash based on all of the values of the fields.
Of course, you’d still need to have special case code for types that are holding pointers, for example.
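A minimal sketch of that reflection approach, assuming a current Julia version and the SHA standard library. All names (`tag!`, `feed!`, `stable_hash`, `Point`, `Labeled`) are hypothetical, and SHA-256 is swapped in here in place of Base's hash. Two caveats of this sketch: values are tagged with their type name, so `1` and `1.0` hash differently even though `isequal(1, 1.0)` holds, and `nameof` ignores the module and type parameters. Raw bytes are written in host byte order, so the digest is only claimed to be stable on machines with the same endianness.

```julia
using SHA  # standard library, provides sha256

# All names below are hypothetical helpers, not part of Base.
# Each method writes a type tag, then the content, into a byte buffer.
tag!(io::IO, x) = write(io, string(nameof(typeof(x))), 0x00)

function feed!(io::IO, x::Union{Bool,Int64,UInt64,Float64})
    tag!(io, x)
    write(io, x)   # raw bytes in host byte order (an assumption of this sketch)
    return io
end

function feed!(io::IO, x::Union{String,Symbol})
    s = String(x)
    tag!(io, x)
    write(io, Int64(sizeof(s)))   # length prefix keeps adjacent fields apart
    write(io, s)
    return io
end

# Reflection fallback: immutable structs are hashed field by field.
# Anything else (pointers, mutable objects, ...) is rejected.
function feed!(io::IO, x::T) where {T}
    (isstructtype(T) && !ismutable(x)) ||
        throw(ArgumentError("no stable hash defined for $T"))
    tag!(io, x)
    for i in 1:fieldcount(T)
        feed!(io, getfield(x, i))
    end
    return io
end

stable_hash(x) = bytes2hex(sha256(take!(feed!(IOBuffer(), x))))

struct Point
    x::Int64
    y::Float64
end
struct Labeled
    tag::Symbol
    name::String
    p::Point
end

a = Labeled(:a, "Palli", Point(1, 2.0))
b = Labeled(:a, string("Pal", "li"), Point(1, 2.0))
println(stable_hash(a) == stable_hash(b))
```

Rejecting unknown types by default is a deliberate choice here: a pointer or a mutable container would need its own `feed!` method that says what "content" means for it.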