Persistent hash of julia object


#1

I have julia objects and want to hash them such that:

  1. If two objects are equal, then their hashes are equal. The other way round is also true with overwhelming probability.
  2. The hash should not change across sessions. E.g. hashing an object, saving to jld, rebooting, loading, hashing yields the same answer.
  3. Ideally this would work for objects of arbitrary type, but I guess that’s an endless rabbit hole. So I am happy if it works for bitstypes, Strings, Symbols and immutables build up of these.

What is a reasonable way to achieve this? Is there maybe some package? The hash function in base seems to be inconsistent across sessions.


#2

The hash function in base frequently creates a hash based on the address of the object, for performance reasons, however, as you have seen, that hash value is not useful across sessions or processes.

I don’t believe there currently is any such function. It might best be added to Base, so that it can use the current hash methods when possible.
(I don’t think it would be that difficult, maybe you could submit a PR to do so)


#3

I was not aware of that. How would that satisfy == for hash, if the two objects are otherwise equal but they are at different addresses?


#4

This is used for things like symbols, where they are interned, and you cannot have two symbols having the same string contents at different addresses. It avoids having to do an expensive O(N) operation to calculate a hash value.


#5

Note: this is the cause of fairly large hassles for us, because you can’t precompile Dicts reliably, because if the values being hashed use a hash based on the pointer (ObjectId) (which can mean something contained various levels deep in a structure), will not be recreated correctly when the module is reloaded (you have to put the code in __init__()).
Whether that is a known limitation of the implementation, bug, design flaw, or whatever, I’ll leave up to other people!


#6

It wouldn’t? [in general, I’m not aware that you’re required to provide hash for your objects to make that happen.]

“implies” below (as always) doesn’t work from right-to-left:

help?> hash

Compute an integer hash code such that isequal(x,y) implies hash(x)==hash(y). The optional second argument h is a hash code to be mixed with the result.

julia> @edit hash("Palli")

hash(x::Any) = hash(x, zero(UInt))

..

## hashing general objects ##

hash(x::ANY, h::UInt) = 3*object_id(x) - h

[Seems strings are not interned.]

julia> object_id("Palli")
0x2d22622f8fa6f389

julia> object_id("Palli")
0x7d05cdd0cbf4b7f7


[Still this still happens, because of special case hash below]

julia> hash("Palli")
0xd4199fe90d0f820b

julia> hash("Palli")
0xd4199fe90d0f820b

function hash(s::Union{String,SubString{String}}, h::UInt)
    h += memhash_seed
    # note: use pointer(s) here (see #6058).
    ccall(memhash, UInt, (Ptr{UInt8}, Csize_t, UInt32), pointer(s), sizeof(s), h % UInt32) + h
end

#7

Note: I think this could be solved by having the serialization / deserialization functions not save the hash table, just the pairs, and be smart enough to rehash the Dicts upon deserialization.


#8

hash should satisfy all your requirements.


#9

With the caveat that we don’t guarantee it to be stable across multiple julia versions and the return type is different on 32 vs 64bits.


#10

I agree this is easy for stuff build from String, Symbol, bitstype… but the general case is very hard/it is not even clear what the correct behaviour is. For example think of a function depending on global variables.


#11

Okay the following behaviour is a problem for me:

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.1-pre+27 (2016-11-04 08:41 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 25f7066 (73 days old release-0.5)
|__/                   |  x86_64-linux-gnu

julia> immutable Foo
       s::String
       end

julia> Foo("a") |> hash
0xe7585c2a6732e304

julia> exit()
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.1-pre+27 (2016-11-04 08:41 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 25f7066 (73 days old release-0.5)
|__/                   |  x86_64-linux-gnu

julia> immutable Foo
       s::String
       end

julia> Foo("a") |> hash
0x56df03a6764b2325

#12

For objects like that, I think it would have to use reflection to build a hash based on all of the values of the fields.
Of course, you’d still need to have special case code for types that are holding pointers, for example.


#13

For custom types you have to overload a hash

Try using AutoHashEquals if you are lazy.