Hashes change whenever package is pre-compiled

e3c6 · April 14, 2020, 10:19am

I have a very simple package I am developing, say defined like this:

module MyPkg
struct A
   a::Int
end
end # module

Now in a Julia session, I import this package, and compute a hash of an object:

julia> using MyPkg # first use, triggers precompilation
julia> hash(MyPkg.A(5)) # returns a hash (A is not exported, so it is qualified)

Next, I do some modification of the package source code,

module MyPkg
struct A
   a::Int
end
f() = 2 # modification of package source code
end # module

So next time I use this package, a precompilation is triggered again:

julia> using MyPkg # precompiles again
julia> hash(MyPkg.A(5)) # returns a different hash than before!

Even though the type A has not changed, for some reason I now get a different hash for A(5). Two questions:

Why is this happening? Why is the hash not the same on the second use?
How can I get a “deterministic” hash that doesn’t change whenever the package is precompiled?

Note that these are “minor” modifications: neither the package identity (name, uuid) nor the definition of type A has changed. So what does the hash depend on that is changing here?

Tamas_Papp · April 14, 2020, 11:07am

The fallback implementation of hash uses objectid, which is very fast but can change from session to session.

The solution is writing your own hash function.
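As a minimal sketch of what such a method could look like for the struct A from the question: hash only the field values, mixed with a fixed seed, so that the result no longer goes through objectid. The seed constant below is an arbitrary value chosen for illustration, and the matching == method is added to keep equality consistent with the hash. This assumes the hashes of the field types are themselves stable across sessions; Base's hash is still not guaranteed to give the same values across Julia versions.

```julia
struct A
    a::Int
end

# Arbitrary, hand-picked seed (an assumption for this sketch); it only serves
# to separate hashes of A from hashes of a bare Int with the same value.
const A_HASH_SEED = 0x5f3a9c1d7e2b4a60 % UInt

# Hash the field values instead of falling back to objectid.
Base.hash(x::A, h::UInt) = hash(x.a, hash(A_HASH_SEED, h))

# Keep equality consistent with the custom hash.
Base.:(==)(x::A, y::A) = x.a == y.a
```

With these two methods, hash(A(5)) depends only on the seed and the value 5, not on when or how the defining module was compiled.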

e3c6 · April 14, 2020, 11:18am

Is it possible to have a generic hash function, that doesn’t have this instability?

kristoffer.carlsson · April 14, 2020, 11:43am

https://github.com/andrewcooke/AutoHashEquals.jl

e3c6 · April 14, 2020, 11:47am

Thanks, I think I understand what the problem is. However it seems that the hash should not change if the identity of the object doesn’t change?

e3c6 · April 14, 2020, 11:48am

AutoHashEquals needs to be used at the definition of the type. I meant I would like to have a generic hash that can be used for an outside type (probably defined without AutoHashEquals), and not have this instability.

kristoffer.carlsson · April 14, 2020, 11:56am

Not sure what you mean here. myhash(x) = UInt64(1) would satisfy your criteria.

e3c6 · April 14, 2020, 12:11pm

Yes but that would have too many collisions. I mean satisfying the normal typical requirements of a hash function (as few collisions as possible), but still being consistent even if the package defining a type has to be precompiled (if the type definition itself does not change).

krrutkow · April 14, 2020, 12:16pm

What you are expecting is a persistent hash function, which is not really what hash is meant for. Look to something like https://github.com/staticfloat/SHA.jl for supporting persistent hashing. As a core/builtin function, hash should have the goal of being very performant without excess functionality. I use the fact that it can produce different hashes in different Julia sessions to actually uncover bugs in code. Not my code, of course…

e3c6 · April 14, 2020, 12:23pm

Thanks. Is https://github.com/staticfloat/SHA.jl the same as the stdlib SHA? Unfortunately, it seems these functions only accept certain kinds of arguments (strings, Array{UInt8}, or IO objects).

krrutkow · April 14, 2020, 12:30pm

Yeah, it might be in stdlibs now actually, been a while since I used it directly… You might need to create a method to dispatch for your types, or a generic method to iterate over the propertynames of types to build the SHA.
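A rough sketch of the generic approach, assuming the SHA stdlib is available: walk the fields of a value recursively and feed the type name, the field names, and the printed form of the leaf values into a SHA-256 context. The names stable_update! and stable_hash are hypothetical helpers made up for this example, not part of SHA. It only aims to cover plain structs built from primitive values, strings, and symbols; arrays, dictionaries, and other containers would need their own methods, and no separators are written between pieces, so it is not collision-proof by design.

```julia
using SHA

# Hypothetical helper: feed a description of `x` into the SHA context `ctx`.
function stable_update!(ctx, x)
    T = typeof(x)
    # The type's name (without its module) goes in first.
    SHA.update!(ctx, Vector{UInt8}(string(nameof(T))))
    if isprimitivetype(T) || x isa AbstractString || x isa Symbol
        # Leaf value: use its printed representation.
        SHA.update!(ctx, Vector{UInt8}(repr(x)))
    else
        # Composite value: recurse over field names and field values.
        for name in fieldnames(T)
            SHA.update!(ctx, Vector{UInt8}(string(name)))
            stable_update!(ctx, getfield(x, name))
        end
    end
    return ctx
end

# Hypothetical entry point: hex-encoded SHA-256 digest of the description.
stable_hash(x) = bytes2hex(SHA.digest!(stable_update!(SHA.SHA256_CTX(), x)))
```

For the type in the question, stable_hash(MyPkg.A(5)) would depend only on the name A, the field name a, and the value 5, so recompiling the package should not change it. It will be much slower than hash, which is the trade-off discussed in this thread.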

Tamas_Papp · April 14, 2020, 12:46pm

That would be an option, but note the performance trade-off: for many cases objectid is a good choice.

Generally I think it would make sense to have various predefined implementations of hashing (and of course equality, to be consistent), to which one can opt-in with traits.

e3c6 · April 14, 2020, 12:49pm

Could also be implemented in separate packages. Something like SHA mentioned above, if it handled generic types.