Alternatives to `Base.hash`?

Are there packages that implement alternatives to Base.hash? I am interested in a somewhat different set of tradeoffs than what Base.hash provides:

  • Should be able to hash any object (like Base.hash)
  • Should have a negligible probability to produce collisions (unlike Base.hash).
  • Can be slower than Base.hash, calculated hash can take more bytes than UInt.
  • Should depend on object structure, not object identity. For instance Base.hash(MyMutable(1)) != Base.hash(MyMutable(1)), but I want this equality to hold.
  • Should be platform-independent. If I call hash(MyObject(42)), the result should be the same on different computers, days, operating systems, julia versions (but same package versions).

Does a package exist that provides such a hash or something close to it? If not are there recommendations for an algorithm to implement?

You’re free to implement Base.hash on your own type, to avoid the fallback to objectid:

help?> hash
search: hash hasmethod haskey hasfield hasproperty skipchars Threads

  hash(x[, h::UInt]) -> UInt

  Compute an integer hash code such that isequal(x,y) implies
  hash(x)==hash(y). The optional second argument h is a hash code to be
  mixed with the result.

  New types should implement the 2-argument form, typically by calling
  the 2-argument hash method recursively in order to mix hashes of the
  contents with each other (and with h). Typically, any type that
  implements hash should also implement its own == (hence isequal) to
  guarantee the property mentioned above. Types supporting subtraction
  (operator -) should also implement widen, which is required to hash
  values inside heterogeneous arrays.

There is AutoHashEquals.jl if you want to automate this a little, though you’ll have to roll your own if you want to avoid some fields.

May I ask, what’s your motivation behind “able to hash any object”? Why do you think Base.hash is likely to produce collision? Avoiding that requires knowing the distribution of your input data, which directly goes against being able to hash any object - there is no universally optimal hash function.

3 Likes

Also of interest are probably

https://github.com/JuliaLang/julia/issues/40717

and

https://github.com/JuliaLang/julia/issues/4648

see also

2 Likes

Thanks @Sukera for the quick replies.

You’re free to implement Base.hash on your own type, to avoid the fallback to objectid

This works only for types I “own” for other types this is piracy.

Why do you think Base.hash is likely to produce collision?

Because it does not even depend on all the input bits:

N = 10^5
x1 = randn(N)
x2 = copy(x1)
x2[10000] = 0
@assert hash(x1) == hash(x2)

May I ask, what’s your motivation behind “able to hash any object”?

Right now I have is some sort of caching mechanism. I run simulations for lots of input parameters and want to cache things on hard disk. One simple way is to save the results of a simulation to a file whose name is the hash of the parameters.
But really needing different tradeoffs from Base.hash is a recurrent issue for me.

Avoiding that requires knowing the distribution of your input data, which directly goes against being able to hash any object - there is no universally optimal hash function.

In the theoretic limit yes, in practice, there are many hash functions like SHA3 without any known collisions.

Thanks, it is a workaround, but it has some issues:

  • Calculating a hash then requires writing and reading from disk
  • The hash is only as stable as the serialization mechanism

SHA and variants are only defined for String (or equivalently, Vector{UInt8}. The hard part isn’t a good hashing algorithm, it’s expanding it to be able to hash anything.

1 Like

Yes so I think there are three things that need to be done:

  1. Given a blob of contiguous bytes, provide a hash. This can be done with SHA etc
  2. Given hashes of the fields of an object and the hash of its type mix it into a single hash
  3. Produce a hash of a type

I think for 1. and 2. there exist lots of good and well-known algorithms I guess, but I am not experienced with this and don’t know what to pick. So I would appreciate recommendations.
3. is a different kind of more julia specific problem. I guess Base.hash uses objectid here, but that is probably not very stable. Another way would be to use the name of the type, but one needs to be careful about the module in which the type is defined. Not 100% sure how to do this.

@haberdashPI wrote a package StableHashTraits.jl that does some of this. It’s not perfect at avoiding collisions, e.g. the function sin and the string "Base.sin" hash the same, but it’s pretty useful nonetheless. It’s not open source yet but should be soon (in the next few days hopefully).

2 Likes

Sounds awesome, looking forward to trying it out.

2 Likes

Ok, reading through @edit hash(x1) yeah it doesn’t take all elements into account for performance reasons. Seems like that code is 4+ years old by now, written by @mbauman - maybe you can shed some light on this? Could this be changed due to better vectorization by now?

I see! So you’re checking whether you’ve already done a simulation by hashing the parameters and checking whether that hash is on disk already? For that something like sha3 does seem more appropriate

There’s nothing wrong with using a different function for your usecase though, is there? A generic fallback like e.g. my_hash(a::Any) = my_hash(transcode(UInt8, string(a))) (or maybe serialize(a ) instead) seems appropriate, but my guess would be that you don’t actually need to incorporate ALL of your input parameters (like arrays) into your cache. You’re bound to generate those arrays somehow, right? You can either save the arrays and reference a name they’re saved with in your hash, or save the input generation parameters for array generation and hash those instead. Both approaches should keep your research perfectly reproducible, while also avoiding the pitfalls you’ve mentioned.

Base.hash has to make different tradeoffs too - it’s a general purpose hash and not supposed to be a cryptographic hash like SHA3. The whole SHA suite is available as the SHA stdlib, so if you really want to use it you’re free to use SHA.sha3_256(string(my_data)) as your file name. Unlike the SHA family of hashing functions, Base.hash has as its purpose very general hashing to be used in Dict or Set, while the SHA family is supposed to produce more or less random output. Computing a whole SHA may be less than desirable, simply because the output state is much too large for efficient memory use of those collections (you want to use as little memory as possible while also producing as few collisions as possible. This sadly requires making assumptions about your data or optimizing for the average case). That’s why I asked about what you’re doing with it.

Ohhh, how deep do you want to go? We used to hash every element in every AbstractArray. This led to a situation where you could hang a Julia session to the heat death of the universe just by putting the range 1:2^60 into a dictionary (#5778) or trying to hash sprand(10^8, 10^8, 10^-8). So ranges got special cased (at the expense of equality) (#6084). And then we started hashing the run-length encoding of arrays to support sparse matrices (#10167). But the inequality between ranges and arrays was a major thorn… so we moved to the diff of the RLE to support linear spacing of elements (#16401). But this meant that hashing required elements to support - (and widening, too) or be numeric or figure out that they don’t support it to do a different form of hashing and it was a major can of worms.

So that path led us to the idea of only hashing some distinct elements. The exact scheme to select which elements to use went through a lot of permutations, but the basic motivation was:

4 Likes

No it doesn’t. You can use write (or serialize, or …) to write bytes to a buffer, and then hash that. e.g.

using CRC32c
function myhash(x, h::UInt32=UInt32(0))
    io = IOBuffer()
    write(io, x)
    return crc32c(take!(io), h)
end
4 Likes

I’ve only noticed this now, but isn’t this strictly more powerful than hashing…? After all, serialization is the process of creating an exact & reversible representation. This is not the case with hashes, because they collide - they’re not a bijection, by design.

Now it is! GitHub - beacon-biosignals/StableHashTraits.jl: Compute hashes over any Julia object simply and reproducibly

1 Like