Are there packages that implement alternatives to Base.hash? I am interested in a somewhat different set of tradeoffs than what Base.hash provides:
Should be able to hash any object (like Base.hash)
Should have a negligible probability of producing collisions (unlike Base.hash).
Can be slower than Base.hash; the calculated hash can take more bytes than a UInt.
Should depend on object structure, not object identity. For instance, Base.hash(MyMutable(1)) != Base.hash(MyMutable(1)) for a mutable type, but I want these two hashes to be equal.
Should be platform-independent. If I call hash(MyObject(42)), the result should be the same on different computers, days, operating systems, and Julia versions (but same package versions).
Does a package exist that provides such a hash, or something close to it? If not, are there recommendations for an algorithm to implement?
You're free to implement Base.hash on your own type, to avoid the fallback to objectid:
help?> hash
search: hash hasmethod haskey hasfield hasproperty skipchars Threads
hash(x[, h::UInt]) -> UInt
Compute an integer hash code such that isequal(x,y) implies
hash(x)==hash(y). The optional second argument h is a hash code to be
mixed with the result.
New types should implement the 2-argument form, typically by calling
the 2-argument hash method recursively in order to mix hashes of the
contents with each other (and with h). Typically, any type that
implements hash should also implement its own == (hence isequal) to
guarantee the property mentioned above. Types supporting subtraction
(operator -) should also implement widen, which is required to hash
values inside heterogeneous arrays.
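For example, a minimal sketch along those lines (MyMutable is a stand-in for your own type): by default a mutable struct hashes via objectid, so two structurally equal instances hash differently; defining == and the 2-argument hash restores structural hashing.

```julia
# Hypothetical example type. Without the methods below, hash falls back to
# objectid, so hash(MyMutable(1)) != hash(MyMutable(1)).
mutable struct MyMutable
    x::Int
end

# Define == field-by-field and a hash that is consistent with it,
# mixing in a type tag so different wrapper types don't collide.
Base.:(==)(a::MyMutable, b::MyMutable) = a.x == b.x
Base.hash(a::MyMutable, h::UInt) = hash(a.x, hash(:MyMutable, h))

hash(MyMutable(1)) == hash(MyMutable(1))  # now true
```

Note that this still goes through Base.hash for the fields, so the result is not guaranteed to be stable across Julia versions; it only fixes the structure-vs-identity point.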
There is AutoHashEquals.jl if you want to automate this a little, though you'll have to roll your own if you want to exclude some fields.
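If I remember its interface correctly, that looks roughly like the sketch below (Params is a made-up type); the macro defines == and hash over all fields of the struct.

```julia
using AutoHashEquals

# Defines field-wise == and hash for the struct, so structurally
# equal instances compare and hash alike.
@auto_hash_equals struct Params
    a::Int
    b::Vector{Float64}
end

hash(Params(1, [2.0])) == hash(Params(1, [2.0]))  # true
```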
May I ask, what's your motivation behind “able to hash any object”? Why do you think Base.hash is likely to produce collisions? Avoiding that requires knowing the distribution of your input data, which directly goes against being able to hash any object - there is no universally optimal hash function.
May I ask, what's your motivation behind “able to hash any object”?
Right now I have a caching mechanism of sorts. I run simulations for lots of input parameters and want to cache things on disk. One simple way is to save the results of a simulation to a file whose name is the hash of the parameters.
But really needing different tradeoffs from Base.hash is a recurrent issue for me.
Avoiding that requires knowing the distribution of your input data, which directly goes against being able to hash any object - there is no universally optimal hash function.
In the theoretical limit, yes; in practice there are many hash functions, like SHA3, with no known collisions.
SHA and its variants are only defined for String (or, equivalently, Vector{UInt8}). The hard part isn't a good hashing algorithm, it's extending it to be able to hash anything.
Yes, so I think there are three things that need to be done:
1. Given a blob of contiguous bytes, provide a hash. This can be done with SHA etc.
2. Given the hashes of the fields of an object and the hash of its type, mix them into a single hash.
3. Produce a hash of a type.
I think for 1. and 2. there exist lots of good and well-known algorithms, but I am not experienced with this and don't know what to pick, so I would appreciate recommendations.
3. is a different kind of problem, more Julia-specific. I guess Base.hash uses objectid here, but that is probably not very stable. Another way would be to use the name of the type, but one needs to be careful about the module in which the type is defined. Not 100% sure how to do this.
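For what it's worth, here is a rough sketch of what those three steps could look like on top of the SHA stdlib; stable_hash and stable_bytes are hypothetical helper names, not an existing package API, and plenty of corner cases (BigInt, Dicts, circular references, endianness) are ignored.

```julia
using SHA

# 1. Given a blob of contiguous bytes, produce a hash (here SHA-256,
#    but any of the SHA family would do).
stable_hash(bytes::Vector{UInt8}) = sha2_256(bytes)

# Leaf values: take their raw bytes. Assumes bits types and a fixed byte
# order; for true platform independence, normalize endianness first.
stable_bytes(x::Union{Integer,AbstractFloat}) = collect(reinterpret(UInt8, [x]))
stable_bytes(s::AbstractString) = collect(codeunits(String(s)))

# 3. Hash a type by its fully qualified name instead of objectid, so the
#    result survives across sessions, machines, and Julia versions.
stable_bytes(::Type{T}) where {T} =
    collect(codeunits(string(parentmodule(T), '.', nameof(T))))

stable_hash(x::Union{Integer,AbstractFloat,AbstractString}) = sha2_256(stable_bytes(x))

# 2. Mix the type hash and the hashes of the fields into a single digest.
function stable_hash(x)
    ctx = SHA2_256_CTX()
    update!(ctx, stable_bytes(typeof(x)))
    for name in fieldnames(typeof(x))
        update!(ctx, stable_hash(getfield(x, name)))
    end
    return digest!(ctx)
end

# Containers need their own rule: hash the element hashes in order.
function stable_hash(v::AbstractArray)
    ctx = SHA2_256_CTX()
    update!(ctx, stable_bytes(typeof(v)))
    for el in v
        update!(ctx, stable_hash(el))
    end
    return digest!(ctx)
end
```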
@haberdashPI wrote a package StableHashTraits.jl that does some of this. It's not perfect at avoiding collisions, e.g. the function sin and the string "Base.sin" hash the same, but it's pretty useful nonetheless. It's not open source yet but should be soon (in the next few days hopefully).
Ok, reading through @edit hash(x1): yeah, it doesn't take all elements into account, for performance reasons. Seems like that code is 4+ years old by now, written by @mbauman - maybe you can shed some light on this? Could this be changed by now, thanks to better vectorization?
I see! So you're checking whether you've already done a simulation by hashing the parameters and checking whether that hash is already on disk? For that, something like SHA3 does seem more appropriate.
There's nothing wrong with using a different function for your use case though, is there? A generic fallback like e.g. my_hash(a::Any) = my_hash(transcode(UInt8, string(a))) (or maybe serialize(a) instead) seems appropriate, but my guess would be that you don't actually need to incorporate ALL of your input parameters (like arrays) into your cache key. You're bound to generate those arrays somehow, right? You can either save the arrays and reference the name they're saved under in your hash, or save the input parameters used for array generation and hash those instead. Both approaches should keep your research perfectly reproducible, while also avoiding the pitfalls you've mentioned.
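A minimal sketch of that kind of generic fallback, using Serialization to get bytes out of an arbitrary object (my_hash is just a placeholder name):

```julia
using SHA, Serialization

my_hash(bytes::Vector{UInt8}) = sha3_256(bytes)

# Generic fallback: serialize any object to bytes, then hash the bytes.
# Caveat: the serialization format is not guaranteed to be stable across
# Julia versions, which works against the platform-independence requirement.
function my_hash(a)
    io = IOBuffer()
    serialize(io, a)
    return my_hash(take!(io))
end
```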
Base.hash has to make different tradeoffs too - it's a general-purpose hash and not supposed to be a cryptographic hash like SHA3. The whole SHA suite is available as the SHA stdlib, so if you really want to use it you're free to use SHA.sha3_256(string(my_data)) as your file name. Unlike the SHA family of hashing functions, whose output is supposed to be more or less indistinguishable from random, Base.hash is meant for very general hashing in Dict or Set. Computing a whole SHA there would be less than desirable, simply because the output state is much too large for efficient memory use in those collections (you want to use as little memory as possible while also producing as few collisions as possible, which sadly requires making assumptions about your data or optimizing for the average case). That's why I asked what you're doing with it.
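Concretely, the file-name-based cache discussed above could look something like this sketch (cached_run, run_simulation, and params are placeholders, and it assumes string(params) is an unambiguous representation of the parameters):

```julia
using SHA, Serialization

# Derive the cache file name from a hash of the parameters and
# reuse the stored result if that file already exists.
function cached_run(run_simulation, params; cachedir = "cache")
    key  = bytes2hex(sha3_256(string(params)))
    path = joinpath(cachedir, key * ".jls")
    isfile(path) && return deserialize(path)   # cache hit
    result = run_simulation(params)            # cache miss: compute and store
    mkpath(cachedir)
    serialize(path, result)
    return result
end

# Usage: result = cached_run(my_simulation, (n = 100, seed = 42))
```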
Ohhh, how deep do you want to go? We used to hash every element in every AbstractArray. This led to a situation where you could hang a Julia session to the heat death of the universe just by putting the range 1:2^60 into a dictionary (#5778) or trying to hash sprand(10^8, 10^8, 10^-8). So ranges got special cased (at the expense of equality) (#6084). And then we started hashing the run-length encoding of arrays to support sparse matrices (#10167). But the inequality between ranges and arrays was a major thorn… so we moved to the diff of the RLE to support linear spacing of elements (#16401). But this meant that hashing required elements to support - (and widening, too), or be numeric, or figure out that they don't support it and fall back to a different form of hashing - it was a major can of worms.
So that path led us to the idea of only hashing some distinct elements. The exact scheme to select which elements to use went through a lot of permutations, but the basic motivation was:
I've only noticed this now, but isn't this strictly more powerful than hashing…? After all, serialization is the process of creating an exact & reversible representation. This is not the case with hashes, because they collide - they're not a bijection, by design.