Are there packages that implement alternatives to Base.hash? I am interested in a somewhat different set of tradeoffs than what Base.hash provides:
Should be able to hash any object (like Base.hash)
Should have a negligible probability of producing collisions (unlike Base.hash).
Can be slower than Base.hash; the calculated hash can take more bytes than a UInt.
Should depend on object structure, not object identity. For instance, Base.hash(MyMutable(1)) != Base.hash(MyMutable(1)) for a mutable type, but I want these two hashes to be equal.
Should be platform-independent. If I call hash(MyObject(42)), the result should be the same on different computers, days, operating systems, and Julia versions (but same package versions).
Does a package exist that provides such a hash, or something close to it? If not, are there recommendations for an algorithm to implement?
You're free to implement Base.hash on your own type, to avoid the fallback to objectid:
help?> hash
search: hash hasmethod haskey hasfield hasproperty skipchars Threads
hash(x[, h::UInt]) -> UInt
Compute an integer hash code such that isequal(x,y) implies
hash(x)==hash(y). The optional second argument h is a hash code to be
mixed with the result.
New types should implement the 2-argument form, typically by calling
the 2-argument hash method recursively in order to mix hashes of the
contents with each other (and with h). Typically, any type that
implements hash should also implement its own == (hence isequal) to
guarantee the property mentioned above. Types supporting subtraction
(operator -) should also implement widen, which is required to hash
values inside heterogeneous arrays.
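For example, a minimal sketch along those lines (MyMutable is a stand-in for your own type): by default a mutable struct hashes via objectid, so two structurally equal instances hash differently; defining == and the 2-argument hash restores structural hashing.

```julia
# Hypothetical example type. Without the methods below, hash falls back to
# objectid, so hash(MyMutable(1)) != hash(MyMutable(1)).
mutable struct MyMutable
    x::Int
end

# Define == field-by-field and a hash that is consistent with it,
# mixing in a type tag so different wrapper types don't collide.
Base.:(==)(a::MyMutable, b::MyMutable) = a.x == b.x
Base.hash(a::MyMutable, h::UInt) = hash(a.x, hash(:MyMutable, h))

hash(MyMutable(1)) == hash(MyMutable(1))  # now true
```

Note that this still goes through Base.hash for the fields, so the result is not guaranteed to be stable across Julia versions; it only fixes the structure-vs-identity point.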
There is AutoHashEquals.jl if you want to automate this a little, though you'll have to roll your own if you want to exclude some fields.
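If I remember its interface correctly, that looks roughly like the sketch below (Params is a made-up type); the macro defines == and hash over all fields of the struct.

```julia
using AutoHashEquals

# Defines field-wise == and hash for the struct, so structurally
# equal instances compare and hash alike.
@auto_hash_equals struct Params
    a::Int
    b::Vector{Float64}
end

hash(Params(1, [2.0])) == hash(Params(1, [2.0]))  # true
```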
May I ask, what's your motivation behind “able to hash any object”? Why do you think Base.hash is likely to produce collisions? Avoiding that requires knowing the distribution of your input data, which directly goes against being able to hash any object - there is no universally optimal hash function.
May I ask, what's your motivation behind “able to hash any object”?
Right now I have a caching mechanism of sorts. I run simulations for lots of input parameters and want to cache things on disk. One simple way is to save the results of a simulation to a file whose name is the hash of the parameters.
But really needing different tradeoffs from Base.hash is a recurrent issue for me.
Avoiding that requires knowing the distribution of your input data, which directly goes against being able to hash any object - there is no universally optimal hash function.
In the theoretical limit, yes; in practice there are many hash functions, like SHA3, with no known collisions.
SHA and its variants are only defined for String (or, equivalently, Vector{UInt8}). The hard part isn't a good hashing algorithm, it's extending it to be able to hash anything.
Yes, so I think there are three things that need to be done:
1. Given a blob of contiguous bytes, provide a hash. This can be done with SHA etc.
2. Given the hashes of the fields of an object and the hash of its type, mix them into a single hash.
3. Produce a hash of a type.
I think for 1. and 2. there exist lots of good and well-known algorithms, but I am not experienced with this and don't know what to pick, so I would appreciate recommendations.
3. is a different kind of problem, more Julia-specific. I guess Base.hash uses objectid here, but that is probably not very stable. Another way would be to use the name of the type, but one needs to be careful about the module in which the type is defined. Not 100% sure how to do this.
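For what it's worth, here is a rough sketch of what those three steps could look like on top of the SHA stdlib; stable_hash and stable_bytes are hypothetical helper names, not an existing package API, and plenty of corner cases (BigInt, Dicts, circular references, endianness) are ignored.

```julia
using SHA

# 1. Given a blob of contiguous bytes, produce a hash (here SHA-256,
#    but any of the SHA family would do).
stable_hash(bytes::Vector{UInt8}) = sha2_256(bytes)

# Leaf values: take their raw bytes. Assumes bits types and a fixed byte
# order; for true platform independence, normalize endianness first.
stable_bytes(x::Union{Integer,AbstractFloat}) = collect(reinterpret(UInt8, [x]))
stable_bytes(s::AbstractString) = collect(codeunits(String(s)))

# 3. Hash a type by its fully qualified name instead of objectid, so the
#    result survives across sessions, machines, and Julia versions.
stable_bytes(::Type{T}) where {T} =
    collect(codeunits(string(parentmodule(T), '.', nameof(T))))

stable_hash(x::Union{Integer,AbstractFloat,AbstractString}) = sha2_256(stable_bytes(x))

# 2. Mix the type hash and the hashes of the fields into a single digest.
function stable_hash(x)
    ctx = SHA2_256_CTX()
    update!(ctx, stable_bytes(typeof(x)))
    for name in fieldnames(typeof(x))
        update!(ctx, stable_hash(getfield(x, name)))
    end
    return digest!(ctx)
end

# Containers need their own rule: hash the element hashes in order.
function stable_hash(v::AbstractArray)
    ctx = SHA2_256_CTX()
    update!(ctx, stable_bytes(typeof(v)))
    for el in v
        update!(ctx, stable_hash(el))
    end
    return digest!(ctx)
end
```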
@haberdashPI wrote a package StableHashTraits.jl that does some of this. It's not perfect at avoiding collisions, e.g. the function sin and the string "Base.sin" hash the same, but it's pretty useful nonetheless. It's not open source yet but should be soon (in the next few days hopefully).
Ok, reading through @edit hash(x1): yeah, it doesn't take all elements into account, for performance reasons. Seems like that code is 4+ years old by now, written by @mbauman - maybe you can shed some light on this? Could this be changed by now, thanks to better vectorization?
I see! So you're checking whether you've already done a simulation by hashing the parameters and checking whether that hash is already on disk? For that, something like SHA3 does seem more appropriate.
There's nothing wrong with using a different function for your use case though, is there? A generic fallback like e.g. my_hash(a::Any) = my_hash(transcode(UInt8, string(a))) (or maybe serialize(a) instead) seems appropriate, but my guess would be that you don't actually need to incorporate ALL of your input parameters (like arrays) into your cache key. You're bound to generate those arrays somehow, right? You can either save the arrays and reference the name they're saved under in your hash, or save the input parameters used for array generation and hash those instead. Both approaches should keep your research perfectly reproducible, while also avoiding the pitfalls you've mentioned.
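A minimal sketch of that kind of generic fallback, using Serialization to get bytes out of an arbitrary object (my_hash is just a placeholder name):

```julia
using SHA, Serialization

my_hash(bytes::Vector{UInt8}) = sha3_256(bytes)

# Generic fallback: serialize any object to bytes, then hash the bytes.
# Caveat: the serialization format is not guaranteed to be stable across
# Julia versions, which works against the platform-independence requirement.
function my_hash(a)
    io = IOBuffer()
    serialize(io, a)
    return my_hash(take!(io))
end
```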
Base.hash has to make different tradeoffs too - it's a general-purpose hash and not supposed to be a cryptographic hash like SHA3. The whole SHA suite is available as the SHA stdlib, so if you really want to use it you're free to use SHA.sha3_256(string(my_data)) as your file name. Unlike the SHA family of hashing functions, whose output is supposed to be more or less indistinguishable from random, Base.hash is meant for very general hashing in Dict or Set. Computing a whole SHA there would be less than desirable, simply because the output state is much too large for efficient memory use in those collections (you want to use as little memory as possible while also producing as few collisions as possible, which sadly requires making assumptions about your data or optimizing for the average case). That's why I asked what you're doing with it.
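Concretely, the file-name-based cache discussed above could look something like this sketch (cached_run, run_simulation, and params are placeholders, and it assumes string(params) is an unambiguous representation of the parameters):

```julia
using SHA, Serialization

# Derive the cache file name from a hash of the parameters and
# reuse the stored result if that file already exists.
function cached_run(run_simulation, params; cachedir = "cache")
    key  = bytes2hex(sha3_256(string(params)))
    path = joinpath(cachedir, key * ".jls")
    isfile(path) && return deserialize(path)   # cache hit
    result = run_simulation(params)            # cache miss: compute and store
    mkpath(cachedir)
    serialize(path, result)
    return result
end

# Usage: result = cached_run(my_simulation, (n = 100, seed = 42))
```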
Ohhh, how deep do you want to go? We used to hash every element in every AbstractArray. This led to a situation where you could hang a Julia session to the heat death of the universe just by putting the range 1:2^60 into a dictionary (#5778) or trying to hash sprand(10^8, 10^8, 10^-8). So ranges got special cased (at the expense of equality) (#6084). And then we started hashing the run-length encoding of arrays to support sparse matrices (#10167). But the inequality between ranges and arrays was a major thorn… so we moved to the diff of the RLE to support linear spacing of elements (#16401). But this meant that hashing required elements to support - (and widening, too), or be numeric, or figure out that they don't support it and fall back to a different form of hashing - it was a major can of worms.
So that path led us to the idea of only hashing some distinct elements. The exact scheme to select which elements to use went through a lot of permutations, but the basic motivation was:
I've only noticed this now, but isn't this strictly more powerful than hashing…? After all, serialization is the process of creating an exact & reversible representation. This is not the case with hashes, because they collide - they're not a bijection, by design.