You’re free to implement Base.hash on your own type, to avoid the fallback to objectid:
hash(x[, h::UInt]) -> UInt
Compute an integer hash code such that isequal(x,y) implies
hash(x)==hash(y). The optional second argument h is a hash code to be
mixed with the result.
New types should implement the 2-argument form, typically by calling
the 2-argument hash method recursively in order to mix hashes of the
contents with each other (and with h). Typically, any type that
implements hash should also implement its own == (hence isequal) to
guarantee the property mentioned above. Types supporting subtraction
(operator -) should also implement widen, which is required to hash
values inside heterogeneous arrays.
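Following the docstring's advice, a minimal sketch for a hypothetical Point type might look like this (the type and the :Point seed symbol are assumptions for illustration):

```julia
# Hypothetical type used to illustrate the 2-argument hash pattern.
struct Point
    x::Float64
    y::Float64
end

# Mix the field hashes (and h) by calling the 2-argument hash recursively,
# seeded with a type-specific symbol so Points don't trivially collide
# with other 2-field containers.
Base.hash(p::Point, h::UInt) = hash(p.y, hash(p.x, hash(:Point, h)))

# Implement == alongside hash, so that isequal(a, b) implies
# hash(a) == hash(b), as the docstring requires.
Base.:(==)(a::Point, b::Point) = a.x == b.x && a.y == b.y
```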
There is AutoHashEquals.jl if you want to automate this a little, though you’ll have to roll your own if you want to avoid some fields.
May I ask, what’s your motivation behind “able to hash any object”? Why do you think Base.hash is likely to produce collision? Avoiding that requires knowing the distribution of your input data, which directly goes against being able to hash any object - there is no universally optimal hash function.
May I ask, what’s your motivation behind “able to hash any object”?
Right now what I have is some sort of caching mechanism. I run simulations for lots of input parameters and want to cache things on hard disk. One simple way is to save the results of a simulation to a file whose name is the hash of the parameters.
But really needing different tradeoffs from Base.hash is a recurrent issue for me.
Avoiding that requires knowing the distribution of your input data, which directly goes against being able to hash any object - there is no universally optimal hash function.
In the theoretical limit, yes; in practice there are many hash functions, like SHA3, with no known collisions.
Yes so I think there are three things that need to be done:
1. Given a blob of contiguous bytes, produce a hash. This can be done with SHA etc.
2. Given the hashes of the fields of an object and the hash of its type, mix them into a single hash.
3. Produce a hash of a type.
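For what it's worth, steps 1. and 2. can be sketched with the SHA stdlib alone, by hashing byte blobs and then feeding the per-field digests through a single streaming context (the function names here are made up for illustration):

```julia
using SHA

# Step 1: hash a blob of contiguous bytes with SHA-256.
hash_bytes(bytes::Vector{UInt8}) = sha256(bytes)

# Step 2: mix the hash of the type and the hashes of the fields into a
# single digest by streaming them all through one SHA-256 context.
function mix_hashes(type_hash::Vector{UInt8}, field_hashes::Vector{UInt8}...)
    ctx = SHA256_CTX()
    update!(ctx, type_hash)
    for fh in field_hashes
        update!(ctx, fh)
    end
    return digest!(ctx)
end
```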
I think for 1. and 2. there exist lots of good and well-known algorithms, but I am not experienced with this and don't know what to pick, so I would appreciate recommendations.
3. is a different, more Julia-specific kind of problem. I guess Base.hash uses objectid here, but that is probably not very stable. Another way would be to use the name of the type, but one needs to be careful about the module in which the type is defined. Not 100% sure how to do this.
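One possible sketch of the "name of the type plus its module" idea (this ignores type parameters, which would need recursive handling, and it breaks if a type is moved or renamed):

```julia
# Build a string identifier for a type from its defining module and its
# name, e.g. "Base.Dict". Type parameters are deliberately ignored here;
# a real implementation would have to handle them recursively.
function type_identifier(T::DataType)
    return string(parentmodule(T), '.', nameof(T))
end
```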
@haberdashPI wrote a package StableHashTraits.jl that does some of this. It’s not perfect at avoiding collisions, e.g. the function sin and the string "Base.sin" hash the same, but it’s pretty useful nonetheless. It’s not open source yet but should be soon (in the next few days hopefully).
Ok, reading through @edit hash(x1) yeah it doesn’t take all elements into account for performance reasons. Seems like that code is 4+ years old by now, written by @mbauman - maybe you can shed some light on this? Could this be changed due to better vectorization by now?
I see! So you’re checking whether you’ve already done a simulation by hashing the parameters and checking whether that hash is on disk already? For that something like sha3 does seem more appropriate
There’s nothing wrong with using a different function for your use case though, is there? A generic fallback like e.g. my_hash(a::Any) = my_hash(transcode(UInt8, string(a))) (or maybe serialize(a) instead) seems appropriate, but my guess would be that you don’t actually need to incorporate ALL of your input parameters (like arrays) into your cache key. You’re bound to generate those arrays somehow, right? You can either save the arrays and reference the name they’re saved under in your hash, or save the generation parameters for the arrays and hash those instead. Both approaches should keep your research perfectly reproducible, while also avoiding the pitfalls you’ve mentioned.
Base.hash has to make different tradeoffs too - it’s a general-purpose hash for use in Dict or Set, not a cryptographic hash like SHA3, which is supposed to produce more or less random output. The whole SHA suite is available as the SHA stdlib, so if you really want to use it you’re free to use SHA.sha3_256(string(my_data)) as your file name. For Dict and Set, computing a whole SHA digest may be less than desirable, simply because the output state is much too large for efficient memory use of those collections (you want to use as little memory as possible while also producing as few collisions as possible; this sadly requires making assumptions about your data or optimizing for the average case). That’s why I asked what you’re doing with it.
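For the caching use case, a sketch of deriving a file name that way might look like this (the serialize-to-bytes step is one possible choice, not a recommendation - Serialization's format is not guaranteed stable across Julia versions, so the cache may invalidate on upgrades):

```julia
using SHA, Serialization

# Derive a cache file name from the simulation parameters: serialize
# them to bytes, take a SHA3-256 digest, and hex-encode it. Note that
# Serialization's byte format can change between Julia versions.
function cache_filename(params)
    buf = IOBuffer()
    serialize(buf, params)
    return bytes2hex(sha3_256(take!(buf))) * ".bin"
end
```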
Ohhh, how deep do you want to go? We used to hash every element in every AbstractArray. This led to a situation where you could hang a Julia session to the heat death of the universe just by putting the range 1:2^60 into a dictionary (#5778) or trying to hash sprand(10^8, 10^8, 10^-8). So ranges got special cased (at the expense of equality) (#6084). And then we started hashing the run-length encoding of arrays to support sparse matrices (#10167). But the inequality between ranges and arrays was a major thorn… so we moved to the diff of the RLE to support linear spacing of elements (#16401). But this meant that hashing required elements to support - (and widening, too) or be numeric or figure out that they don’t support it to do a different form of hashing and it was a major can of worms.
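As an aside, the constraint being juggled in that history can be seen directly in current Julia: ranges and arrays with the same elements are isequal, so they must hash equal, yet hashing a huge range must not materialize it:

```julia
# Ranges and dense arrays with the same elements compare isequal, so
# Base.hash must give them the same value without iterating the range.
@assert isequal(1:3, [1, 2, 3])
@assert hash(1:3) == hash([1, 2, 3])

# Hashing a huge range completes immediately thanks to the special-cased
# scheme, instead of touching 2^60 elements.
@assert hash(1:2^60) isa UInt
```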
So that path led us to the idea of only hashing some distinct elements. The exact scheme to select which elements to use went through a lot of permutations, but the basic motivation was:
I’ve only noticed this now, but isn’t serialization strictly more powerful than hashing…? After all, serialization is the process of creating an exact & reversible representation. This is not the case with hashes, because they collide; they’re not a bijection, by design.