Array hashing is complicated


#1

Moderator’s note: this post was a response split off from Moving BigFloat to stdlib?

I have a feeling that this is a case of putting the cart before the horse.
That PR seems to be a complex way of trying to solve a problem that doesn’t really need to be solved.

A hash is calculated as a way of quickly finding something, to avoid possibly slower equality checks between two elements. Recently, the fact that it’s OK to let hashes collide, if calculating the hash itself is not fast, has been recognized by the Julia team in a PR about hash fallbacks.

Also, does there really need to be a single compatible hash calculation that works across different types?
What if instead, there were a trait that would allow selecting a hash function for different types (combinations of types)?

It seems that there are really (at least) three types of hashes needed, one for scalar values, one for iterables, and one for ranges.

Hashes should only be used as a shortcut to avoid calling isequal if the hashes are compatible (as determined by traits/promotion rules).

For example, if you have a collection of ranges, and their hashes were compatible, then you use those hashes.
If you have a collection with both ranges and vectors, then (maybe depending on the element types), the promotion rules might provide a compatible hash function (but the fallback might be something as simple as a compatible hash of the first and last elements). Depending on the types, the only compatible hash function might be the fallback, something like compat_hash(::Type{Any}, x) = 0 (not actually defined that way, it would need to have a signature with a trait instance)

Have you thought of how arbitrary numeric types (including DecFP, ArbFloats, etc.) could be hashed and compared in a compatible fashion without having to have a type that can hold all values of both without loss (something I don’t think may be possible)?
One way that I’ve found effective (and pretty efficient) is to convert to a string form (using something like the Grisu algorithm), that can be compared in a binary fashion. There are some tricks you can do, to make them compact and compare correctly (i.e. using radix 100, having negative numbers stored with the pairs of digits being stored with a bit to indicate the end, taking care of leading and trailing 0s. etc.).
For a Dict with heterogeneous numeric keys, you can convert all of the numbers, and cache the converted forms, and hash those as binary strings.

@jeff.bezanson What is your opinion of using traits / promotion rules / different hashing functions to handle the heterogeneous collection case, without resorting to widening?

Editted: fixed stupid switch of homo/heterogeneous (shouldn’t write so much without first cup of :coffee:️!)


Moving BigFloat to stdlib?
#2

These issues are really really tricky. I know it’s far from ideal, but we haven’t found a better solution at this point which allows having ranges and arrays compare equal.

I’ll stress again that the problem only affects hashing of heterogeneous collections currently. We need BigFloat just because of that corner case, which needs to be supported in Base. You’re right that we could make this need even less frequent by using a simpler algorithm when we know we don’t need to hash ranges and arrays the same, e.g. when looking up a key in a Dict{Vector{Int}}. But in general it’s needed e.g. for Dict{Any}, so we cannot completely get rid of it either.

Returning a 0 hash as a fallback is clever, but for it to work we would need to return 0 for all values, or at least for all collections we are going to compare which could be equal (e.g. various kinds of arrays and ranges with different number types). If the key type is Any, we don’t know in advance these types, so we would have to use 0 for all collections, which could be very slow. We may as well fall back to O(N) hashing in that case.

In general the approach of falling back, for collections with an abstract element type, to something slow but which doesn’t require a wider type like BigFloat is interesting. It would require adding a parallel, faster hash method which would be used for collections with concrete types (or more precisely for collections which cannot contain both ranges and arrays). We could pass a key/element type as the first argument to hash, which would allow us to implement whatever efficient solution we want. The essential idea would be that we would compute efficiently hashes which are only valid for a certain set of key types, but which shouldn’t be reused in a different context.


#3

In those cases, you can fall back on a simpler compatible hash, as I described.

Since the whole point of hashing is to speed things up, just fallback to a simpler hash, not necessarily fall back to 0.

For example, as a total fallback for hashing on a collection, you might simply generate a hash based on the length or dimensions (and there are traits for those things).
As a total fallback for hashing something <: Number, you could use the method of converting to a canonic bytewise (based on what would be the grisu output) format, and hash that, no need to widen.

All the hash function really has to do in those cases, where something might be isequal, but a compatible hash cannot be calculated easily is not return different numbers - and so in the worst case, return 0.


#4

I don’t think we can do much by handling homogeneous containers specially; Any[1,2] is == to Int[1, 2]. The element type of the dict doesn’t matter; you still need to be able to look up == keys.

I really do want to simplify the range hashing code though. I don’t think needing to hash ranges efficiently comes up often enough. And as Scott says, we could decide to accept more collisions instead, e.g. hashing just the first and last few elements of long ranges. There’s an asymmetry: if a range and array are unequal, you can stop as soon as you hit an unequal element, so accepting more collisions might make sense.


#5

Yes, what I proposed is that you would specify the narrowest supertype of the elements to be hashed, which would allow using a specialized algorithm when you know you don’t need to handle both ranges and arrays at the same time.

So we would apply that approach to all arrays? I guess that could work, at least it wouldn’t be unreasonable to hash the e.g. 1000 entries at the start and end of arrays and ranges, or one entry for each N entries (with N varying so that the total number of entries is fixed). It’s probably very rare to hash large arrays with are almost equal, except for a few elements. OTOH if you do that handling collisions is going to be very slow… I imagine that could happen with sparse vectors.


#6

I don’t understand where this case arises. If two things are equal, they always need to hash equally. You can’t predict whether somebody will try to look up a range in a dict with array keys.


#7

How frequently do people have associative arrays using large heterogeneous arrays as keys?
Calculating very complex hash values can make things a lot slower than simply having a simpler fast hash, and doing a few more comparisons (esp. since the comparisons can exit early, unlike the hash calculations).

If you do calculate an expensive hash, then it should be stored in the Dict (something I recommended a few years ago), instead of just using it to find the correct bucket.


#8

My thought here would be even simpler:

hash(A::AbstractArray, h) = hash(size(A), hash(first_nonzero(A)), hash(last_nonzero(A)), h)

#9

Well, that case could be handled by simply giving up using a hash, and scanning all elements of the dict.
(just like in databases where you don’t have an index that fits)


#10

Upon reflection, julia’s notion of equality and hashes for dicts is kinda weird.

A=Vector{Any}()
push!(A,A)
hash(A)
ERROR: StackOverflowError:

A==A
ERROR: StackOverflowError:

A===A
true

isequal(A,A)
true

Edit: And likewise

D = Dict{Any,Any}()
D[D]=1
D[D]=1
ERROR: StackOverflowError:

Not sure what I would expect; somehow a type-safe version of isequal would be nice (but still have to deal with self-referential structures in some way, either by StackOverflowing or by trying to be clever).


#11

That’s just a bug. I don’t think there’s an issue for that — would you want to open one?


#12

I don’t think a stack overflow says anything about our “notion” of equality. Handling circular references is expensive, and therefore something you might decide not to do. We might be able to do enough tricks to make it worthwhile though.

What is the type-unsafety here?


#13

If you consider it a bug, sure, I’ll open an issue. Not sure whether a doc-fix variant or whether we really want to try to do a deepcopy-style walk.

isequal(1.0, 1)
true

That’s PHP/javascript style. Triple equal does not cut the cake:

"foo"==="foo"
false

Desired behavior of 2.5 equality signs: Check types at every comparison (recursively). Should not be more expensive because this is a nop in inferred code.

edit: For hashes that use 2.5 equality signs: mix in the type tag at every call.

edit2: I am not saying that isequal is bad, mind you. its just that is_really_equal (in the sense that e.g. sha1(serialize(x)))==sha1(serialize(y)), supposing that serialize is pure) is sometimes useful.


#14

Ah, right, I was thinking of situations like data frames joining/grouping, where you know beforehand the list of all entries you’ll have to compare. But for dicts that would not work indeed.

The current algorithm isn’t expensive in general, it only uses BigFloat/BigInt in corner cases. The main problem with it is rather that it doesn’t work for entries inside heterogeneous arrays that don’t implement widen.


#15

Ouch.

Maybe this is a misunderstanding about what isequal is for. It’s supposed to be the same as == except for NaN and -0.0. And most people want 1 == 1.0. I believe 1 and 1.0 are equal in a more-than-casual sense, and in any case comparing them is not type unsafe. You can compare ints to floats in e.g. Java and Scala, and that doesn’t make those languages type-unsafe.

You’ll be happy to know that this is true on master.

True; I think one could write a deepequal function (similar to deepcopy) that does this, and that would be nice to have.


#16

+1 deepequal


#17

But this entire thing is about

julia06> isequal(1:5, collect(1:5))
false

julia07> isequal(1:5, collect(1:5))
true

To be totally honest, after reconsideration, I really would prefer type-safe isequal, with possible exceptions for source-code literals (Int32/UInt/Float32 literals are hard to type).

Maybe recursively call typeof(a)==typeof(b) && isequal(a,b) and make == mean assert(typeof(a)==typeof(b)); isequal(a,b).
(edit: stupid idea)

On the other hand, this ship probably has sailed?

The following all look bad, bad, bad to me:

Any[3,4,5]== Int[3,4,5]
true
isequal(NaN32, NaN64)
true
NaN==NaN
false
nan1=reinterpret(Float64, -1);nan2=reinterpret(Float64, -2);
isequal(nan1,nan2)
true

Yes, IEEE wants float comparison have this weird nan behavior. I don’t like it, and would rather have a is_ieee_equal (or is_approx_equal for a version that ignores types). If we identify 1:5 and collect(1:5), then soon "0"==0, and shortly afterwards 0=="0000e234".


#18

You seriously needn’t worry that this is on the edge of a slippery slope.

If you want to see if two values are indistinguishable, use ===.

If you want to see if two values are representing the same abstract value, use ==. Ok, “abstract value” is a little open to interpretation, but it’s obvious that 0 and 0.0 are representations of the same thing and behave similarly in mathematics. Similarly for [1,2,3] and (1:3) — we are actively trying to make every AbstractArray fully supported representations of an array that works like the builtin Array and is interchangeable. Ranges are simply an optimization. Strings are definitively not the same as some abstract mathematical object.

The exception here, of course, is that IEEE demands that 0.0 == -0.0 and NaN != NaN (and even worse, a non-symmetric <). That’s the sole reason for isequal — it “patches up” these deviations to match (most) programmer expectations in sorting, dictionaries, and such. We’ve actually considered swapping which functions do the “patching up” of these IEEE behaviors: https://github.com/JuliaLang/julia/issues/19148. It’s annoying and unfortunate, but I think we’re stuck with it.

If you want to see if they’re the same type, just check typeof(x) == typeof(y).


#19

Ah, but typeof(x)==typeof(y) && x==y is not recursive.

And I admit that Int[1,2,3]==Int[1,2,3] is a useful equality, just as Any[Int[1],2] == Any[Int[1],2]; otoh, Any[Any[1],2] is genuinely different. Also the fact that dictionaries don’t change under isequal if their insertion histories differ is nice.

Oh, btw, isequal and == also behave different on missing.

That’s good to hear… just noting that https://stackoverflow.com/questions/48270127/can-a-1-a-2-a-3-ever-evaluate-to-true applies to julia as well.


#20

It applies to Julia only in as much as it applies to any language with user-extensible equality and it does not apply to ===. However, in Julia, isequal is a genuine equivalence relation on all built-in types, and should be on user-defined types if they are defined correctly. The == operator is nearly an equivalence relation except for a few “mandatory” deviations like NaN != NaN (IEEE 754) and missing == x being missing (three-valued logic). Comparing equality in Julia to equality in PHP or Javascript is uncharitable and way off the mark – few languages have put as much thought into equality as we have.