Array hashing is complicated

foobar_lv2 · January 24, 2018, 11:59pm

…if equality always checked types first, then it would not apply (but people could still break symmetry, transitivity or reflexivity for their types).

I agree; it is uncharitable and the situation in julia is somewhat better, due to the awesomeness of multiple dispatch and the thought that has been put into it (and at least none of the type-wrong equalities are as nonsensical as string to number).

Type-unsafe equality is scary and the road to hell (also I’m not sure how much inference loses due to branches like if x==1 not giving type information about x. Of course you can argue that it should be if x===1, but at least base tends to use == in cases where === would also work).

Regardless, I apologize for being too negative. Julia is way, way better than php or javascript, I am thankful that the core devs like you, jeff, keno, yuichao, etc, comment and explain here, and am even more thankful for you developing julia.

ScottPJones · January 25, 2018, 12:55am

That really is not nonsensical, it just depends on the architecture of the language.
In a language where semantically everything is a string, it makes perfect since for “0” == 0, or
1E5 == “100000”. (That’s the way the most used language for healthcare systems, at least in the US, which was the 3rd ANSI standard language, does things).

StefanKarpinski · January 25, 2018, 1:06am

The problem with an object being equal to 1, 2 and 3 at the same time is that it’s a violation of transitivity, not that objects shouldn’t be equal to integers. Saying that == should check types doesn’t solve this in any meaningful way since a user can define == to violate that recommendation just as easily as they can violate the current rule that equality should be pure and transitive. If people follow the rule then a == 1 && a == 2 cannot happen.

We tried having == and isequal check types first and it had serious problems – it does not work well in a dynamic languages like Julia. The history: in 0.2 we made values of different types unequal and then in 0.3 we went back to making equal values of different types equal with a much more disciplined approach.

It may not immediately obvious why having 1 != 1.0 is a problem in Julia, but it interacts really badly with other features of the language. One issue is the interaction with auto-conversion of values upon assignment to typed locations. For example, you can do something like this:

julia> d = Dict{Float64,Bool}()
Dict{Float64,Bool} with 0 entries

julia> d[1] = true
true

julia> d
Dict{Float64,Bool} with 1 entry:
  1.0 => true

The integer 1 is automatically converted to 1.0 because the key type of the dict is Float64. But now consider what would happen if you had isequal(1, 1.0) == false . Immediately after doing that assignment, you’d get a key error upon doing d[1] – even though that’s the exact the key you just put into the dictionary. This is because you can’t find 1.0 in a dict with a value that’s not isequal to it. Now consider a potentially more loosely type dictionary:

d[1] = true
d[1.0] = true

How many entries does the dictionary have now? Depends on its key type! If it’s Float64 or Int then you only have one; if it’s Real or Any then you have two. Fun.

There are other potential ways to solve these problems than making 1 == 1.0. We could get rid of conversion on assignment to typed locations, for example. Then you would simply get an error when doing d[1] = true for a dict with Float64 key type. But that starts to make the language feel very strict and fussy. Instead, the design we have makes this work in a coherent way and does – basically universally – what people feel like it should. A lot of thought and work has gone into making it that way.

Type-unsafe equality is scary and the road to hell

Calling what we do now “type-unsafe equality” is scaremongering and also just incorrect terminology – type safety is a real thing and has nothing to do with this. Our equality has worked this way since Julia 0.3 and it doesn’t have any obvious issues. What actual problems have you found with it?

StefanKarpinski · January 25, 2018, 1:08am

If "00" == 0 then this causes a fairly major transitivity problem since presumably you do not want to have "0" == "00".

ScottPJones · January 25, 2018, 1:26am

No, "00" is not a “canonic numeric” string, and hence is not equivalent.
Applying a numeric operator, or the unary +, will convert a non-canonic string to a canonic number
(and internally, there are numeric types, but they are just an optimization).
All arithmetic is done as decimal floating point, so that there are no conversion issues between numeric strings and internal numeric type values.

ScottPJones · January 25, 2018, 1:37am

We ran into a problem with Dict{Any,Any}, where Bool keys ended up as 0 and 1 instead of staying as separate entries as false and true. (We were trying to keep track of the values found in a JSON file, with null, true, false, being considered distinct from numbers and strings).

If Bool weren’t <: Number, then it wouldn’t have been a problem.

StefanKarpinski · January 25, 2018, 3:13am

The problem there is that Bool <: Number which we discussed changing but decided it was a bit too disruptive this late in the game. For that sort of application you should probably be using an IdDict instead.

mauro3 · January 25, 2018, 8:03am

There is some interesting reading related to equality in this (closed) pull-request: https://github.com/JuliaLang/julia/pull/18856.

Tamas_Papp · January 25, 2018, 9:13am

and of course the timeless classic of Kent M. Pitman, with the conclusion

in occasional bug reports that vendors receive, arguing that an incorrect choice has ``clearly’’ been made for how objects of a given type are compared, rather than acknowledging that the choice is really quite arbitrary. When urged to write their own equality predicate to suit their particular needs, they sometimes react as if we are putting them off, rather than realizing that any function they could write is just as valid as any one the language provides. Were the language changed to accommodate such bug reports, different users would probably complain.

I think that the best a language can do is to

provide various commonly used notions of equality with a clear conceptual model, so that the programmer can pick what’s best for each case,
make the semantics of functions that rely on some concept of equality as extensible as possible.

Ideally, hash tables like Dict could be customizable for different concepts of equality and corresponding hashes, and various trade-offs with hashing sophistication and collision, but that is something that could live in a package when someone gets around to implementing it.

ScottPJones · January 25, 2018, 11:49am

No, not at all, because we need numbers, even of different types, to be compared by their value, same with strings.

Would it really be that disruptive, if it weren’t <: Number, but appropriate conversions and operators with Number types were defined?
This would be the time to do it, along with all the other disruptive changes.

Edit: To clarify, I am not opposed to the disruptive changes, as painful as they might be at the moment,
I think that the work done now to improve the consistency and ease of use of the APIs, as well as moving stuff out of what is the “core” language, is the best thing possible for Julia in the long run.

foobar_lv2 · January 25, 2018, 12:00pm

Thanks for the explanation of the history! Just leaving this here…

Float32(e)==Float64(Float32(e))
true
Float32(pi)+Float32(e)==Float32(pi) + Float64(Float32(e))
false

Ptr{Int}(1)==Ptr{UInt8}(1)
…so pointers need to be checked with === before you deref them. I have a feeling that bugs due to type-wrong equality are much, much harder to spot than bugs due to same-value-for-my-purpose-different-type (if you want to compare addresses, reinterpret).

So some built-in types have a very… relaxed notion of equality. OTOH, default equality apparently means === on user types.

Now, in 07, === sometimes checks for identity (in the intensional sense of machine-representation) and sometimes not:

BigInt(1)===BigInt(1)
false
"foo"==="foo"
true #false on 0.6
pointer("foo")==pointer("foo")
false

This whole DWIM equality/identity makes me very nervous.

Tamas_Papp · January 25, 2018, 12:44pm

I am curious what you expected here, since

julia> (Float32(pi)+Float32(e))-(Float32(pi) + Float64(Float32(e)))
2.384185791015625e-7

StefanKarpinski · January 25, 2018, 1:38pm

Indeed, if anything, given the context of the thread, this shows that Julia is very careful to make sure that == is an equivalence relation. If this comparison was done by converting to Float32 then it would violate transitivity. Here’s another fun one:

julia> 2^53+1 == 2.0^53
false

julia> Float64(2^53+1) == 2.0^53
true

Why doesn’t == just convert to Float64 and then compare? Do you see the problem? Other fun tricks:

julia> 1/10
0.1

julia> 1/10 == 1//10
false

What?! One tenth is not representable exactly as a Float64, so 1/10 – i.e. the closest Float64 to one tenth – cannot be exactly equal to 1//10 which represents one tenth exactly. Sure, we get the occasional bug report about this, but if we didn’t do this, then we’d rapidly get PHP/JavaScript-esque violations of basic properties that == should obey (namely transitivity). There is a coherent philosophy here, and it is very strictly obeyed.

foobar_lv2:

Ptr{Int}(1)==Ptr{UInt8}(1)
…so pointers need to be checked with === before you deref them. I have a feeling that bugs due to type-wrong equality are much, much harder to spot than bugs due to same-value-for-my-purpose-different-type (if you want to compare addresses, reinterpret).

This is a judgement, but we treat the type of a pointer as an essential part of its identity unlike numbers which represents some mathematical value which is independent of their type or representation. The number 1 is the same no matter whether it’s represented as an Int or a Float64 or a Complex{UInt8}. We do not, however, treat 1 with a unit the same as 1:

julia> using Unitful: m

julia> 1 == 1m
false

julia> 1m == 1m^2
false

Pointers are a bit like united quantities – a pointer to an Int64 is different than a pointer to a Float64 even if they point at the same location. You get different things if you load from those pointers, for example.

So some built-in types have a very… relaxed notion of equality. OTOH, default equality apparently means === on user types.

There is a quite old open 1.0 issue to change the default == fallbcack. I am in favor of changing it to something a bit more intuitive and commonly wanted. The current choice of === is basically the most conservative choice possible since it will never consider two things equal unless they are programmatically indistinguishable. Unfortunately, this is often stricter than what one wants.

Now, in 07, === sometimes checks for identity (in the intensional sense of machine-representation) and sometimes not:
BigInt(1)===BigInt(1)
false
"foo"==="foo"
true #false on 0.6
pointer("foo")==pointer("foo")
false
This whole DWIM equality/identity makes me very nervous.

There’s not really any wiggle room for ===: it implements Henry Baker’s “egal” predicate, which is very well-defined. The predicate is true if two values are programmatically indistinguishable from within a language. Changing String === went hand in hand with changing its representation so that you can no longer programmatically distinguish two different String objects except by doing unsafe operations like taking pointers and writing through them. A pointer(str) call will always give you a pointer to some memory that holds bytes equal to the string you passed it, but it’s not guaranteed to give you a pointer to the exact memory of the string you passed it. In other words, the impurity here belongs to the pointer function, not the String type.

Two different BigInt values representing the same value are programmatically distinguishable because this is the representation of a BigInt:

julia> dump(big(1))
BigInt
  alloc: Int32 1
  size: Int32 1
  d: Ptr{UInt64} Ptr{UInt64} @0x00007fd3d56caff0

julia> dump(big(1))
BigInt
  alloc: Int32 1
  size: Int32 1
  d: Ptr{UInt64} Ptr{UInt64} @0x00007fd3d7b28b80

These two instances of big(1) have different pointers, therefore they are not ===. They are, of course == because they represent the same value. In a future version of Julia, we might integrate BigInts more deeply into the language and garbage collection and make big(1) an immutable value without any exposed pointers (and use a hybrid immediate representation for small integers), but we don’t do that yet and === does not lie.

So far all of these examples have been good ones, but imo, they actually show that there is a well-considered and thorough policy for equality in Julia, rather than the opposite. The === operator is not user-extensible because it has a well-defined meaning which should cannot be changed by the user: it checks programmatic indistinguishability. The == operator implements value comparison independent of representation. If you want to know if x == y should be true, consider if they represent the same abstract value or not. There is some interpretation involved for user-defined types, but if you’re implementing a number type, for example, it’s pretty clear. If you have some other kind of type, you will have to decide what it represents and think carefully on what an equivalence relation on that abstraction should look like.

ScottPJones · January 25, 2018, 2:01pm

The only place I’ve found that I don’t think works well, and seems to go against not allowing things like:
5 ? 1 : 0, is that 1 == true and 0 == false both return true (really same issue as the dict issue with Bool vs. numbers)

Edit: to clarify, I do think (with this one exception) that Julia’s concepts of == vs. === are rather well thought out, more so than I’ve seen in most other places (well, MUMPS was also well thought out, equality was always based the string representation, no matter any internal optimizations for numbers)

foobar_lv2 · January 25, 2018, 2:08pm

Float32(pi)+Float32(e)==Float32(pi) + Float64(Float32(e))
false

is correct. Automatic promotion on addition is expected. Automatic promotion on comparison is a nightmare, and I am happy that julia does not do this.

Naively one would expect isequal(a,b) to imply isequal(a+x, b+x). This does not hold, because typeof(a) != typeof(b), and hence we get different dispatches / promotions for +(a,x) and +(b,x). DWIM equality is nice for productivity in throw-away code (just compare type-vomited objects); it is dangerous with respect to weird bugs (the above extends to all numeric types like int, uint of various types, and all vectors holding such types) and makes the language “specification” extremely complex, i.e. makes actual behavior hard to predict.

I apologize for bringing up a topic that surely was discussed to death at the time. This shows how inexperienced I am in julia (I wrongly used isequal in all my code, even though I wanted typed isequal; typical mistake like people using == when they should use === in e.g. javascript or php.) Base is not very helpful here (as a source of example code to emulate), because it mostly uses == as default, and one needs to look at the code_typed in order to see whether this is a bug.

I get the following:

Ptr{Int}(1)==Ptr{Float32}(1)
true

I agree that pointers should carry their type! I was complaining that they do not!

So passing strings-pointers to mutating C libraries is not allowed?

I agree; the DWIM-equality policy is well-considered and looks surprisingly good (few weird corners I’ve seen so far) and the fact that you managed fast DWIM-hashing is impressive. I think I am not whining about the way equality works (except for pointer types) but rather about the general decision that two objects of different types can be equal at all.

ScottPJones · January 25, 2018, 2:16pm

Strings in Julia are immutable. You could do it, but it’s not at all supported, and probably would break something.

StefanKarpinski · January 25, 2018, 4:00pm

I thought they were, perhaps we should change that. Then again, I’m not sure how important this really is. We don’t exactly encourage programming with pointers in Julia.

So passing strings-pointers to mutating C libraries is not allowed?

All things are mutable in computers. If you sneak around behind the language’s back to mutate things that are supposed to be immutable, you’re on your own.

the DWIM-equality policy is well-considered and looks surprisingly good

Calling it “DWIM” seems to suggest that it has some kind of arbitrary “do what I mean” behavior. It does not: it is abstract-value-based equality, which is not DWIM at all (as shown by many of the surprising corner cases). The only judgement call is what “abstract value” something represents, but as long as you’re consistent about that, == will always be an equivalence relation since identity of abstract values is an equivalence relation.

the fact that you managed fast DWIM-hashing is impressive

Yes, that was quite tricky.

Keno · January 25, 2018, 10:44pm

The type on pointers is more of a suggestion to the unsafe_load/store functions than objective reality. In particular arithmetic is intentionally always bytewise and deliberately excluded from TBAA, so arbitrary pointer type-casting is legal (and happens implicitly via convert). It would be bit odd for the implicit conversion to be allowed, but for them to not be considered equal. Pointers are basically in the language to interface with native code.

If you want stronger notions of pointer, that’s fine, but Ptr is not it.

jeff.bezanson · January 26, 2018, 2:20am

You want to avoid involving object addresses in equality as much as possible, because it exposes implementation details. It means implementing an optimization (e.g. storing all strings equal to "foo" at the same address) or an unintentional change can break people’s programs. See for example the eq function in common lisp and python (v2?) which is totally unreliable for certain arguments because of this.

jameson · January 26, 2018, 2:22am

I don’t think we should allow pointer-conversions as freely as we do, it has just never managed to be high on my priority list. But it’s just weird to be more permissive than C.

Basically nothing. Having x == 1 mean x::Int == 1 adds vanishingly little information in a non-Hindley–Milner type-system.

Sounds somewhat logical on the surface, but wouldn’t that be terribly slow, and susceptible to changes in the string representation of an object (e.g., if I decide to change the printing of 1.0 to 1 or 1.0e0)? It feels like that definition is recursive when you want to go actually implement it.

Topic		Replies	Views
Alternate approach to O(log(N)) integer range hashing Internals & Design proposal	7	1112	February 16, 2018
Inconsistent hashing of types containing `Set` or `Array` fields General Usage	5	544	March 22, 2018
Alternatives to `Base.hash`? General Usage	14	950	April 11, 2022
Hashing of different types with same contents General Usage	2	283	February 8, 2021
Internals of types and how they hash Internals & Design	10	417	February 9, 2024

Array hashing is complicated

Related topics