Possible bug in unique/Set

StefanKarpinski · August 16, 2017, 2:28pm

I should perhaps make a few points here. First of all, this has nothing to do with floating-point precision or rounding. It has everything to do with signed zeros being special and weird in the IEEE floating-point standards.

There are three standard equality relations in Julia:

=== – object identity / programmatic distinguishability (aka “egal”);
== / < – “intuitive” / IEEE equality and comparison;
isequal / isless – hashing & sorting.

The === relation is not user-definable and is guaranteed to be a true equivalence relation.

The == is not an equivalence relation because IEEE dictates this whacky nonsense with NaN != NaN. It also has -0.0 == 0.0 and !(-0.0 < 0.0) (thank goodness, if these disagreed it would be a true disaster). The == and < functions are user definable, so people do whatever they want, but == should generally be an equivalence relation other than NaN-like values, and it should be compatible with <, which should be a partial pre-order.

The isequal relation is not generally meant to be used directly, but it is used for hashing and sorting. The isequal predicate is an equivalence relation (transitive, reflexive), and it must be compatible with isless. In addition to the usual non-reflexive == relation, IEEE also defines a total ordering which is actually an equivalence relation and orders -0.0 strictly before 0.0 (i.e. not equal). For floating-point values, the isless and isequal relations generally follow this IEEE total ordering. Here’s where it gets “interesting”:

Since we use isless for sorting and IEEE dictates that -0.0 sorts before 0.0 we must have isless(-0.0, 0.0), otherwise -0.0 and 0.0 would end up randomly scrambled in the middle of sorted vectors.
Because isequal is compatible with isless – i.e. for all x and y of the same type, isless(x, y) or isequal(x, y) or isless(y, x) but not more than one – we must therefore have !isequal(-0.0, 0.0).

Thus the IEEE standard seems to require that -0.0 and 0.0 hash differently. There are some obvious questions…

Q: Why not just use different relations for hashing and sorting?

A: Hashing and sorting are often used as implementation details behind similar data structures. For example, a Dict uses a hash table – i.e. hash and isless – but a SortedDict probably uses a tree which uses the isless comparison predicate. If we used different relations for these data structures, then switching from a Dict to a SortedDict would not only change the key ordering (the obvious change), but it would also subtly change the bucketing behavior. Say, for example, we sorted -0.0 before 0.0 but considered them equal for hashing. Then they would be in the same bucket in a Dict but in different buckets in a SortedDict. That sort of subtle behavior change would far worse than the current consistent separation of -0.0 and 0.0.

Q: Why not use something besides hashing for sets? Isn’t this allowing the implementation detail that Set is implemented using Dict leak through?

A: We could use a different structure, but however we do it, equivalence has to be decided by a reflexive and transitive relation. Of the three equality-like relations in the language, only === and isequal are true equivalence relations. == is not because of NaN (at least). The ObjectIdDict type uses === but this is generally not what you want since, e.g. 1 !== 1.0, [1,2] !== [1,2]. === is not a very user-friendly relation, despite its fundamental correctness. That leaves isequal. The fact that a Dict is a fast, efficient way to bucket things with respect to the isequal relation makes Dict a sensible choice for implementing sets. But we could just as well use a vector and just linearly scan through to see if any value in the vector is already isequal to a new element – that would be slow but it would work similarly with respect to -0.0 and 0.0. So using isequal for bucketing isn’t really an implementation detail of Dict at all – it’s the only viable choice.

Q: Why not just ignore IEEE and have isequal(-0.0, 0.0) and !isless(-0.0, 0.0)? Let -0.0 and 0.0 be scrambled in the middle of sorted arrays – does anyone really care? Or hack in a pass at the beginning of sorting type floating-point arrays that separates values by sign first so they’re generally ordered correctly even though isless doesn’t dictate that they must be.

A: We’ve very seriously considered this. IEEE was very clearly not designed with the consequences for programming language semantics in mind. The fact that the standard equality relations is not reflexive is pretty terrible. The fact that -0.0 and 0.0 are equal – but not really! – is, in some ways, even worse (as you’re seeing here). So we could just say that -0.0 and 0.0 sort and hash the same in Julia, to hell with IEEE.

That would leave sorting in a bit of a strange situation, however. We would like for sorting a Vector{Float64} and sorting a Vector{Any} that happens to only contain values that are either Float64 or isequal to them to sort in indistinguishable orders. However, the former uses quicksort for speed – since you don’t need a stable sorting algorithm for floating-point values. Since NaNs are all isequal, we want them to end up in the same order they were originally in (to have the same effect as a stable sort). To accomplish this, we stably move the NaN values to the end of the Vector{Float64} as a first pass before sorting the rest of the numbers. It’s unclear what one would do about signed zeros if they didn’t compare differently. If we did nothing then the signed zeros in the middle of the sorted vector would be the only thing leaking the details of the unstable sorting algorithm at the end. We could try to do something like we do for NaNs, but it’s unclear how to do that efficiently – unlike NaNs, you don’t know in advance where in an array the zeros are supposed to end up. We could also sort signed zeros by their sign as a post-processing pass, but then we give up the equivalence of sorting a Vector{Float64} and a Vector{Any} with the same values in it. Perhaps that doesn’t matter, however.

This is a somewhat unlikely end point for the seemingly innocuous questions “Why do sets consider -0.0 and 0.0 to be different? Shouldn’t they be considered the same?” Despite concerns about deviating from IEEE and the sorting of signed zeros, we could very possibly make isequal(-0.0, 0.0). But this is not a small decision and it’s unclear that such a change wouldn’t lead to just as many problems and confusions – just different ones.

See https://github.com/JuliaLang/julia/issues/18485 where this option was decided against in a previous release cycle. That doesn’t mean this decision is final, but I’m not sure it’s a can of worms we want to reopen.