Is this a bug of the Julia function "unique"?

leon · October 21, 2021, 10:41am

I noticed that NaN values in the data mess up Julia’s unique function.

For example, I have a matrix A with 3 duplicate rows:

julia> A = [1 NaN 3; 1 NaN 3; 1 NaN 3]
3×3 Matrix{Float64}:
 1.0  NaN  3.0
 1.0  NaN  3.0
 1.0  NaN  3.0

If we apply unique to it as below:

B = unique(A, dims=1)

We would expect B to look like this:

1×3 Matrix{Float64}:
 1.0  NaN  3.0

But in reality, the B that Julia generates is:

4×3 Matrix{Float64}:
 1.0  NaN  3.0
 1.0  NaN  3.0
 1.0  NaN  3.0
 1.0  NaN  3.0

Sukera · October 21, 2021, 11:13am

I suspect this is because

julia> NaN == NaN
false

This is just regular floating-point behavior. What I find more disturbing is this:

julia> unique(A, dims=2)
3×4 Matrix{Float64}:
 1.0  NaN  NaN  3.0
 1.0  NaN  NaN  3.0
 1.0  NaN  NaN  3.0

That should definitely not happen. Do you mind opening an issue for this on the issue tracker?

goerch · October 21, 2021, 11:13am

Without words;)

julia> NaN == NaN
false

woclass · October 21, 2021, 11:14am

julia> A = [1 NaN 3; 1 NaN 3; 1 NaN 3]
3×3 Matrix{Float64}:
 1.0  NaN  3.0
 1.0  NaN  3.0
 1.0  NaN  3.0

julia> unique(A)
3-element Vector{Float64}:
   1.0
 NaN
   3.0

:rofl:

Sukera · October 21, 2021, 11:17am

The single-argument version probably just iterates and uses isequal, while the dims version is more complicated and creates a hash per dimension. It’s unclear to me whether that is intended behavior for the multidimensional version, though it does make sense when you take NaN != NaN into account.
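
For anyone following along, here is a minimal sketch of the difference between the two comparisons. It uses only Base functions; the vector v is made-up example data, and the snippet says nothing about how the dims version is implemented.

```julia
# == follows IEEE 754, so NaN never compares equal, not even to itself
println(NaN == NaN)           # false

# isequal treats NaN as equal to NaN
println(isequal(NaN, NaN))    # true

# The single-argument unique collapses repeated NaN values
v = [1.0, NaN, NaN, 3.0]      # made-up example data
u = unique(v)
println(length(u))            # 3
```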

leon · October 21, 2021, 12:05pm

Will do.

Thanks for pointing me in that direction.

giordano · October 21, 2021, 12:15pm

That’s what the IEEE 754 standard mandates. See the Stack Overflow question “What is the rationale for all comparisons returning false for IEEE754 NaN values?”.

goerch · October 21, 2021, 12:17pm

Sorry for being sarcastic, I knew. Similar problem to SQL NULL values.

leon · October 21, 2021, 1:01pm

I don’t think it is sarcastic. We’re doing Julia a favor after all. Anyone who truly loves Julia would want to help get these issues fixed.

jling · October 21, 2021, 1:06pm

This (NaN != NaN) is NOT a bug; the IEEE 754 standard requires it.

Don’t use NaN for this; maybe use nothing instead.

Sukera · October 21, 2021, 1:07pm

No, this is expected behavior for NaN. That unique(A, dims=1) does not use isequal for this may be an issue, but NaN != NaN is very much intended. It works the same in other languages that use IEEE floating point, though their unique may have a different interpretation.

leon · October 21, 2021, 1:10pm

I can confirm that Matlab does not have the same “unique” issue, despite the fact that NaN is also considered different from NaN:

>> NaN == NaN
ans =
  logical
   0

Sukera · October 21, 2021, 1:14pm

Right - the question is whether Julia should change its behavior here and whether that change would be breaking (meaning it could be done in 2.0 at the earliest).

Until then, may I ask what you were using that NaN for and how you encountered this? Julia has separate missing and nothing values: missing models the absence of a value (one should exist, we just don’t know it), and nothing models the knowledge of absence (i.e. there is no value to represent the result). You don’t have to rely on NaN for a purpose it was never meant for. See the docs for more information:

https://docs.julialang.org/en/v1/manual/missing/
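
As a rough sketch of what working with missing looks like, assuming a made-up vector vals in which some entries were left blank:

```julia
# Made-up measurements where two entries were left blank
vals = [1, missing, 2, missing, 1]

# The single-argument unique treats missing like any other value,
# so repeated blanks collapse into one entry
u = unique(vals)
println(length(u))                 # 3 entries: 1, missing, 2

# skipmissing drops the blanks before further processing
println(sum(skipmissing(vals)))    # 4

# ismissing tests for a blank explicitly (== would itself return missing)
println(ismissing(vals[2]))        # true
```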

rafael.guerra · October 21, 2021, 1:15pm

That NaN == NaN is false is maybe logical?
I.e., not being something does not imply being the same thing. Example:

NaN == NaN      # false 
x = 0/0         # NaN
y = Inf/Inf     # NaN
x == y          # false --> nice

leon · October 21, 2021, 1:34pm

I’m processing some oceanographic data right now. When people do a CTD cast at a particular sampling station, there is a cast number associated with it. Sometimes people leave it blank when there is only one cast.

giordano · October 21, 2021, 1:37pm

So you’re saying that you’re using NaN to represent “blank” values?

kristoffer.carlsson · October 21, 2021, 1:39pm

It is a bug because unique is documented to use isequal and

julia> isequal([1, NaN, 3], [1, NaN, 3])
true

https://github.com/JuliaLang/julia/pull/42737
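
On a Julia version without the fix, one possible workaround is to go through the single-argument unique, which does deduplicate NaN. This is only a sketch: unique_rows is a hypothetical helper, not part of Base and not what the PR does.

```julia
A = [1 NaN 3; 1 NaN 3; 1 NaN 3]

# Hypothetical helper (not part of Base): deduplicate rows with isequal semantics
function unique_rows(M::AbstractMatrix)
    # single-argument unique on a vector of row views
    rows = unique(collect(eachrow(M)))
    # hcat puts each surviving row in a column; permutedims turns them back into rows
    return permutedims(hcat(rows...))
end

B = unique_rows(A)
println(size(B))    # (1, 3)
```

The same idea with eachcol, and without the permutedims, should cover the dims=2 case.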

Sukera · October 21, 2021, 1:42pm

Great! Then I’ll also open an issue about the unique(A, dims=2) oddity.

kristoffer.carlsson · October 21, 2021, 1:42pm

Isn’t that also fixed by the PR?

Sukera · October 21, 2021, 1:44pm

I haven’t checked, but the array grows for that case:

julia> A = [1 NaN 3; 1 NaN 3; 1 NaN 3]
3×3 Matrix{Float64}:
 1.0  NaN  3.0
 1.0  NaN  3.0
 1.0  NaN  3.0

julia> unique(A, dims=2)
3×4 Matrix{Float64}:
 1.0  NaN  NaN  3.0
 1.0  NaN  NaN  3.0
 1.0  NaN  NaN  3.0

and the unit test doesn’t cover that, I guess. I don’t see why == vs. isequal should produce more values per row than existed previously.
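
A sketch of the kind of property check I have in mind. never_grows is a hypothetical helper, not something in Base or in the PR, and what it prints for the NaN matrix depends on the Julia version you run it on.

```julia
# Hypothetical helper: true if unique along dimension d returns
# no more slices than the input had along that dimension
never_grows(M::AbstractMatrix, d::Integer) =
    size(unique(M, dims=d), d) <= size(M, d)

A = [1 NaN 3; 1 NaN 3; 1 NaN 3]
for d in 1:2
    println("dims=", d, ": ", never_grows(A, d))
end
```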
