Is this a bug in the Julia function "unique"?

I notice that any NaN values in the data break Julia’s unique function.

For example, I have a matrix A with 3 duplicate rows:

julia> A = [1 NaN 3; 1 NaN 3; 1 NaN 3]
3×3 Matrix{Float64}:
 1.0  NaN  3.0
 1.0  NaN  3.0
 1.0  NaN  3.0

If we apply unique to it as below:

B = unique(A, dims=1)

We would expect B to be:

1×3 Matrix{Float64}:
 1.0  NaN  3.0

But in reality, Julia returns this for B:

4×3 Matrix{Float64}:
 1.0  NaN  3.0
 1.0  NaN  3.0
 1.0  NaN  3.0
 1.0  NaN  3.0

I suspect this is because

julia> NaN == NaN
false

which is just regular floating-point behavior. What I find more disturbing is this:

julia> unique(A, dims=2)
3×4 Matrix{Float64}:
 1.0  NaN  NaN  3.0
 1.0  NaN  NaN  3.0
 1.0  NaN  NaN  3.0

That should definitely not happen. Do you mind opening an issue for this on the issue tracker?

No words needed ;)

julia> NaN == NaN
false
julia> A = [1 NaN 3; 1 NaN 3; 1 NaN 3]
3×3 Matrix{Float64}:
 1.0  NaN  3.0
 1.0  NaN  3.0
 1.0  NaN  3.0

julia> unique(A)
3-element Vector{Float64}:
   1.0
 NaN
   3.0

:rofl:

The single-argument version probably just iterates and uses isequal, while the dims version is more complicated and creates a hash for each slice along the given dimension. It’s unclear to me whether that is intended behavior for the multidimensional version, though it does make sense when you take NaN != NaN into account :thinking:
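For what it's worth, here is a minimal sketch of a row-wise unique that compares rows with isequal semantics by going through a Set (Set membership uses hash and isequal). The name unique_rows is made up for illustration. This is not how Base implements the dims version, only a possible workaround under those assumptions:

```julia
# Hypothetical helper (not part of Base): keep the first occurrence of each row,
# comparing rows with isequal semantics so that NaN matches NaN.
function unique_rows(A::AbstractMatrix)
    seen = Set{Vector{eltype(A)}}()   # Set membership uses hash + isequal
    keep = Int[]                      # indices of the rows to keep
    for i in axes(A, 1)
        row = A[i, :]                 # copy of row i as a Vector
        if !(row in seen)
            push!(seen, row)
            push!(keep, i)
        end
    end
    return A[keep, :]
end

A = [1 NaN 3; 1 NaN 3; 1 NaN 3]
B = unique_rows(A)
println(size(B))   # (1, 3)
```

The same loop over axes(A, 2) with A[:, j] should work for columns.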

Will do.

Thanks for pointing me in that direction.

That’s what the IEEE 754 standard mandates; see the Stack Overflow question "What is the rationale for all comparisons returning false for IEEE754 NaN values?".

Sorry for being sarcastic; I knew that. It is a similar problem to NULL values in SQL.

I don’t think it is sarcastic. We’re doing Julia a favor after all. Anyone who truly loves Julia would want to help get these issues fixed.

This (NaN != NaN) is NOT a bug; the IEEE 754 standard requires it.

Don’t use NaN for this; maybe use nothing instead.

No, this is expected behavior for NaN. That unique(A, dims=1) does not use isequal for this may be an issue, but NaN != NaN is very much intended. It works the same in other languages that use IEEE floating-point numbers, though their unique may have a different interpretation.
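To make the distinction concrete, here is a small sketch contrasting == with isequal on NaN, both for scalars and for whole rows. The variable names are only for illustration:

```julia
# == follows IEEE 754: NaN compares unequal to everything, including itself.
nan_eq = NaN == NaN                 # false

# isequal is the comparison used by hashing-based containers; it treats NaN as equal to NaN.
nan_isequal = isequal(NaN, NaN)     # true

# hash is consistent with isequal, so NaN always hashes to the same value.
nan_hash_same = hash(NaN) == hash(NaN)

# The same split shows up when comparing whole rows that contain NaN.
row = [1.0, NaN, 3.0]
rows_eq = row == copy(row)               # false, because of the NaN entry
rows_isequal = isequal(row, copy(row))   # true

println((nan_eq, nan_isequal, nan_hash_same, rows_eq, rows_isequal))
```

So a unique that deduplicates with == will never merge rows containing NaN, while one based on isequal will.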

I can confirm that Matlab does not have the same “unique” issue, despite the fact that NaN is also considered different from NaN:

>> NaN == NaN
ans =
  logical
   0

Right - the question is whether Julia should change its behavior here and whether that change would be breaking (meaning it could be done in 2.0 at the earliest).

Until then, may I ask what you were using that NaN for and how you encountered this? Julia has separate missing and nothing values: missing models the absence of a value (one should exist, we just don’t know it), and nothing models the knowledge of absence (i.e. there is no value to represent the result). Julia does not need NaN for a purpose it was never meant to serve. See the docs for more information:

https://docs.julialang.org/en/v1/manual/missing/
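As a rough sketch of what that could look like for blank entries, here is a made-up vector of cast numbers that uses missing instead of NaN. The data and variable names are hypothetical:

```julia
# Hypothetical cast numbers; `missing` marks a cast number that was left blank.
casts = [1, missing, missing, 2]

# == propagates missing, so this comparison is itself missing (neither true nor false).
cmp = missing == missing

# isequal treats missing as equal to missing, so unique collapses the duplicates.
u = unique(casts)

# skipmissing iterates over the known values only.
known = collect(skipmissing(casts))

println(u)
println(known)
```

Note that == on missing has its own three-valued behavior, so isequal (or ismissing) is still the right tool for checking whether an entry is blank.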

That NaN == NaN is false is maybe logical?
I.e., not being something does not imply being the same thing. Example:

NaN == NaN      # false 
x = 0/0         # NaN
y = Inf/Inf     # NaN
x == y          # false --> nice

I’m processing some oceanographic data right now. When people do a CTD cast at a particular sampling station, there is a cast number associated with it. Sometimes, people leave it blank when there is only one cast.

So you’re saying that you’re using NaN to represent “blank” values?

It is a bug because unique is documented to use isequal and

julia> isequal([1, NaN, 3], [1, NaN, 3])
true

https://github.com/JuliaLang/julia/pull/42737

Great! Then I’ll also open an issue about the unique(A, dims=2) oddity.

Isn’t that also fixed by the PR?

I haven’t checked, but the array grows for that case:

julia> A = [1 NaN 3; 1 NaN 3; 1 NaN 3]
3×3 Matrix{Float64}:
 1.0  NaN  3.0
 1.0  NaN  3.0
 1.0  NaN  3.0

julia> unique(A, dims=2)
3×4 Matrix{Float64}:
 1.0  NaN  NaN  3.0
 1.0  NaN  NaN  3.0
 1.0  NaN  NaN  3.0

and the unit test doesn’t cover that :man_shrugging: I guess I don’t see why == vs isequal should produce more values per row than existed previously.
