Error with `antijoin` when encountering `-0.0` value?

Hi! Suppose I have a DataFrame where some of the entries have been rounded to -0.0.

julia> df1 = DataFrame(x = [1, 0, -0.0])
3×1 DataFrame
 Row │ x       
     │ Float64 
─────┼─────────
   1 │     1.0
   2 │     0.0
   3 │    -0.0

julia> df2 = DataFrame(x = [ 1, 0, 0])
3×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     0
   3 │     0

When I try

julia> antijoin(df1,df2, on=(:x))

rather than an empty DataFrame, the following error is raised:

ERROR: ArgumentError: currently for numeric values NaN and `-0.0` in their real or imaginary components are not allowed. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.
Stacktrace:
 [1] DataFrames.DataFrameJoiner(dfl::DataFrame, dfr::DataFrame, on::Symbol, matchmissing::Symbol, kind::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/LteEl/src/join/composer.jl:94
 [2] _join(df1::DataFrame, df2::DataFrame; on::Symbol, kind::Symbol, makeunique::Bool, indicator::Nothing, validate::Tuple{Bool, Bool}, left_rename::typeof(identity), right_rename::typeof(identity), matchmissing::Symbol, order::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/LteEl/src/join/composer.jl:497
 [3] #antijoin#674
   @ ~/.julia/packages/DataFrames/LteEl/src/join/composer.jl:1488 [inlined]
 [4] top-level scope
   @ REPL[260]:1

However, when I use CategoricalArrays as the error message suggests, I get the following result:

julia> df4 = DataFrame(x = categorical([1,0,-0.0]))
3×1 DataFrame
 Row │ x    
     │ Cat… 
─────┼──────
   1 │ 1.0
   2 │ 0.0
   3 │ -0.0

julia> antijoin(df4,df2, on=(:x))
1×1 DataFrame
 Row │ x    
     │ Cat… 
─────┼──────
   1 │ -0.0

As opposed to an empty DataFrame. I’m a little confused by this because

julia> df4.x[3] == df2.x[2]
true

and

julia> 0.0 == -0.0
true

Just wondering if anyone could provide some insight as to where my understanding went wrong? Thanks!

Maybe there’s a small documentation “bug”: the antijoin documentation only says the following about the comparison function:

  • matchmissing : …; isequal is used for comparisons of
    rows for equality

Presumably this last sentence concerns all comparisons (including rows without missing values) and should be in the general part of the docstring, rather than the matchmissing bullet point. Maybe @bkamins can confirm.

Anyway, this explains the observed behavior:

julia> isequal(-0.0, 0)
false

I don’t know if there’s a way to make DataFrame consider 0 and -0.0 as equal.

However, making joins on floating point values is weird and rarely a good idea. In your particular case, maybe the right solution is to round the values to integers? See the difference in the two following examples:

julia> round.(x)
3-element Vector{Float64}:
  1.0
  0.0
 -0.0

julia> round.(Int, x)
3-element Vector{Int64}:
 1
 0
 0

The reason is that == is not the equality test used in joins.
The equality test used is isequal:

julia> isequal(0.0, -0.0)
false

This difference is exactly why joining on -0.0 is disallowed by default, as the result could confuse the user.

Note that the same holds for, e.g., Set or Dict:

julia> Set([0.0, -0.0])
Set{Float64} with 2 elements:
  0.0
  -0.0

julia> Dict(0.0=>1, -0.0=>2)
Dict{Float64, Int64} with 2 entries:
  0.0  => 1
  -0.0 => 2
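
To connect this to the join in the question, here is a minimal sketch of what an anti-join based on isequal does conceptually. It uses only Base; the helper name `naive_antijoin` is made up for illustration and this is not how DataFrames.jl implements joins internally.

```julia
# Hypothetical helper (not a DataFrames.jl function): keep the values of `left`
# that have no isequal-match in `right`.
naive_antijoin(left, right) = filter(v -> !any(w -> isequal(v, w), right), left)

left  = [1.0, 0.0, -0.0]   # same values as df1.x
right = [1, 0, 0]          # same values as df2.x

result = naive_antijoin(left, right)

# 1.0 matches 1 and 0.0 matches 0, because isequal treats them as equal
# across Float64 and Int64. -0.0 has no isequal-match in `right`, so it is kept.
println(result)
```

Under these assumptions the only surviving value is -0.0, which mirrors the one-row result obtained with the categorical column.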

This is unrelated to DataFrames.jl. Julia Base does not consider 0.0 and -0.0 as equal. What DataFrames.jl does is just a consequence.

In general, joins on floats should be discouraged. It is more reliable to use integers, as @sijo suggested.


Base.isequal doesn’t, but Base.== does… I think it’s not obvious to users which function DataFrames.jl uses for this kind of comparison (and whether the default function can be changed by the user), so maybe the corresponding doc sentence should be moved outside the matchmissing bullet point?


It does, but this operator does not produce equivalence classes over values so it cannot be used in such context.

There are two reasons:

  1. == is not guaranteed to return true or false (it returns missing when either argument is missing).
  2. NaN != NaN (which breaks the reflexivity required of an equivalence relation)

(I am writing this to highlight that we did not have a choice between isequal and == and then pick isequal; the point is that == cannot be used.)
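
Both points can be checked in plain Julia. The following is a small sketch using only Base functions; the variable names are arbitrary.

```julia
# Point 1: == uses three-valued logic, so it may return `missing`
# instead of true or false.
eq_missing = (missing == 1)
println(ismissing(eq_missing))       # the comparison result is `missing`

# isequal always returns a Bool, even when `missing` is involved.
println(isequal(missing, 1))
println(isequal(missing, missing))

# Point 2: == is not reflexive for NaN, while isequal is.
println(NaN == NaN)
println(isequal(NaN, NaN))

# The flip side: isequal distinguishes the two signed zeros, == does not.
println(0.0 == -0.0)
println(isequal(0.0, -0.0))
```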

Still, I understand your point that the use of isequal could be made more visible in the documentation. Are you willing to propose a PR? (If not, I will do it.)


Thank you both so much for the clarification! It is really helpful. Some of the data I have in columns is rounded to 2 or 3 decimals, so I don’t think I can use integers. Would there be some other workaround for using joins (other than scaling the values)?

If I replace all the -0.0 values with 0, are there other reasons that joins on floats should be discouraged?

The other case is NaN values.
In general, you may prefer to work with FixedPointDecimals.jl (JuliaMath/FixedPointDecimals.jl on GitHub: Julia fixed-point decimals built from integers), as it is designed for your use case.
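
If staying with Float64 is preferred, one possible way to prepare a column before a join is to replace -0.0 with 0.0 and check for NaN up front. This is only a sketch using Base; `normalize_zero` is a made-up helper name, not something provided by DataFrames.jl.

```julia
# Hypothetical helper: map any zero (including -0.0) to positive zero,
# and leave every other value unchanged.
normalize_zero(v::AbstractFloat) = iszero(v) ? zero(v) : v

x = [1.0, 0.0, -0.0]            # same values as df1.x
x_clean = normalize_zero.(x)    # broadcast over the column

# NaN is the other value the join rejects, so it is worth checking for it too.
has_nan = any(isnan, x_clean)

println(x_clean)
println(has_nan)
```

Assigning the cleaned vector back to the column (for example `df1.x = normalize_zero.(df1.x)`) should then avoid the -0.0 error, since the column no longer contains the values the error message complains about.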


Sure 🙂 done here.


Thank you both so much!