Is there a package to compare if two DataFrames are the same?

xiaodai · August 6, 2020, 12:40am

Is there something with more options than the == in Base for comparing DataFrames? E.g. one where you can set the tolerance for differences in a float column, and that can also compare columns by name even when they are arranged in a different order?

Something like proc compare in SAS.
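
For the column-order part of the question, a minimal sketch (assuming DataFrames.jl is installed; the example frames and the variable names are made up for illustration) is to check that both frames have the same set of column names and then reorder one frame to match the other before comparing:

```julia
using DataFrames

# Two frames with the same data but a different column order (illustrative data).
df1 = DataFrame(a = [1, 2], b = ["x", "y"])
df2 = DataFrame(b = ["x", "y"], a = [1, 2])

# Same set of column names, ignoring order?
same_columns = Set(names(df1)) == Set(names(df2))

# Reorder df2's columns to follow df1's order, then compare as usual.
df2_aligned = select(df2, names(df1))

order_sensitive = df1 == df2           # == also compares the column order
order_free      = df1 == df2_aligned   # comparison after aligning by name
```

After aligning, any of the comparison approaches discussed in this thread can be applied to df1 and df2_aligned.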

nilshg · August 6, 2020, 2:23pm

Not aware of anything baked in, but seems easy enough to do using isapprox?

You’d just have to decide whether you want to judge equality elementwise, columnwise or across the whole DataFrame (i.e. if, say, you have an absolute tolerance of 0.1, do you consider two dfs equal if all elements are within 0.1 of each other but the total difference, either in a column or across all columns, is larger than 0.1?)

Just do something like:

julia> using DataFrames

julia> x = rand(10);             

julia> x2 = copy(x); x2[end] += 0.01
0.7120634304599409

julia> df = DataFrame(a = x, b = x, c = rand(10));  

julia> df2 = DataFrame(a = x, b = x2, c = rand(10)); 

julia> isapprox.(df, df2) 
10×3 DataFrame                                         
│ Row │ a    │ b    │ c    │                           
│     │ Bool │ Bool │ Bool │                          
├─────┼──────┼──────┼──────┤                           
│ 1   │ 1    │ 1    │ 0    │                           
│ 2   │ 1    │ 1    │ 0    │                           
│ 3   │ 1    │ 1    │ 0    │                           
│ 4   │ 1    │ 1    │ 0    │                           
│ 5   │ 1    │ 1    │ 0    │                           
│ 6   │ 1    │ 1    │ 0    │                           
│ 7   │ 1    │ 1    │ 0    │                           
│ 8   │ 1    │ 1    │ 0    │                           
│ 9   │ 1    │ 1    │ 0    │                           
│ 10  │ 1    │ 0    │ 0    │ 

julia> isapprox.(df, df2, atol = 0.15)
10×3 DataFrame                 
│ Row │ a    │ b    │ c    │   
│     │ Bool │ Bool │ Bool │ 
├─────┼──────┼──────┼──────┤                           
│ 1   │ 1    │ 1    │ 0    │                           
│ 2   │ 1    │ 1    │ 0    │  
│ 3   │ 1    │ 1    │ 1    │  
│ 4   │ 1    │ 1    │ 0    │  
│ 5   │ 1    │ 1    │ 0    │  
│ 6   │ 1    │ 1    │ 0    │  
│ 7   │ 1    │ 1    │ 0    │  
│ 8   │ 1    │ 1    │ 0    │  
│ 9   │ 1    │ 1    │ 0    │  
│ 10  │ 1    │ 1    │ 1    │

and then use some combination of any and all depending on how you want to define whether they’re equal or not.
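
As a rough sketch of what such a combination could look like (the data, the tolerance of 0.05 and all variable names are assumptions chosen for illustration), the following contrasts an elementwise verdict with a columnwise total:

```julia
using DataFrames

# Illustrative data: column b differs by 0.04 in both rows.
df  = DataFrame(a = [1.0, 2.0], b = [0.5, 0.7])
df2 = DataFrame(a = [1.0, 2.0], b = [0.54, 0.74])
tol = 0.05

# Elementwise comparison gives a DataFrame of Bool, as in the REPL session above.
within_tol = isapprox.(df, df2; atol = tol)

# Whole-frame verdict: every single element is within tolerance.
all_elements_ok = all(all, eachcol(within_tol))

# Per-column verdict: is every element of that column within tolerance?
column_ok = Dict(n => all(within_tol[!, n]) for n in names(within_tol))

# Stricter columnwise rule: the summed absolute difference per column must stay under tol.
column_totals_ok = all(sum(abs.(df[!, n] .- df2[!, n])) <= tol for n in names(df))
```

Here every element passes on its own, but column b accumulates a total difference of about 0.08, so the stricter columnwise rule rejects it. Which rule is right depends on what "equal" should mean for your data.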

schlichtanders · January 15, 2024, 11:03am

Unfortunately, isapprox fails if a column is, for instance, of DateTime type.

It would be really great if there were a function which knows that DataFrames can have multiple types, some of which need isapprox, some of which need isequal.
Ideally, this would be part of DataFrames.jl.

nilshg · January 15, 2024, 11:46am

I don’t see how this is related to DataFrames - if you are comparing collections of heterogeneous types, you need to decide what the right way to compare them is. You seem to want something like

mycompare(x::Number, y::Number) = isapprox(x, y)
mycompare(x, y) = isequal(x, y)

which is simple enough to define to suit your specific requirements.
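
To show how such a helper could be applied to whole DataFrames, here is a minimal sketch. The function name frames_match, the keyword forwarding and the example data are assumptions for illustration, not part of DataFrames.jl:

```julia
using DataFrames, Dates

# Numbers are compared approximately, everything else exactly.
mycompare(x::Number, y::Number; kwargs...) = isapprox(x, y; kwargs...)
mycompare(x, y; kwargs...) = isequal(x, y)

# Hypothetical helper: same size, same column names, and all elements compare as equal.
function frames_match(df1, df2; kwargs...)
    size(df1) == size(df2) || return false
    names(df1) == names(df2) || return false
    return all(all(mycompare.(c1, c2; kwargs...))
               for (c1, c2) in zip(eachcol(df1), eachcol(df2)))
end

df1 = DataFrame(t = [DateTime(2024, 1, 15)], x = [1.0],         s = ["a"])
df2 = DataFrame(t = [DateTime(2024, 1, 15)], x = [1.0 + 1e-12], s = ["a"])
df3 = DataFrame(t = [DateTime(2024, 1, 16)], x = [1.0],         s = ["a"])

match_12 = frames_match(df1, df2)   # float column differs only by rounding noise
match_13 = frames_match(df1, df3)   # DateTime column differs
```

Because missing is not a Number, two missing values compare as equal via isequal here, while a missing against a number compares as unequal; adjust the methods if you need different semantics.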

schlichtanders · January 15, 2024, 11:57am

That is impressively simple, thank you for your help.

Still, I think it would be a good addition to DataFrames.jl. There is more to comparing two DataFrames than comparing their values: the columns also need to be compared.

But I admit, this little helper is very compact and self-explanatory.

sijo · January 15, 2024, 12:30pm

DataFrames already has a definition of isapprox that does the right thing comparing columns by pairs:

https://github.com/JuliaData/DataFrames.jl/blob/3e290274d3c201e8bfe903f0d326e78c38fc0fef/src/abstractdataframe/abstractdataframe.jl#L514-L533

"""
    isapprox(df1::AbstractDataFrame, df2::AbstractDataFrame;
             rtol::Real=atol>0 ? 0 : √eps, atol::Real=0,
             nans::Bool=false, norm::Function=norm)

Inexact equality comparison. `df1` and `df2` must have the same size and column names.
Return  `true` if `isapprox` with given keyword arguments
applied to all pairs of columns stored in `df1` and `df2` returns `true`.
"""
function Base.isapprox(df1::AbstractDataFrame, df2::AbstractDataFrame;
                       atol::Real=0, rtol::Real=atol>0 ? 0 : √eps(),
                       nans::Bool=false, norm::Function=norm)
    if size(df1) != size(df2)
        throw(DimensionMismatch("dimensions must match: a has dims " *
                                "$(size(df1)), b has dims $(size(df2))"))
    end
    if !isequal(index(df1), index(df2))
        throw(ArgumentError("column names of passed data frames do not match"))
    end
    return all(isapprox.(eachcol(df1), eachcol(df2), atol=atol, rtol=rtol, nans=nans, norm=norm))
end
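
A short usage sketch of this method on purely numeric frames (the data and the tolerance are made up for illustration). Note that each pair of columns is compared as a whole vector, so the tolerance applies to the norm of the column difference rather than to each element separately:

```julia
using DataFrames

df1 = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0])
df2 = DataFrame(a = [1.0, 2.0], b = [3.0, 4.05])

# Default tolerances: column b differs by 0.05, far more than the default rtol allows.
default_result = isapprox(df1, df2)

# With an absolute tolerance of 0.1 the column difference (norm of about 0.05) is accepted.
atol_result = isapprox(df1, df2; atol = 0.1)
```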

nilshg · January 15, 2024, 1:29pm

Ha, that function seems to have been added just when I proposed the isapprox solution above.

But this also doesn’t help Stephan, who wants non-numerical types to be compared by isequal, which this definition also doesn’t do:

julia> df1 = DataFrame(a = [1,2], b = ["a", "b"]); df2 = copy(df1);

julia> isapprox(df1, df2)
ERROR: MethodError: no method matching -(::String, ::String)

sijo · January 15, 2024, 1:53pm

Indeed, I was just replying to the part about the columns also needing to be compared.

The scalar comparison issue remains but as you said above, that’s not a DataFrames issue.

schlichtanders · January 15, 2024, 2:24pm

Cool.

For me it is a DataFrame issue. DataFrames have different types of columns, by nature. Hence if someone shows me a method which is intended to work generically on DataFrames, I expect it to work on DataFrames with different kinds of column types.

The isapprox on DataFrames is hence a bit odd to me, as it only works for DataFrames of Numbers, but the signature really suggests that it works for all kinds of DataFrames.

sijo · January 15, 2024, 2:46pm

The documentation says “isapprox with given keyword arguments applied to all pairs of columns” which implies that this method must be defined for all pairs of column types…

I agree it could be made more explicit in the documentation. But this restriction makes sense:

The column types can be anything; there’s no way DataFrames.jl can know them all.
DataFrames.jl is not the right place for defining isapprox on DateTime or other non-DataFrames types. Actually, defining these methods in DataFrames.jl would be type piracy!

schlichtanders · January 15, 2024, 2:59pm

Yeah, there is some danger of type piracy in the air.

I see two ways DataFrames’ isapprox could be implemented to still support different column types:

While restricting isapprox to only Number columns apparently won’t work, one could make this configurable (it may even default to Number, but could be set to any type, including of course Union types).
You could have a try-catch fallback in case isapprox is not defined. That is probably slow, admittedly, but would solve this.
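
A rough sketch of the try-catch idea in user code, without extending Base.isapprox. The names approx_or_equal and frames_approx are hypothetical, and the sketch assumes that an unsupported column type surfaces as a MethodError, as in the String example earlier in the thread; any other error is rethrown:

```julia
using DataFrames, LinearAlgebra  # LinearAlgebra provides isapprox for arrays

# Hypothetical helper: try isapprox on a pair of columns, fall back to isequal
# when no suitable method exists for the column's element type.
function approx_or_equal(c1, c2; kwargs...)
    try
        return isapprox(c1, c2; kwargs...)
    catch err
        err isa MethodError || rethrow()
        return isequal(c1, c2)
    end
end

# Hypothetical frame-level comparison built on the helper above.
function frames_approx(df1, df2; kwargs...)
    size(df1) == size(df2) || return false
    names(df1) == names(df2) || return false
    return all(approx_or_equal(c1, c2; kwargs...)
               for (c1, c2) in zip(eachcol(df1), eachcol(df2)))
end

df1 = DataFrame(a = [1.0, 2.0],         b = ["a", "b"])
df2 = DataFrame(a = [1.0, 2.0 + 1e-12], b = ["a", "b"])
df3 = DataFrame(a = [1.0, 2.0],         b = ["a", "c"])

result_12 = frames_approx(df1, df2)   # numeric noise tolerated, strings identical
result_13 = frames_approx(df1, df3)   # string column differs
```

The downside mentioned above still applies: exceptions are slow, and silently falling back to isequal can hide the fact that a column type has no meaningful notion of approximate equality.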

In end-user code which is not a package used anywhere, you could probably also safely define the isapprox fallback yourself:

isapprox(a, b; kargs...) = isequal(a, b)

which at least semantically makes sense, as if something isequal, then it is also approximately equal.

But even though this is not type piracy, it is probably not a good idea to implement it in a package because people might be confused if no error is thrown for their non-number types.

Nathan_Boyer · January 15, 2024, 3:16pm

I’m not sure it has the functionality you seek (didn’t look into it), but this package announcement seems to be aligned with your goal of better Data Frame comparisons. May want to add new features/requests there:
