Is there a package to compare if two DataFrames are the same?

Something that has more options than the == in Base for comparing DataFrames? E.g. one where you can set the tolerance for differences in a float column, and that can also compare columns by name even though they may be arranged in different orders?

Something like proc compare in SAS.

Not aware of anything baked in, but seems easy enough to do using isapprox?

You’d just have to decide whether equality should hold elementwise, columnwise, or across the whole DataFrame. For example, with an absolute tolerance of 0.1, do you consider two dfs equal if every element is within 0.1 of its counterpart but the total difference, either within a column or across all columns, is larger than 0.1?

Just do something like:

julia> using DataFrames

julia> x = rand(10);             

julia> x2 = copy(x); x2[end] += 0.01
0.7120634304599409

julia> df = DataFrame(a = x, b = x, c = rand(10));  

julia> df2 = DataFrame(a = x, b = x2, c = rand(10)); 

julia> isapprox.(df, df2) 
10×3 DataFrame
│ Row │ a    │ b    │ c    │
│     │ Bool │ Bool │ Bool │
├─────┼──────┼──────┼──────┤
│ 1   │ 1    │ 1    │ 0    │
│ 2   │ 1    │ 1    │ 0    │
│ 3   │ 1    │ 1    │ 0    │
│ 4   │ 1    │ 1    │ 0    │
│ 5   │ 1    │ 1    │ 0    │
│ 6   │ 1    │ 1    │ 0    │
│ 7   │ 1    │ 1    │ 0    │
│ 8   │ 1    │ 1    │ 0    │
│ 9   │ 1    │ 1    │ 0    │
│ 10  │ 1    │ 0    │ 0    │

julia> isapprox.(df, df2, atol = 0.15)
10×3 DataFrame
│ Row │ a    │ b    │ c    │
│     │ Bool │ Bool │ Bool │
├─────┼──────┼──────┼──────┤
│ 1   │ 1    │ 1    │ 0    │
│ 2   │ 1    │ 1    │ 0    │
│ 3   │ 1    │ 1    │ 1    │
│ 4   │ 1    │ 1    │ 0    │
│ 5   │ 1    │ 1    │ 0    │
│ 6   │ 1    │ 1    │ 0    │
│ 7   │ 1    │ 1    │ 0    │
│ 8   │ 1    │ 1    │ 0    │
│ 9   │ 1    │ 1    │ 0    │
│ 10  │ 1    │ 1    │ 1    │

and then use some combination of any and all depending on how you want to define whether they’re equal or not.
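As a rough sketch of what those reductions could look like (the small frames and the variable names here are made up for illustration, and the 0.1 tolerance is just the figure from the question above):

```julia
using DataFrames

# Illustrative data: column a differs by about 0.05 in every row, column b is identical.
df  = DataFrame(a = [1.0, 2.0, 3.0], b = [1.0, 2.0, 3.0])
df2 = DataFrame(a = [1.05, 2.05, 3.05], b = [1.0, 2.0, 3.0])

# Elementwise comparison, giving a DataFrame of Bool as in the REPL output above.
within_tol = isapprox.(df, df2, atol = 0.1)

# Strictest elementwise reading: every cell must be within tolerance.
all_elements_close = all(all, eachcol(within_tol))

# Per-column verdicts, keyed by column name.
columns_close = Dict(name => all(within_tol[!, name]) for name in names(within_tol))

# Aggregate reading: the summed absolute difference of each column must stay within tolerance.
total_diff_ok = all(sum(abs.(df[!, n] .- df2[!, n])) <= 0.1 for n in names(df))
```

With this data the elementwise check passes while the aggregate one does not, since the three differences of roughly 0.05 add up to about 0.15. Which of the two counts as "equal" is exactly the decision you have to make.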

isapprox unfortunately fails if the column is for instance of DateTime type.

It would be really great if there were a function which knows that DataFrames can have multiple column types, some of which need isapprox and some of which need isequal.
In the best case, this would be part of DataFrames.jl.

I don’t see how this is related to DataFrames - if you are comparing collections of heterogeneous types, you need to decide what the right way to compare them is. You seem to want something like

mycompare(x::Number, y::Number) = isapprox(x, y)
mycompare(x, y) = isequal(x, y)

which is simple enough to define to suit your specific requirements.
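For what it’s worth, here is a minimal sketch of how that helper might be lifted to whole DataFrames, matching columns by name rather than by position. The name frames_match is hypothetical, and forwarding keyword arguments such as atol to isapprox is my own addition, not something DataFrames.jl provides:

```julia
using DataFrames

# Numbers are compared approximately; keyword arguments (atol, rtol, ...) are forwarded.
mycompare(x::Number, y::Number; kwargs...) = isapprox(x, y; kwargs...)
# Everything else falls back to isequal; the keyword arguments are ignored here.
mycompare(x, y; kwargs...) = isequal(x, y)

# Hypothetical helper: true if both frames have the same column names (in any order),
# the same number of rows, and every pair of cells passes mycompare.
function frames_match(df1::AbstractDataFrame, df2::AbstractDataFrame; kwargs...)
    Set(names(df1)) == Set(names(df2)) || return false
    nrow(df1) == nrow(df2) || return false
    return all(names(df1)) do name
        all(mycompare.(df1[!, name], df2[!, name]; kwargs...))
    end
end

# Usage: the column order differs and one float is off by a tiny amount.
a = DataFrame(x = [1.0, 2.0], s = ["a", "b"])
b = DataFrame(s = ["a", "b"], x = [1.0 + 1e-12, 2.0])
frames_match(a, b)
```

Rows are still compared positionally, so this assumes both frames are sorted the same way.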

that is impressively simple, thank you for your help.

Still, I think it would be a good addition to DataFrames.jl. There is more to comparing two DataFrames than comparing their values: the columns also need to be compared.

But I admit, this little helper is very compact and self-explaining.

DataFrames already has a definition of isapprox that does the right thing comparing columns by pairs:

Ha, that function seems to have been added just when I proposed the isapprox solution above.

But this also doesn’t help Stephan, who wants non-numerical types to be compared by isequal, which this definition also doesn’t do:

julia> df1 = DataFrame(a = [1,2], b = ["a", "b"]); df2 = copy(df1);

julia> isapprox(df1, df2)
ERROR: MethodError: no method matching -(::String, ::String)

Indeed, I was just replying to this part:

The scalar comparison issue remains but as you said above, that’s not a DataFrames issue.

Cool.

For me it is a DataFrames issue. DataFrames have different types of columns, by nature. Hence if someone shows me a method which is intended to work generically on DataFrames, I expect it to work on DataFrames with different kinds of column types.

The isapprox on DataFrames is hence a bit odd to me, as it only works for DataFrames of Numbers, but the signature really suggests that it works for all kinds of DataFrames.

The documentation says “isapprox with given keyword arguments applied to all pairs of columns” which implies that this method must be defined for all pairs of column types…

I agree it could be made more explicit in the documentation. But this restriction makes sense:

  • the column types can be anything, there’s no way DataFrames.jl can know them all.
  • DataFrames.jl is not the right place for defining isapprox on DateTime or other non-DataFrames types. Actually defining these methods in DataFrames.jl would be type piracy!

Yeah there is some danger of type-piracy in the air.

I see two ways DataFrames’ isapprox could be implemented so that it still supports different column types:

  1. While restricting isapprox to only Number columns won’t work apparently, one could make this a configuration (which may even default to Number, but can be set to any type, including of course Union types)

  2. You can have a try-catch fallback in case isapprox is not defined. That is probably slow, admittedly, but it would solve this.
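To make the two options a bit more concrete, here is a rough sketch at the level of single cells. Both function names and the approxtype keyword are hypothetical and not part of DataFrames.jl:

```julia
# Option 1 (hypothetical): the caller configures which types are compared approximately.
# approxtype defaults to Number but can be any type, including a Union.
function compare_cell(x, y; approxtype::Type = Number, kwargs...)
    if x isa approxtype && y isa approxtype
        return isapprox(x, y; kwargs...)
    end
    return isequal(x, y)
end

# Option 2 (hypothetical): try isapprox first and fall back to isequal
# when no applicable method exists. Other kinds of errors are rethrown.
function approx_or_equal(x, y; kwargs...)
    try
        return isapprox(x, y; kwargs...)
    catch err
        err isa MethodError || rethrow()
        return isequal(x, y)
    end
end

compare_cell(1.0, 1.05; atol = 0.1)   # numeric cells go through isapprox
approx_or_equal("a", "a")             # strings end up in isequal
```

The first variant is explicit about which types get a tolerance. The second one needs no configuration, but it also swallows a MethodError raised deeper inside an isapprox implementation, which may hide real bugs.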

In end-user code which is not a package used anywhere, you could probably also safely define the isapprox fallback yourself:

Base.isapprox(a, b; kwargs...) = isequal(a, b)

which at least semantically makes sense, as if something isequal, then it is also approximately equal.

But even though this is not type piracy, it is probably not a good idea to implement it in a package because people might be confused if no error is thrown for their non-number types.

I’m not sure it has the functionality you seek (didn’t look into it), but this package announcement seems to be aligned with your goal of better Data Frame comparisons. May want to add new features/requests there: