Something that has more options than the == in Base for comparing dataframes? E.g. one where you can set the tolerance of difference in a float column. And also can compare columns by name even though they may be arranged in different orders?
Not aware of anything baked in, but seems easy enough to do using isapprox?
You'd just have to decide whether you want to determine equality elementwise, columnwise, or across the whole DataFrame (i.e. with an absolute tolerance of 0.1, do you consider two dfs equal if all elements are within 0.1 of each other but the total difference, either in a column or across all columns, is larger than 0.1?)
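To make the distinction concrete, here is a small sketch using plain vectors as stand-ins for two float columns (the values are made up for illustration). Applied to arrays, isapprox compares the norm of the difference against the tolerance, so it can disagree with an elementwise check:

```julia
using LinearAlgebra  # the array method of isapprox is norm-based

a = [0.0, 0.0]
b = [0.08, 0.08]

# Elementwise: every pair of values is within atol = 0.1 of each other.
elementwise = all(isapprox.(a, b; atol = 0.1))

# Whole column: norm(a - b) is about 0.113, which exceeds atol = 0.1.
columnwise = isapprox(a, b; atol = 0.1)
```

Here `elementwise` is true while `columnwise` is false, so the two notions of "equal within 0.1" really do differ.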
isapprox unfortunately fails if the column is for instance of DateTime type.
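A minimal way to see this, assuming only the Dates standard library (no isapprox method is defined for DateTime there, as far as I know):

```julia
using Dates

t = DateTime(2020, 1, 1)

# isapprox has no method for DateTime, so the call throws a MethodError.
threw_method_error = try
    isapprox(t, t)
    false
catch err
    err isa MethodError
end

# isequal is defined for all types and works here.
same = isequal(t, t)
```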
It would be really great if there were a function which knows that DataFrames can have multiple types, some of which need isapprox, some of which need isequal.
Ideally, this would be part of DataFrames.jl.
I don't see how this is related to DataFrames - if you are comparing collections of heterogeneous types, you need to decide what the right way to compare them is. You seem to want something like
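A minimal sketch of a helper along these lines, not necessarily what was originally posted. The names `approx_or_equal` and `tables_match` are hypothetical. The sketch assumes the table exposes its columns via `propertynames` and `getproperty`, which holds for a NamedTuple of vectors and, to my knowledge, for a DataFrame as well:

```julia
using Dates, LinearAlgebra

# Hypothetical helper: numeric columns are compared with isapprox,
# everything else (DateTime, String, ...) with isequal.
approx_or_equal(a::AbstractVector{<:Number}, b::AbstractVector{<:Number}; kwargs...) =
    length(a) == length(b) && isapprox(a, b; kwargs...)
approx_or_equal(a, b; kwargs...) = isequal(a, b)

# Hypothetical: match columns by name, so column order does not matter.
function tables_match(x, y; kwargs...)
    Set(propertynames(x)) == Set(propertynames(y)) || return false
    return all(approx_or_equal(getproperty(x, n), getproperty(y, n); kwargs...)
               for n in propertynames(x))
end

# NamedTuples of vectors stand in for two DataFrames with reordered columns.
x = (t = [DateTime(2020, 1, 1)], v = [1.0])
y = (v = [1.0 + 1e-10], t = [DateTime(2020, 1, 1)])

loose  = tables_match(x, y)                # default rtol for Float64
strict = tables_match(x, y; rtol = 1e-12)  # tighter tolerance
```

With the default tolerance the two tables match; with `rtol = 1e-12` they do not. Note that a column containing missing values has a non-Number element type and therefore falls back to isequal in this sketch.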
That is impressively simple, thank you for your help.
Still, I think it would be a good addition to DataFrames.jl. There is more to comparing two DataFrames than comparing their values: the columns also need to be compared.
But I admit, this little helper is very compact and self-explaining.
For me it is a DataFrame issue. DataFrames have different types of columns, by nature. Hence if someone shows me a method which is intended to work generically on DataFrames, I expect it to work on DataFrames with different kinds of column types.
The isapprox on DataFrames is hence a bit odd to me, as it only works for DataFrames of Numbers, but the signature really suggests that it works for all kinds of DataFrames.
The documentation says "isapprox with given keyword arguments applied to all pairs of columns", which implies that this method must be defined for all pairs of column types…
I agree it could be made more explicit in the documentation. But this restriction makes sense:
the column types can be anything, there's no way DataFrames.jl can know them all.
DataFrames.jl is not the right place for defining isapprox on DateTime or other non-DataFrames types. Actually defining these methods in DataFrames.jl would be type piracy!
Yeah there is some danger of type-piracy in the air.
I see two ways DataFrames' isapprox could be implemented to still support different column types:
While restricting isapprox to only Number columns apparently won't work, one could make this configurable (it may even default to Number, but could be set to any type, including of course Union types).
You can have a try-catch fallback in case isapprox is not defined. That is probably slow, admittedly, but would solve this.
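Both ideas can be sketched roughly as follows. This is only an illustration on plain vectors, not how DataFrames.jl is implemented; the function names and the `approx_type` keyword are hypothetical:

```julia
using Dates, LinearAlgebra

# Option 1 (hypothetical): the caller configures which element types
# are compared approximately; everything else uses isequal.
function compare_column(a, b; approx_type = Number, kwargs...)
    length(a) == length(b) || return false
    if eltype(a) <: approx_type && eltype(b) <: approx_type
        return isapprox(a, b; kwargs...)
    end
    return isequal(a, b)
end

# Option 2 (hypothetical): try isapprox on each pair of elements and
# fall back to isequal when no isapprox method exists for the types.
function approx_or_equal_scalar(x, y; kwargs...)
    try
        return isapprox(x, y; kwargs...)
    catch err
        err isa MethodError || rethrow()
        return isequal(x, y)
    end
end

compare_column_trycatch(a, b; kwargs...) =
    length(a) == length(b) &&
    all(approx_or_equal_scalar(x, y; kwargs...) for (x, y) in zip(a, b))

dates = [DateTime(2020, 1, 1)]
r1 = compare_column([1.0], [1.05]; atol = 0.1)                         # approximate
r2 = compare_column([1.0], [1.05]; approx_type = Integer, atol = 0.1)  # exact
r3 = compare_column_trycatch(dates, dates; atol = 0.1)                 # falls back
```

In this sketch `r1` and `r3` are true, while `r2` is false because Float64 is excluded from approximate comparison by the `approx_type` setting. The try-catch variant pays for exception handling on every non-numeric element, which is the slowness mentioned above.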
In end-user code which is not a package used anywhere, you could probably also safely define the isapprox fallback yourself:
Base.isapprox(a, b; kwargs...) = isequal(a, b)
This at least makes sense semantically: if two values are isequal, then they are also approximately equal.
But even though this is not type piracy, it is probably not a good idea to implement it in a package because people might be confused if no error is thrown for their non-number types.
I'm not sure it has the functionality you seek (I didn't look into it), but this package announcement seems to be aligned with your goal of better DataFrame comparisons. You may want to add new feature requests there: