Compare dataframes regardless of column order

fran94 · February 3, 2025, 10:13am

Hi!
I’m probably doing something wrong but I cannot seem to find an easy solution to this. Imagine I have a dataframe (e.g. the Palmer penguins data) and I convert it from wide to tidy format:

using PalmerPenguins, DataFrames

df = DataFrame(PalmerPenguins.load())

# Create an ID column
df.id = 1:size(df,1)
df_long = stack(df, Not([:species, :id]))

Now if I want to go back to the original table, I should do something like:

df_wide = unstack(df_long, :variable, :value)

This is all good, except that while I know that df contains the same data as df_wide, both isequal(df, df_wide) and isequal.(df, df_wide) fail (either by returning false or saying that column names cannot be broadcasted since they are not in the same order). Is there a way to compare two dataframes regardless of column order (instead of me writing a for loop to compare each column separately)? Thanks!

jules · February 3, 2025, 10:20am

Maybe use a dict to handle the order invariance?

df1 = DataFrame(a = [1, 2], b = [3, 4])
df2 = DataFrame(b = [3, 4], a = [1, 2])

Dict(names(df1) .=> eachcol(df1)) == Dict(names(df2) .=> eachcol(df2))
# true
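A minimal sketch of the same idea wrapped in a function, in case it is useful. The name `isequal_anyorder` is made up for illustration and is not part of DataFrames.jl. It checks that both data frames have the same set of column names, reorders the second one to match the first, and then defers to `isequal`. Using `isequal` rather than `==` matters when the data contains `missing` (as the penguins data does), since `==` on columns with `missing` yields `missing` instead of `true`.

```julia
using DataFrames

# Hypothetical helper (not a DataFrames.jl function): compare two data frames
# while ignoring the order of their columns. Row order still matters.
function isequal_anyorder(df1::AbstractDataFrame, df2::AbstractDataFrame)
    # Both data frames must have exactly the same column names.
    issetequal(names(df1), names(df2)) || return false
    # Reorder df2's columns to match df1 (no copy), then compare column by column.
    return isequal(df1, select(df2, names(df1); copycols=false))
end

df1 = DataFrame(a = [1, 2], b = [3, missing])
df2 = DataFrame(b = [3, missing], a = [1, 2])

isequal_anyorder(df1, df2)
# true
```

Because `isequal` compares values, columns with different element types but equal contents (for example after a `stack`/`unstack` round trip) should still compare as equal.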

fran94 · February 3, 2025, 10:44am

This does work and does what I want it to do. I still think that in general an equality relationship between dataframes should not depend on order (unlike e.g. matrices), so maybe I could ask whether there is something that can be done from the package directly to be more user-friendly. Thank you!

MatthijsCox · February 3, 2025, 2:52pm

I guess it would be nice to have functions like issubset and issetequal for a DataFrame, but I couldn’t find such functions in DataFrames.jl yet.

One of us would have to open a PR in DataFrames.jl to get this feature in. But I bet there are all kinds of edge cases to consider. Like do you really want ‘set’ equality for your case, or you also want to check for the same number of duplicate rows?
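To make the duplicate-row question concrete, here is a rough sketch of a comparison that ignores both column order and row order but still counts duplicates (multiset equality on rows). The function name `isequal_rows_anyorder` is hypothetical, and the approach assumes every column is sortable, which may not hold for mixed-type columns.

```julia
using DataFrames

# Hypothetical helper: equal if both data frames hold the same rows with the
# same multiplicities, regardless of row order and column order.
# Assumes all columns can be sorted (isless is defined for their values).
function isequal_rows_anyorder(df1::AbstractDataFrame, df2::AbstractDataFrame)
    issetequal(names(df1), names(df2)) || return false
    cols = names(df1)
    # Align column order, sort rows by all columns, then compare.
    return isequal(sort(select(df1, cols)), sort(select(df2, cols)))
end

a = DataFrame(x = [1, 1, 2], y = ["a", "a", "b"])
b = DataFrame(y = ["b", "a", "a"], x = [2, 1, 1])  # same rows, shuffled
c = DataFrame(x = [1, 2, 2], y = ["a", "b", "b"])  # same distinct rows, different counts

isequal_rows_anyorder(a, b)  # true
isequal_rows_anyorder(a, c)  # false
```

A pure set comparison of rows would treat `a` and `c` as equal, so which behaviour is wanted has to be decided up front.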

rocco_sprmnt21 · February 3, 2025, 9:11pm

Set(eachcol(df)) == Set(eachcol(df_wide))

rafael.guerra · February 3, 2025, 11:30pm

And when it is important to differentiate the column names as well:

Set(zip(names(df1), eachcol(df1))) == Set(zip(names(df2), eachcol(df2)))
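A small illustration of why the names matter, using toy data that is not from the thread: when the same values sit under swapped column names, the columns-only comparison still reports equality, while the name-aware one does not.

```julia
using DataFrames

df1 = DataFrame(a = [1, 2], b = [3, 4])
df3 = DataFrame(a = [3, 4], b = [1, 2])  # same columns, names swapped

# Compares column contents only, so the swap goes unnoticed.
cols_only = Set(eachcol(df1)) == Set(eachcol(df3))
# true

# Compares (name, column) pairs, so the swap is detected.
with_names = Set(zip(names(df1), eachcol(df1))) == Set(zip(names(df3), eachcol(df3)))
# false
```

The columns-only version also collapses repeated columns, since a `Set` keeps only one copy of equal vectors.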

Dan · February 3, 2025, 11:35pm

Another method:

julia> isequal(map(t->stack(t, Not(:id), :id; view=true), (df, df_wide))...)
true

isequal plays nice with missing and compares DataFrames.
The t->... map “iterates” on all the cells of the DataFrame.

But this probably isn’t robust to row order changes.

rocco_sprmnt21 · February 4, 2025, 8:21am

In general, it is certainly necessary to check the names too.
But, in this case, the names are taken from the same set in a different order.
I wonder if there is a function in the Tables package that directly gives the dictionary of rows or columns that is suitable for “our” case.

fran94 · February 4, 2025, 2:42pm

Thank you for all the inputs!
I’ll attach my first solution as well, which was just me looping through the columns:

# true only if every column of df matches the same-named column of df_wide
all(isequal(df[!, col], df_wide[!, col]) for col in names(df))

I’m guessing this also doesn’t consider some edge case that I’m not thinking about now. I guess the question is whether it is sensible to send in a PR to DataFrames.jl proposing to “promote” one of these as the isequal() method for DataFrames? Not sure if it is such a common use case since I also came across it kind of recently, but might be worthwhile…
