Hi!
I’m probably doing something wrong but I cannot seem to find an easy solution to this. Imagine I have a dataframe (e.g. the Palmer penguins data) and I convert it from wide to tidy format:
using PalmerPenguins, DataFrames
df = DataFrame(PalmerPenguins.load())
#Create an ID column
df.id = 1:size(df,1)
df_long = stack(df, Not([:species, :id]))
Now if I want to go back to the original table, I should do something like:
df_wide = unstack(df_long, :variable, :value)
This is all good, except that while I know that df contains the same data as df_wide, both isequal(df, df_wide) and isequal.(df, df_wide) fail (either by returning false or saying that column names cannot be broadcasted since they are not in the same order). Is there a way to compare two dataframes regardless of column order (instead of me writing a for loop to compare each column separately)? Thanks!
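One possible sketch for this comparison, assuming both tables are expected to have the same set of column names: reorder the columns of one table to match the other, then call `isequal` as usual. The helper name `isequal_anyorder` is hypothetical (it is not part of DataFrames.jl), and the toy tables below stand in for `df` and `df_wide`.

```julia
using DataFrames

# Hypothetical helper, not a DataFrames.jl function:
# compare two data frames while ignoring column order.
function isequal_anyorder(a::AbstractDataFrame, b::AbstractDataFrame)
    # The column names must match as sets, otherwise the tables differ.
    issetequal(names(a), names(b)) || return false
    # Put b's columns in a's order, then compare column by column.
    return isequal(a, select(b, names(a)))
end

# Toy data (illustrative only): same content, different column order.
a = DataFrame(id = 1:3, x = [1.0, missing, 3.0], y = ["p", "q", "r"])
b = select(a, :y, :id, :x)

isequal(a, b)           # false, because the column order differs
isequal_anyorder(a, b)  # true
```

Since this relies on `isequal`, `missing` values in the same position compare as equal, and row order still matters.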
This does work and does what I want it to do. I still think that in general an equality relationship between dataframes should not depend on order (unlike e.g. matrices), so maybe I could ask whether there is something that can be done from the package directly to be more user-friendly. Thank you!
I guess it would be nice to have functions like issubset and issetequal for a DataFrame, but I couldn’t find such functions in DataFrames.jl yet.
One of us would have to open a PR in DataFrames.jl to get this feature in. But I bet there are all kinds of edge cases to consider. Like do you really want ‘set’ equality for your case, or you also want to check for the same number of duplicate rows?
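To make the duplicate-row edge case concrete, here is a small sketch of the two notions of row equality. It assumes both tables already have their columns in the same order. The names `rowtuples` and `rowcounts` are hypothetical helpers, not DataFrames.jl functions.

```julia
using DataFrames

# Toy tables (illustrative only): the same distinct rows,
# but `a` contains one row twice.
a = DataFrame(x = [1, 1, 2], y = ["p", "p", "q"])
b = DataFrame(x = [2, 1], y = ["q", "p"])

# copy on a DataFrameRow gives a NamedTuple, which can be hashed and compared.
rowtuples(df) = [copy(row) for row in eachrow(df)]

# Set equality: row order and duplicate rows are both ignored.
same_row_set = issetequal(rowtuples(a), rowtuples(b))   # true

# Multiset equality: row order is ignored, but duplicate counts must match.
function rowcounts(df)
    counts = Dict{Any,Int}()
    for t in rowtuples(df)
        counts[t] = get(counts, t, 0) + 1
    end
    return counts
end
same_row_multiset = rowcounts(a) == rowcounts(b)        # false
```

Which of the two a package-level function should implement is exactly the kind of design question a PR would have to settle.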
In general, it is certainly necessary to check the names too.
But, in this case, the names are taken from the same set in a different order.
I wonder if there is a function in the Tables package that directly gives the dictionary of rows or columns that is suitable for “our” case.
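I am not sure which Tables.jl function would fit best here, but as a rough sketch of the dictionary idea using only DataFrames.jl: a plain `Dict` from column name to column vector has no column order, so comparing two such dictionaries checks both the names and the data. The helper name `coldict` is hypothetical.

```julia
using DataFrames

# Toy data (illustrative only): same content, different column order.
a = DataFrame(id = 1:3, x = [1.0, missing, 3.0])
b = select(a, :x, :id)

# Map each column name to its column vector; a Dict has no column order.
coldict(df) = Dict(n => df[!, n] for n in names(df))

# isequal (rather than ==) so that missing values compare as equal.
isequal(coldict(a), coldict(b))   # true
```

A missing or extra column shows up as a differing key set, so the names are checked along with the values.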
Thank you for all the inputs!
I’ll attach my first solution as well, which was just me looping through the columns:
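for col in names(df)
    # Print the result for each column; a bare isequal call inside the loop shows nothing
    println(col, ": ", isequal(df[!, col], df_wide[!, col]))
end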
I’m guessing this also doesn’t consider some edge case that I’m not thinking about now. I guess the question is whether it is sensible to send in a PR to DataFrames.jl proposing to “promote” one of these as the isequal() method for DataFrames. I’m not sure it is such a common use case, since I only came across it fairly recently, but it might be worthwhile…