Here is a small example:
mutable struct Mystring
    str::String
end
using DataFrames
strs = ["a","b","a","b"];
mystrs = Mystring.(strs);
df = DataFrame(col_str=strs, col_mystr=mystrs);
println(df)
println(groupby(df, :col_str))
println(groupby(df, :col_mystr))
I would expect both groupby() calls to return the same groups (g1 gets rows 1 and 3, and g2 gets rows 2 and 4). Instead, groupby(df, :col_mystr) returns 4 groups, each with a single row.
I tried overloading the simple comparison operators, but the result did not change:
Base.:(==)(str1::Mystring, str2::Mystring) = str1.str == str2.str
Base.:(>)(str1::Mystring, str2::Mystring) = str1.str > str2.str
Base.:(<)(str1::Mystring, str2::Mystring) = str1.str < str2.str
It is also important that this behavior is specific to mutable struct; if Mystring is declared as an immutable struct, groupby() works.
What am I missing?
Thanks
Very interesting! Great to see someone experimenting with grouped dataframes and custom types.
I’m glad that this works for an immutable type. Though I would have expected hash to be the thing you need to define rather than ==.
I don’t know what’s going on with mutable types. This is very interesting. I am pinging @bkamins on this.
Hopefully when we understand the behavior we can add this to the docs.
This is the standard behavior: equality is checked with isequal, not ==. Otherwise missing, for instance, would not be handled correctly, because comparing it with == produces missing rather than a Bool. You need to define isequal and, as a consequence, also hash for your type.
See:
isequal(x, y)
Similar to ==, except for the treatment of floating point numbers and of missing values. isequal treats all floating-point NaN values as equal to each other, treats -0.0 as unequal to 0.0, and missing as equal to missing. Always returns a Bool value.
isequal is the comparison function used by hash tables (Dict). isequal(x,y) must imply that hash(x) == hash(y).
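To make the difference concrete, here is a minimal sketch using only Base Julia. It shows the cases from the docstring above and a Dict lookup, which relies on hash and isequal rather than ==. The dictionary d is a hypothetical example, not something from the original post.

```julia
# == propagates missing, so it cannot serve as a Bool-valued equality test.
@show missing == missing          # missing
@show isequal(missing, missing)   # true

# Floating-point special cases differ between == and isequal.
@show NaN == NaN                  # false
@show isequal(NaN, NaN)           # true
@show -0.0 == 0.0                 # true
@show isequal(-0.0, 0.0)          # false

# Dict lookup uses hash and isequal, so NaN and missing work as keys.
d = Dict(NaN => "nan", missing => "missing")
@show d[NaN]                      # "nan"
@show d[missing]                  # "missing"
```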
This is implied by this line in the groupby documentation:
GroupedDataFrame also supports the dictionary interface.
But indeed, we could be explicit here.
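As a rough illustration of what the dictionary interface looks like in practice, the sketch below groups a small table and looks groups up by key. The val column and the variable names are made up for this example, and the calls assume a reasonably recent DataFrames.jl release.

```julia
using DataFrames

# Hypothetical data: same string column as in the question, plus a value column.
df = DataFrame(col_str = ["a", "b", "a", "b"], val = 1:4)
gdf = groupby(df, :col_str)

# Groups can be listed and looked up by key, much like Dict entries.
@show length(gdf)
@show keys(gdf)
@show haskey(gdf, ("a",))

# Index with a tuple (or a NamedTuple) of the grouping values.
group_a = gdf[("a",)]
@show group_a.val
```

Because groups are found by key lookup, the grouping values have to behave like dictionary keys, which is where isequal and hash come in.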
I was not aware isequal was not the same as ==, thanks.
For future reference, this is the solution:
Base.isequal(str1::Mystring, str2::Mystring) = str1.str == str2.str
Base.hash(str::Mystring, h::UInt) = hash(str.str, h)
Thanks again!
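For anyone wondering why the immutable version worked out of the box, the sketch below may help. By default, == falls back to ===, isequal falls back to ==, and hash is based on objectid. For an immutable struct all of these depend on the field contents, while for a mutable struct they depend on object identity. That suggests the piece missing from the original attempt was most likely hash, since overloading == already changes what isequal returns. ImmutableStr and MutableStr are hypothetical names used only here, and unique stands in for groupby so the example needs nothing beyond Base.

```julia
# Hypothetical types for illustration; only the `mutable` keyword differs.
struct ImmutableStr
    str::String
end

mutable struct MutableStr
    str::String
end

# Defaults: immutable values compare and hash by content,
# mutable values compare by identity.
@show isequal(ImmutableStr("a"), ImmutableStr("a"))        # true
@show hash(ImmutableStr("a")) == hash(ImmutableStr("a"))   # true
@show isequal(MutableStr("a"), MutableStr("a"))            # false

# Content-based equality and hashing for the mutable type.
# Defining == is enough for isequal here because isequal falls back to ==.
Base.:(==)(x::MutableStr, y::MutableStr) = x.str == y.str
Base.hash(x::MutableStr, h::UInt) = hash(x.str, h)

# unique relies on hash and isequal, like a Dict does.
vals = MutableStr.(["a", "b", "a", "b"])
@show length(unique(vals))                                 # 2
```

Defining isequal directly, as in the solution above, works as well; the key point is that hash has to agree with whichever equality is in effect.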