DataFrames groupby() on a column of a mutable custom type

Here is a small example:

mutable struct Mystring
    str::String
end

using DataFrames

strs = ["a","b","a","b"];
mystrs = Mystring.(strs);
df = DataFrame(col_str=strs, col_mystr=mystrs);
println(df)

println(groupby(df, :col_str))
println(groupby(df, :col_mystr))

I would expect both groupby() calls to return the same groups (g1 gets rows 1 and 3, and g2 gets rows 2 and 4). Instead, groupby(df, :col_mystr) returns 4 groups, each with a single row.

I tried overloading the simple comparison operators, but the result did not change:

Base.:(==)(str1::Mystring, str2::Mystring) = str1.str == str2.str
Base.:(>)(str1::Mystring, str2::Mystring) = str1.str > str2.str
Base.:(<)(str1::Mystring, str2::Mystring) = str1.str < str2.str

It is also worth noting that this behavior is specific to mutable structs; if Mystring is declared as an immutable struct, groupby() works.

What am I missing?

Thanks!

Very interesting! Great to see someone experimenting with grouped dataframes and custom types.

I’m glad that this works for an immutable type. Though I would have expected hash to be the thing you need to define rather than ==.

I don’t know what’s going on with mutable types. This is very interesting. I am pinging @bkamins on this.

Hopefully when we understand the behavior we can add this to the docs.


This is the standard behavior: equality is checked with isequal, not == (otherwise missing, for instance, would not be handled correctly, because == involving missing returns missing rather than a Bool). You need to define isequal, and consequently also hash, for your type.

See:

  isequal(x, y)

  Similar to ==, except for the treatment of floating point numbers and of missing values. isequal treats all floating-point NaN values as equal to each other, treats -0.0 as unequal to
  0.0, and missing as equal to missing. Always returns a Bool value.

  isequal is the comparison function used by hash tables (Dict). isequal(x,y) must imply that hash(x) == hash(y).
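
To make the difference concrete, here is a small sketch using only Base Julia. The variable names are just for illustration; the three cases are the ones the docstring mentions.

```julia
# == propagates missing, so the result is not a Bool
eq_missing = (1 == missing)
# isequal always returns a Bool and treats missing as equal to missing
iseq_missing = isequal(missing, missing)

# NaN is never == to itself, but isequal treats all NaN values as equal
eq_nan = (NaN == NaN)
iseq_nan = isequal(NaN, NaN)

# -0.0 and 0.0 compare equal with ==, but isequal distinguishes them
eq_zero = (-0.0 == 0.0)
iseq_zero = isequal(-0.0, 0.0)

println((eq_missing, iseq_missing))
println((eq_nan, iseq_nan))
println((eq_zero, iseq_zero))
```

Because a grouping key comparison has to yield true or false for every pair of values, isequal is the natural choice here.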

This is implied by this line in the groupby documentation:

GroupedDataFrame also supports the dictionary interface.

but indeed we could be explicit here.
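
The dictionary connection can be seen without DataFrames at all. The sketch below uses a plain Dict and two hypothetical types, MutStr and ImmStr (neither is from the original example), to show why the mutable and immutable cases differ. It does not reproduce groupby's internals; it only illustrates the default isequal and hash behavior that any hash-based lookup relies on.

```julia
mutable struct MutStr   # hypothetical name, mirrors Mystring from the question
    str::String
end

struct ImmStr           # hypothetical immutable counterpart
    str::String
end

# Without custom methods, isequal falls back to ==, which falls back to ===.
# For mutable structs === compares object identity;
# for immutable structs it compares the field contents.
mut_equal = isequal(MutStr("a"), MutStr("a"))
imm_equal = isequal(ImmStr("a"), ImmStr("a"))

# Count how many distinct keys a Dict sees for the same four strings
strs = ["a", "b", "a", "b"]
mut_keys = length(Dict(MutStr(s) => nothing for s in strs))
imm_keys = length(Dict(ImmStr(s) => nothing for s in strs))

println((mut_equal, mut_keys))
println((imm_equal, imm_keys))
```

Each MutStr is its own key, so the Dict ends up with four entries, while the ImmStr values collapse to two. This matches the four single-row groups reported in the question.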

I was not aware isequal was not the same as ==, thanks.

For future reference, this is the solution:

Base.isequal(str1::Mystring, str2::Mystring) = str1.str == str2.str
# UInt rather than UInt64, so the method also applies on 32-bit systems
Base.hash(str::Mystring, h::UInt) = hash(str.str, h)
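
Putting the pieces together, here is a self-contained sketch of the original example with the two methods added. It assumes the DataFrames package is installed; the name gdf is just for illustration.

```julia
using DataFrames

mutable struct Mystring
    str::String
end

# Equality used for grouping: delegate to the wrapped String
Base.isequal(a::Mystring, b::Mystring) = isequal(a.str, b.str)
# isequal(a, b) must imply hash(a) == hash(b), so hash the same field
Base.hash(s::Mystring, h::UInt) = hash(s.str, h)

strs = ["a", "b", "a", "b"]
df = DataFrame(col_str = strs, col_mystr = Mystring.(strs))

gdf = groupby(df, :col_mystr)
println(length(gdf))              # number of groups
println([nrow(g) for g in gdf])   # number of rows in each group
```

One caveat with a mutable key type: the hash now depends on the str field, so changing that field after grouping (or after using the object as a Dict key) can leave lookups inconsistent.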

Thanks again!
