Flatten in case column contains string and array

I would like clarification (or simply comments) on how the flatten function behaves on column y in the example below.

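using DataFrames

# y mixes strings and vectors; z mixes integers and one vector
df = DataFrame(x = rand(1:20, 5),
               y = ["aa", ["a1", "a2"], "bbb", [1, 2, 3], "cc"],
               z = [11, 21, [31, 32], 41, 51])
flatten(df, :z)
flatten(df, :y)
# wrap every non-array element of y in a one-element vector
df.yvec = [isa(i, Array) ? i : [i] for i in df.y]
flatten(df, :yvec)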
julia> flatten(df,:z)
6×3 DataFrame
 Row │ x      y             z     
     │ Int64  Any           Int64 
─────┼────────────────────────────
   1 │     4  aa               11
   2 │     5  ["a1", "a2"]     21
   3 │     1  bbb              31
   4 │     1  bbb              32
   5 │     3  [1, 2, 3]        41
   6 │     5  cc               51
julia> flatten(df,:y)
12×3 DataFrame
 Row │ x      y    z        
     │ Int64  Any  Any      
─────┼──────────────────────
   1 │     4  a    11
   2 │     4  a    11
   3 │     5  a1   21
   4 │     5  a2   21
   5 │     1  b    [31, 32]
   6 │     1  b    [31, 32]
   7 │     1  b    [31, 32]
   8 │     3  1    41
   9 │     3  2    41
  10 │     3  3    41
  11 │     5  c    51
  12 │     5  c    51

I would have expected a result more like the one for the yvec column:

julia> df
5×4 DataFrame
 Row │ x      y             z         yvec
     │ Int64  Any           Any       Array…       
─────┼─────────────────────────────────────────────
   1 │     4  aa            11        ["aa"]
   2 │     5  ["a1", "a2"]  21        ["a1", "a2"]
   3 │     1  bbb           [31, 32]  ["bbb"]
   4 │     3  [1, 2, 3]     41        [1, 2, 3]
   5 │     5  cc            51        ["cc"]
julia> flatten(df,:yvec)
8×4 DataFrame
 Row │ x      y             z         yvec 
     │ Int64  Any           Any       Any  
─────┼─────────────────────────────────────
   1 │     4  aa            11        aa
   2 │     5  ["a1", "a2"]  21        a1
   3 │     5  ["a1", "a2"]  21        a2
   4 │     1  bbb           [31, 32]  bbb
   5 │     3  [1, 2, 3]     41        1
   6 │     3  [1, 2, 3]     41        2
   7 │     3  [1, 2, 3]     41        3
   8 │     5  cc            51        cc

Interesting, I was also surprised by your example. I thought flatten would only flatten explicit lists like vectors and tuples. But the documentation for flatten says

When columns cols of data frame df have iterable elements that define length

And strings are iterables that define length, so they are treated as a container that can be flattened.
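To make that concrete, here is a small sketch using only Base Julia (no DataFrames needed). It just checks what `length` and iteration return for the elements of the y column from the first post; the variable names are my own.

```julia
# Base Julia only; no packages needed.
s = "bbb"
n_chars = length(s)        # 3: strings define length
chars = collect(s)         # the Chars that iterating the string yields

# Per-element lengths of the y column from the original post.
y = Any["aa", ["a1", "a2"], "bbb", [1, 2, 3], "cc"]
lens = map(length, y)      # [2, 2, 3, 3, 2]
total = sum(lens)          # 12, the row count of flatten(df, :y) above

println((n_chars, chars, lens, total))
```

So "aa" contributes two rows and "bbb" three, which is why each character shows up on its own row.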

I never thought it was a bug.
It is just that, for a column with mixed content (scalars and vectors), I would have expected different behavior.
Still, once you know how it works, you can adapt your data to your needs.
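As a sketch of that kind of adaptation, the wrapping used for yvec can be put in a small helper. `wrap_scalar` is a hypothetical name of mine, not something DataFrames.jl provides, and the loop below only imitates the flattening on a plain vector so that it runs without any package.

```julia
# Hypothetical helper (not part of DataFrames.jl):
# leave vectors alone, wrap everything else in a one-element vector.
wrap_scalar(v) = v isa AbstractVector ? v : [v]

y = Any["aa", ["a1", "a2"], "bbb", [1, 2, 3], "cc"]
yvec = [wrap_scalar(v) for v in y]

# Imitate the flattening of the wrapped column with plain Base code.
flat = Any[]
for v in yvec
    append!(flat, v)
end

println(flat)
```

With a data frame, the same helper could be used as `df.yvec = wrap_scalar.(df.y)` before calling `flatten(df, :yvec)`.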

This is probably a bug. I don’t think I would expect strings to be flattened like that. Can you file an issue with DataFrames?

This behavior is expected, as the contract is:

When columns cols of data frame df have iterable elements that define length

You would have to complain to Julia Base about why AbstractString is iterable and defines length :). This is a common issue with strings in general.

How would you propose to change things in DataFrames.jl? We could special case AbstractStrings and throw an error if you try to flatten them.
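Just to illustrate what such a special case could look like, here is a rough sketch in plain Julia. `assert_no_strings` is a hypothetical name and this is not how DataFrames.jl is implemented; it only shows the idea of rejecting string elements up front.

```julia
# Hypothetical check, not an existing DataFrames.jl function:
# refuse a column that contains strings instead of splitting them into Chars.
function assert_no_strings(col)
    for (i, v) in enumerate(col)
        if v isa AbstractString
            throw(ArgumentError("element $i is a string; wrap it in a vector before flattening"))
        end
    end
    return col
end

y = Any["aa", ["a1", "a2"], "bbb", [1, 2, 3], "cc"]
z = Any[11, 21, [31, 32], 41, 51]

# Record whether the check rejects each column.
y_rejected = try
    assert_no_strings(y)
    false
catch err
    err isa ArgumentError
end
z_ok = assert_no_strings(z) === z

println((y_rejected, z_ok))
```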


Now I see that @sijo commented the same already. This is exactly the issue that strings are collections in Julia. Actually numbers are also collections, but they are 1-element collections so you would not have a problem in this case.
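A quick check of the number case in plain Julia, as a sketch with variable names of my own choosing:

```julia
# Base Julia only. A number iterates exactly once and yields itself.
n = 11
seen = Any[]
for v in n
    push!(seen, v)
end

# Per-element lengths of the z column from the original post.
z = Any[11, 21, [31, 32], 41, 51]
zlens = map(length, z)     # [1, 1, 2, 1, 1]
ztotal = sum(zlens)        # 6, the row count of flatten(df, :z) above

println((length(n), seen, zlens, ztotal))
```

Each plain number contributes one row, so only the `[31, 32]` entry expands.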

I would expect the same result as applying flatten to the yvec column.
Although, I must admit, I have not thought through possible drawbacks from unexpected side effects in some cases.
Incidentally, I notice that broadcasting treats a value of type string as a "scalar" (whatever that formally means): '_' .* "abcde" returns "_abcde".
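Here is a short sketch of that contrast in Base Julia: broadcasting does not iterate over a string, while plain iteration does. The variable names are only for illustration.

```julia
# Broadcasting treats the string as a single value...
r1 = '_' .* "abcde"              # "_abcde"
r2 = "_" .* ["a", "b"]           # ["_a", "_b"]
r3 = length.(["aa", "bbb"])      # [2, 3]: length applied to each whole string

# ...because Base wraps strings in a Ref for broadcasting.
wrapped = Base.broadcastable("abcde") isa Ref

# Plain iteration, in contrast, walks over the characters.
chars = collect("abcde")

println((r1, r2, r3, wrapped, chars))
```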

I propose to discuss it here: Improve flatten (slightly breaking) · Issue #2767 · JuliaData/DataFrames.jl · GitHub