Flatten in case column contains string and array

rocco_sprmnt21 · May 17, 2021, 8:39am

I would like clarification (or even just comments) on how the flatten function behaves on column y in the example below.

using DataFrames
df=DataFrame(x=rand(1:20,5),y=["aa",["a1","a2"],"bbb",[1,2,3],"cc"],z=[11,21, [31,32],41,51])
flatten(df,:z)
flatten(df,:y)
df.yvec=[isa(i,Array) ? i : [i] for i in df.y]
flatten(df,:yvec)

julia> flatten(df,:z)
6×3 DataFrame
 Row │ x      y             z     
     │ Int64  Any           Int64 
─────┼────────────────────────────
   1 │     4  aa               11
   2 │     5  ["a1", "a2"]     21
   3 │     1  bbb              31
   4 │     1  bbb              32
   5 │     3  [1, 2, 3]        41
   6 │     5  cc               51

julia> flatten(df,:y)
12×3 DataFrame
 Row │ x      y    z        
     │ Int64  Any  Any      
─────┼──────────────────────
   1 │     4  a    11
   2 │     4  a    11
   3 │     5  a1   21
   4 │     5  a2   21
   5 │     1  b    [31, 32]
   6 │     1  b    [31, 32]
   7 │     1  b    [31, 32]
   8 │     3  1    41
   9 │     3  2    41
  10 │     3  3    41
  11 │     5  c    51
  12 │     5  c    51

I would have expected a result more like the one for the yvec column:

julia> df
5×4 DataFrame
 Row │ x      y             z         yvec
     │ Int64  Any           Any       Array…       
─────┼─────────────────────────────────────────────
   1 │     4  aa            11        ["aa"]
   2 │     5  ["a1", "a2"]  21        ["a1", "a2"]
   3 │     1  bbb           [31, 32]  ["bbb"]
   4 │     3  [1, 2, 3]     41        [1, 2, 3]
   5 │     5  cc            51        ["cc"]

julia> flatten(df,:yvec)
8×4 DataFrame
 Row │ x      y             z         yvec 
     │ Int64  Any           Any       Any  
─────┼─────────────────────────────────────
   1 │     4  aa            11        aa
   2 │     5  ["a1", "a2"]  21        a1
   3 │     5  ["a1", "a2"]  21        a2
   4 │     1  bbb           [31, 32]  bbb
   5 │     3  [1, 2, 3]     41        1
   6 │     3  [1, 2, 3]     41        2
   7 │     3  [1, 2, 3]     41        3
   8 │     5  cc            51        cc

sijo · May 17, 2021, 10:42am

Interesting, I was also surprised by your example. I thought flatten would only flatten explicit lists like vectors and tuples. But the documentation for flatten says

When columns cols of data frame df have iterable elements that define length…

And strings are iterables that define length, so they are treated as a container that can be flattened.
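A small sketch of that point, using only Base Julia (the variable names are just for illustration): a string reports a length and iterates character by character, exactly like a vector iterates over its elements.

```julia
s = "bbb"

# A string defines length and iterates over its characters.
n = length(s)        # 3
chars = collect(s)   # ['b', 'b', 'b']

# Base's iterator trait says the length is known up front.
trait = Base.IteratorSize(typeof(s))   # Base.HasLength()

# A vector looks the same from the point of view of the iteration interface.
v = ["a1", "a2"]
m = length(v)        # 2

println((n, chars, trait, m))
```

This matches the three `b` rows produced for "bbb" in the flatten(df,:y) output above.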

rocco_sprmnt21 · May 17, 2021, 12:32pm

I did not think it was a bug at all.
It is only that, with mixed content (scalars and vectors) in one column, I would have expected different behavior.
But it is enough to be aware of it and, if necessary, adapt the data to your needs.
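One way to adapt the data could look like the sketch below. `wrap_scalar` is a hypothetical helper name, not part of DataFrames.jl; it generalizes the `isa(i,Array) ? i : [i]` expression from the first post to any `AbstractVector`, and the example uses only Base Julia.

```julia
# Hypothetical helper (an assumption, not a DataFrames.jl function):
# leave vectors alone and wrap everything else, including strings,
# in a 1-element vector.
wrap_scalar(v) = v isa AbstractVector ? v : [v]

y = ["aa", ["a1", "a2"], "bbb", [1, 2, 3], "cc"]

# Every element is now a vector, so strings are no longer split into characters.
yvec = wrap_scalar.(y)
lens = length.(yvec)            # [1, 2, 1, 3, 1]

# Concatenating the wrapped elements gives the values a flattened column would hold.
flat_values = reduce(vcat, yvec)

println(lens)
println(flat_values)
```

Assigning the wrapped values to a column, as in `df.yvec = wrap_scalar.(df.y)`, and then calling `flatten(df, :yvec)` is the same approach as in the first post.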

pdeffebach · May 17, 2021, 1:23pm

This is probably a bug. I don’t think I would expect strings to be flattened like that. Can you file an issue with DataFrames?

bkamins · May 17, 2021, 1:40pm

This behavior is expected, as the contract is:

When columns cols of data frame df have iterable elements that define length

You would have to ask Julia Base why AbstractString is iterable and defines length. This is a common issue with strings in general.

How would you propose to change things in DataFrames.jl? We could special case AbstractStrings and throw an error if you try to flatten them.

Now I see that @sijo has already made the same point. This is exactly the issue: strings are collections in Julia. Numbers are actually collections too, but they are 1-element collections, so they cause no problem here.
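A quick Base-only check of the claim about numbers (variable names are illustrative): a number has length 1 and iterating over it yields the number itself once, which is why the scalar entries of column z survive flatten unchanged.

```julia
n = 11

# A number defines length and behaves like a 1-element collection.
len = length(n)      # 1
head = first(n)      # 11

# Iterating over it visits exactly one value: the number itself.
items = Int[]
for v in n
    push!(items, v)
end

println((len, head, items))
```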

rocco_sprmnt21 · May 17, 2021, 3:28pm

I would expect the same result as applying flatten to the yvec column.
Although, I must admit, I have not thought through possible drawbacks from unexpected side effects in some cases.
I notice incidentally that a value of type String is treated as a “scalar” (whatever that formally means) by broadcasting: '_' .* "abcde" returns "_abcde"
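The broadcasting behavior can be checked with a short Base-only sketch. `Base.broadcastable` is the documented hook that decides how a value takes part in broadcasting; for a string it returns a `Ref` wrapper, which is how the string ends up treated as a single value rather than a collection of characters.

```julia
# The string is not split into characters: the result is one string.
r1 = '_' .* "abcde"            # "_abcde"

# Broadcasting against a vector applies the operation once per element.
r2 = "_" .* ["aa", "bbb"]      # ["_aa", "_bbb"]

# Strings are wrapped in a Ref for broadcasting, i.e. treated as scalars.
wrapped = Base.broadcastable("abcde")

println(r1)
println(r2)
println(wrapped isa Ref)
```

So broadcasting and iteration disagree about strings: iteration sees a sequence of characters, while broadcasting sees a single value.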

bkamins · May 17, 2021, 4:58pm

I propose to discuss it here: Improve flatten (slightly breaking) · Issue #2767 · JuliaData/DataFrames.jl · GitHub
