Split dataframe row into multiple rows

Hi all,

I am pretty new to using DataFrames (but loving it so far), however, I could not find an obvious way to achieve the following transformation in the docs.

Suppose I have a dataframe df below:

julia> using DataFrames

julia> df = DataFrame(A=[1,2,3,4], B=["a; b", missing, "b", "a; c"])
4×2 DataFrame
 Row │ A      B       
     │ Int64  String?
─────┼────────────────
   1 │     1  a; b
   2 │     2  missing
   3 │     3  b
   4 │     4  a; c

and I want to convert it into:

julia> df2 = DataFrame(A=[1,1,2,3,4,4], B=["a", "b", missing, "b", "a","c"])
6×2 DataFrame
 Row │ A      B       
     │ Int64  String?
─────┼────────────────
   1 │     1  a
   2 │     1  b
   3 │     2  missing
   4 │     3  b
   5 │     4  a
   6 │     4  c

What is the best way to achieve this? I tried:

transform(
    df,
    :B =>
        ByRow(
            x ->
                ismissing(x) ? missing :
                string.(split(x, ";")),
        ) => :C,
)

But this just creates this:

 Row │ A      B        C
     │ Int64  String?  Array…?
─────┼─────────────────────────────
   1 │     1  a; b     ["a", " b"]
   2 │     2  missing  missing
   3 │     3  b        ["b"]
   4 │     4  a; c     ["a", " c"]

Which is not quite right. Does anybody have any ideas?

Almost there:

julia> transform!(df, :B => (x -> ismissing(x) ? [missing] : string.(split(x, ";")), ) => :C)
4×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Vector
─────┼─────────────────────────────
   1 │     1  a; b     ["a", " b"]
   2 │     2  missing  [missing]
   3 │     3  b        ["b"]
   4 │     4  a; c     ["a", " c"]

julia> flatten(df, :C)
6×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  String?
─────┼─────────────────────────
   1 │     1  a; b     a
   2 │     1  a; b      b
   3 │     2  missing  missing
   4 │     3  b        b
   5 │     4  a; c     a
   6 │     4  a; c      c
7 Likes

Awesome, thanks! This works :slight_smile:

A couple of “light” comments and a couple of questions.
For an aesthetic reason perhaps it is better to use “;” as a separator.
Did you skip the use of the ByRow! Function in @nilshg’s solution?

Question 1: What is the purpose of the string.(...) function?

Question 2: Someone explain to me how things go in the following transformations (why does line 2 disappear?):

julia> transform!(df, :B => ByRow(x -> ismissing(x) ? "" : string.(split(x, "; "))) => :C)
4×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Any
─────┼────────────────────────────
   1 │     1  a; b     ["a", "b"]
   2 │     2  missing
   3 │     3  b        ["b"]
   4 │     4  a; c     ["a", "c"]

julia> flatten(df,:C)
5×3 DataFrame
 Row │ A      B        C      
     │ Int64  String?  String
─────┼────────────────────────
   1 │     1  a; b     a
   2 │     1  a; b     b
   3 │     3  b        b
   4 │     4  a; c     a
   5 │     4  a; c     c
julia> transform!(df, :B => ByRow(x -> ismissing(x) ? "" : string.(split(x, "; "))) => :C)

Each function called here should be easy to understand with ? in the REPL.

julia> flatten(df,:C)

It should be clear why row 2 goes away: No vector, so when the data frame is flattened there is nothing to put in row 2.

also in the following cases there is no vector, but the result is different (apparently).
It seems to me that other times this doubt has arisen, about how to treat strings with the flatten function, whether as scalars or not.
I would think to extend the doubt also the case of missing, so @nilshg had to embed it in a vector to make his script work

julia> transform!(df, :B => ByRow(x -> ismissing(x) ? " " : string.(split(x, "; "))) => :C)
4×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Any        
─────┼────────────────────────────
   1 │     1  a; b     ["a", "b"]
   2 │     2  missing 
   3 │     3  b        ["b"]
   4 │     4  a; c     ["a", "c"]

julia> flatten(df,:C)
6×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Any
─────┼─────────────────────
   1 │     1  a; b     a
   2 │     1  a; b     b
   3 │     2  missing
   4 │     3  b        b
   5 │     4  a; c     a
   6 │     4  a; c     c

julia> transform!(df, :B => ByRow(x -> ismissing(x) ? "missing " : string.(split(x, "; "))) => :C)
4×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Any
─────┼────────────────────────────
   1 │     1  a; b     ["a", "b"]
   2 │     2  missing  missing
   3 │     3  b        ["b"]
   4 │     4  a; c     ["a", "c"]

julia> flatten(df,:C)
13×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Any
─────┼─────────────────────
   1 │     1  a; b     a
   2 │     1  a; b     b
   3 │     2  missing  m
   4 │     2  missing  i
   5 │     2  missing  s
   6 │     2  missing  s
   7 │     2  missing  i
   8 │     2  missing  n
   9 │     2  missing  g
  10 │     2  missing
  11 │     3  b        b
  12 │     4  a; c     a
  13 │     4  a; c     c

I think it’s pretty clear, though.

missing isn’t iterable, hence the error.

"" is iterable but has length 0, so the row is dropped.

"missing" is iterable. It flattens.

The docs seem clear about this:

When columns cols of data frame df have iterable elements that define length (for
example a Vector of Vectors), return a DataFrame where each element of each col in
cols is flattened, meaning the column corresponding to col becomes a longer vector
where the original entries are concatenated.

If you have further questions, please open a new post or file an issue with DataFrames.jl. This bikeshedding is distracting from OP’s original post asking for help.

2 Likes

I don’t think I’ve gone off topic and, in any case, a 6/7 message discussion isn’t hard to follow.
I asked several questions perhaps more pertinent than this one, but you wanted to deepen this aspect of the flatten function and I gladly accepted to discuss it.
Actually not to nit-pick, the statement “return a DataFrame where each element of each col in cols is flattened, meaning the column corresponding to col becomes a longer vector” is generally not correct.

More seriously my doubt was whether this was a generally expected result (regardless of what the documentation, more or less correctly, says).

PS
In the InMemoryDatasets package it seems to be prevented from modifying columns so that they have strings along with vectors.
Even if starting from a dataset with the mix of strings and vectors, the flatten function behaves the same way.


using  InMemoryDatasets

df = Dataset(A=[1,2,3,4], B=["a; b", missing, "b", "a; c"])

mdf=modify(df, :B=>byrow(x->!ismissing(x) ? split(x, "; ") : x)) #   x not [x] here
flatten(mdf, :B) # also works with missing and does not need [missing]

mdf=modify(df, :B=>byrow(x->!ismissing(x) ? split(x, "; ") : "")) # Cannot `convert` an object of type String to an object of type Vector{SubString{String}}        

mdf=modify(df, :B=>byrow(x->!ismissing(x) ? split(x, "; ") : [""]))

flatten(mdf, :B)

mdf=modify(df, :B=>byrow(x->!ismissing(x) ? split(x, "; ") : "missing"))


df = Dataset(A=[1,2,3,4], B=[["a","b"], "", "b", ["a","c"]])

flatten(mdf, :B)
1 Like

a little off topic, don’t u think it could be useful if flattten! function in IMD supports mapformats.then for cases like this I could define format fmt(x)=split(x,"; ") and the workflow was cleaner.

1 Like