Split dataframe row into multiple rows

Elmo · April 28, 2022, 1:56pm

Hi all,

I am pretty new to using DataFrames (but loving it so far), however, I could not find an obvious way to achieve the following transformation in the docs.

Suppose I have a dataframe df below:

julia> using DataFrames

julia> df = DataFrame(A=[1,2,3,4], B=["a; b", missing, "b", "a; c"])
4×2 DataFrame
 Row │ A      B       
     │ Int64  String?
─────┼────────────────
   1 │     1  a; b
   2 │     2  missing
   3 │     3  b
   4 │     4  a; c

and I want to convert it into:

julia> df2 = DataFrame(A=[1,1,2,3,4,4], B=["a", "b", missing, "b", "a","c"])
6×2 DataFrame
 Row │ A      B       
     │ Int64  String?
─────┼────────────────
   1 │     1  a
   2 │     1  b
   3 │     2  missing
   4 │     3  b
   5 │     4  a
   6 │     4  c

What is the best way to achieve this? I tried:

transform(
    df,
    :B =>
        ByRow(
            x ->
                ismissing(x) ? missing :
                string.(split(x, ";")),
        ) => :C,
)

But this just creates this:

 Row │ A      B        C
     │ Int64  String?  Array…?
─────┼─────────────────────────────
   1 │     1  a; b     ["a", " b"]
   2 │     2  missing  missing
   3 │     3  b        ["b"]
   4 │     4  a; c     ["a", " c"]

Which is not quite right. Does anybody have any ideas?

nilshg · April 28, 2022, 2:30pm

Almost there:

julia> transform!(df, :B => (x -> ismissing(x) ? [missing] : string.(split(x, ";")), ) => :C)
4×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Vector
─────┼─────────────────────────────
   1 │     1  a; b     ["a", " b"]
   2 │     2  missing  [missing]
   3 │     3  b        ["b"]
   4 │     4  a; c     ["a", " c"]

julia> flatten(df, :C)
6×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  String?
─────┼─────────────────────────
   1 │     1  a; b     a
   2 │     1  a; b      b
   3 │     2  missing  missing
   4 │     3  b        b
   5 │     4  a; c     a
   6 │     4  a; c      c

Elmo · April 28, 2022, 2:36pm

Awesome, thanks! This works

rocco_sprmnt21 · April 30, 2022, 7:34am

A couple of “light” comments and a couple of questions.
For an aesthetic reason perhaps it is better to use “;” as a separator.
Did you skip the use of the ByRow! Function in @nilshg’s solution?

Question 1: What is the purpose of the string.(...) function?

Question 2: Someone explain to me how things go in the following transformations (why does line 2 disappear?):

julia> transform!(df, :B => ByRow(x -> ismissing(x) ? "" : string.(split(x, "; "))) => :C)
4×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Any
─────┼────────────────────────────
   1 │     1  a; b     ["a", "b"]
   2 │     2  missing
   3 │     3  b        ["b"]
   4 │     4  a; c     ["a", "c"]

julia> flatten(df,:C)
5×3 DataFrame
 Row │ A      B        C      
     │ Int64  String?  String
─────┼────────────────────────
   1 │     1  a; b     a
   2 │     1  a; b     b
   3 │     3  b        b
   4 │     4  a; c     a
   5 │     4  a; c     c

pdeffebach · April 30, 2022, 8:04am

julia> transform!(df, :B => ByRow(x -> ismissing(x) ? "" : string.(split(x, "; "))) => :C)

Each function called here should be easy to understand with ? in the REPL.

julia> flatten(df,:C)

It should be clear why row 2 goes away: No vector, so when the data frame is flattened there is nothing to put in row 2.

rocco_sprmnt21 · April 30, 2022, 8:37am

also in the following cases there is no vector, but the result is different (apparently).
It seems to me that other times this doubt has arisen, about how to treat strings with the flatten function, whether as scalars or not.
I would think to extend the doubt also the case of missing, so @nilshg had to embed it in a vector to make his script work

julia> transform!(df, :B => ByRow(x -> ismissing(x) ? " " : string.(split(x, "; "))) => :C)
4×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Any        
─────┼────────────────────────────
   1 │     1  a; b     ["a", "b"]
   2 │     2  missing 
   3 │     3  b        ["b"]
   4 │     4  a; c     ["a", "c"]

julia> flatten(df,:C)
6×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Any
─────┼─────────────────────
   1 │     1  a; b     a
   2 │     1  a; b     b
   3 │     2  missing
   4 │     3  b        b
   5 │     4  a; c     a
   6 │     4  a; c     c

julia> transform!(df, :B => ByRow(x -> ismissing(x) ? "missing " : string.(split(x, "; "))) => :C)
4×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Any
─────┼────────────────────────────
   1 │     1  a; b     ["a", "b"]
   2 │     2  missing  missing
   3 │     3  b        ["b"]
   4 │     4  a; c     ["a", "c"]

julia> flatten(df,:C)
13×3 DataFrame
 Row │ A      B        C
     │ Int64  String?  Any
─────┼─────────────────────
   1 │     1  a; b     a
   2 │     1  a; b     b
   3 │     2  missing  m
   4 │     2  missing  i
   5 │     2  missing  s
   6 │     2  missing  s
   7 │     2  missing  i
   8 │     2  missing  n
   9 │     2  missing  g
  10 │     2  missing
  11 │     3  b        b
  12 │     4  a; c     a
  13 │     4  a; c     c

pdeffebach · April 30, 2022, 9:08am

I think it’s pretty clear, though.

missing isn’t iterable, hence the error.

"" is iterable but has length 0, so the row is dropped.

"missing" is iterable. It flattens.

The docs seem clear about this:

When columns cols of data frame df have iterable elements that define length (for
example a Vector of Vectors), return a DataFrame where each element of each col in
cols is flattened, meaning the column corresponding to col becomes a longer vector
where the original entries are concatenated.

If you have further questions, please open a new post or file an issue with DataFrames.jl. This bikeshedding is distracting from OP’s original post asking for help.

rocco_sprmnt21 · April 30, 2022, 10:45am

I don’t think I’ve gone off topic and, in any case, a 6/7 message discussion isn’t hard to follow.
I asked several questions perhaps more pertinent than this one, but you wanted to deepen this aspect of the flatten function and I gladly accepted to discuss it.
Actually not to nit-pick, the statement “return a DataFrame where each element of each col in cols is flattened, meaning the column corresponding to col becomes a longer vector” is generally not correct.

More seriously my doubt was whether this was a generally expected result (regardless of what the documentation, more or less correctly, says).

PS
In the InMemoryDatasets package it seems to be prevented from modifying columns so that they have strings along with vectors.
Even if starting from a dataset with the mix of strings and vectors, the flatten function behaves the same way.


using  InMemoryDatasets

df = Dataset(A=[1,2,3,4], B=["a; b", missing, "b", "a; c"])

mdf=modify(df, :B=>byrow(x->!ismissing(x) ? split(x, "; ") : x)) #   x not [x] here
flatten(mdf, :B) # also works with missing and does not need [missing]

mdf=modify(df, :B=>byrow(x->!ismissing(x) ? split(x, "; ") : "")) # Cannot `convert` an object of type String to an object of type Vector{SubString{String}}        

mdf=modify(df, :B=>byrow(x->!ismissing(x) ? split(x, "; ") : [""]))

flatten(mdf, :B)

mdf=modify(df, :B=>byrow(x->!ismissing(x) ? split(x, "; ") : "missing"))


df = Dataset(A=[1,2,3,4], B=[["a","b"], "", "b", ["a","c"]])

flatten(mdf, :B)

monopolynomial · May 1, 2022, 1:16am

a little off topic, don’t u think it could be useful if flattten! function in IMD supports mapformats.then for cases like this I could define format fmt(x)=split(x,"; ") and the workflow was cleaner.

Topic		Replies	Views
Transform DataFrame into a single row by adding row contents as new columns New to Julia dataframes	3	314	May 1, 2021
How to explode a DataFrame cell containing multiple values into separate rows? General Usage	7	1109	July 7, 2020
[DataFrames Question]: How to convert single column with row of dictionary to multiple columns Specific Domains question , dataframes	4	516	May 14, 2022
Create DataFrame from data in DataFrameRow Data dataframesmeta	7	178	November 11, 2024
DataFrame transform with many output columns General Usage dataframes	1	270	March 28, 2022

Split dataframe row into multiple rows

Related topics