Separating a column into a variable number of possible columns

Hi there.

Am trying to replicate some R code based on a movies dataset (available here) using Julia's data wrangling utilities.

What am trying to achieve is to separate/split a column into two or more columns depending on how many commas there are in the cell value. If the cell value has two commas it is going to fill in 3 new columns and if only one comma, then it would fill 2 new columns while the third column will have a missing values.

The desired result is given in R-tidyverse with the following code:

movies <- movies %>%  
  separate(col = genre,
           into = c("genre1", "genre2", "genre3"),
           sep = ",")

I know that there had been a discussion on the issue and that @bkamins posted on his blog about it here. I tried to use the ByRow in combination with select() as you can see here, but it does not seem to work:

@pipe movies |>
    select(_, :genre => ByRow(x -> split(x, ",")) => [:genre1, :genre2, :genre3], [:genre])

Any help would be greatly appreciated!


Just need to use transform instead of select.

Thanks for the reply! I tried it and got the same error message:

@pipe movies |>
           transform(_, :genre => ByRow(x -> split(x, ",")) => [:genre1, :genre2, :genre3], [:genre])
ERROR: ArgumentError: keys of the returned elements must be identical
 [1] _expand_to_table(res::Vector{Vector{SubString{String}}})
   @ DataFrames ~/.julia/packages/DataFrames/3mEXm/src/abstractdataframe/selection.jl:380
 [2] select_transform!(nc::Union{Function, Pair{var"#s267", var"#s266"} where {var"#s267"<:Union{Int64, AsTable, AbstractVector{Int64}}, var"#s266"<:(Pair{var"#s164", var"#s163"} where {var"#s164"<:Union{Function, Type}, var"#s163"<:Union{DataType, Symbol, AbstractVector{Symbol}}})}, Type}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool})
   @ DataFrames ~/.julia/packages/DataFrames/3mEXm/src/abstractdataframe/selection.jl:522
 [3] _manipulate(df::DataFrame, normalized_cs::Any, copycols::Bool, keeprows::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/3mEXm/src/abstractdataframe/selection.jl:1279
 [4] manipulate(::DataFrame, ::Any, ::Vararg{Any, N} where N; copycols::Bool, keeprows::Bool, renamecols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/3mEXm/src/abstractdataframe/selection.jl:1209
 [5] #select#384
   @ ~/.julia/packages/DataFrames/3mEXm/src/abstractdataframe/selection.jl:847 [inlined]
 [6] #transform#386
   @ ~/.julia/packages/DataFrames/3mEXm/src/abstractdataframe/selection.jl:913 [inlined]
 [7] transform(::DataFrame, ::Any, ::Any)
   @ DataFrames ~/.julia/packages/DataFrames/3mEXm/src/abstractdataframe/selection.jl:913
 [8] top-level scope
   @ REPL[79]:1 

also remove that and try? it works for me

Did you test with a case where not all strings have 3 genres?

I think you need something like this:

df = DataFrame(genre=["a,b,c", "a,b"])

select(df, :genre =>
           ByRow(x -> get.(Ref(split(x, ',')), 1:3, missing)) =>
           [:genre1, :genre2, :genre3])
1 Like

Unfortunately, it did not work either…However, the solution by @sudete works!

1 Like

Thanks a lot! Works like a charm!

You’re welcome!

And for reference here’s a more general version, that also works when the maximum number of genres is different from 3 (this one doesn’t use ByRow since it needs the whole column to find the correct maximum number of genres):

df = DataFrame(genre=["a,b,c", "a,b"])

function split_uniformly(v)
    s = split.(v, ',')
    n = maximum(length.(s))
    [NamedTuple(Symbol.("genre", 1:n) .=> get.(Ref(genres), 1:n, missing))
     for genres in s]

julia> select(df, :genre => split_uniformly => AsTable)
2×3 DataFrame
 Row │ genre1     genre2     genre3     
     │ SubStrin…  SubStrin…  SubStrin…? 
   1 │ a          b          c
   2 │ a          b          missing   
1 Like

This can be done with DataFramesMeta relatively easily as well

@chain df begin
    @rtransform $[:genre1, :genre2, :genre3] = split(:genres, ",")

Though I guess this will fail with your exact example. To make it work I think you need the example above

@chain df begin
    @rtransform $[:genre1, :genre2, :genre3] = begin
        x = split(:genres, ",")
        get.(Ref(x), 1:3, missing)

Thanks a lot! I guess with great power comes great responsibility!

We made expecting a constant number of columns produced decision on purpose (to catch bugs in the code). Do you think being flexible would be better and if so why?

1 Like

It is a very common use case scenario in the humanities to read in a spreadsheet file that has one or more columns with free text answers to a questionnaire. For instance, for the question “which languages do you speak?” we get responses with variable answers, e.g. some people answer “German, French” and others respond with “German, French, Greek” while others with just “English”. The ideal is to be able to transform the responses in a way that the answers that are “stored” in a single column are spread into separate columns with second, third, fourth language etc. while leaving missing values for those who only speak one language.
I hope I was clear enough.

PS.: By the way, thanks a lot for the effort you have been putting on improving DataFrames.jl and the tutorial this summer!

1 Like

Would it be ok to keep a column of vectors. Then count the unique languages in the vectors, then just expand them out using a function with a fixed set of languages? That feels a more natural way to solve this in a readable way.

1 Like

yes - either what @xiaodai describes is natural or keep a column of vectors (without expanding it into multiple columns). The problem with expansion into undefined number of generic columns is that later it is not fully clear how such data frame should be used.


Thanks to both you and @xiaodai ! The solutions you suggest may indeed be more natural to follow. It’s the R instincts that came out immediately to solve such a problem.
The other thing is that the ranking of the languages play a role. People who prefer to write “English, German” are supposed to be different from people that respond with “German, English”. However, I suppose I could further manipulate the column of vectors to keep that in mind.
Thanks again!

1 Like

just for fun

vcat(DataFrame.(permutedims.(split.(df1.genre,",")),:auto)..., cols=:union)

1 Like

This is a good point. In vcat, append! and push! we allow variable number of columns per item.

1 Like

Thanks! Really cool! I get the logic, but I guess I will have to see exactly what permutedims() does, since I didn’t know it.

I wonder if in a situation like df2, the expected result is the following and not the one that derives from the previous solutions …

df2 = DataFrame(genre=["a,b,c", "a,b", "a,c","b,d"])

## new split_uni

julia> select(df2, :genre => split_uniformly => AsTable)
4×4 DataFrame
 Row │ a           b           c           d
     │ SubStrin…?  SubStrin…?  SubStrin…?  SubStrin…? 
   1 │ a           b           c           missing    
   2 │ a           b           missing     missing    
   3 │ a           missing     c           missing    
   4 │ missing     b           missing     d

1 Like

permutedims transforms a column vector into a row vector (== matrix).
i tried to use the transpose function or the ’ but i didn’t get what i wanted

1 Like