I am trying to replicate some R code based on a movies dataset (available here) using Julia's data-wrangling utilities.
What I am trying to achieve is to separate/split a column into two or more columns depending on how many commas the cell value contains. If the cell value has two commas, it should fill three new columns; if it has only one comma, it should fill two new columns and the third column should get a missing value.
The desired result is given in R-tidyverse with the following code:
movies <- movies %>%
  separate(col = genre,
           into = c("genre1", "genre2", "genre3"),
           sep = ",")
I know that there has been a discussion on this issue and that @bkamins posted about it on his blog here. I tried to use ByRow in combination with select(), as you can see here, but it does not seem to work:
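This is roughly the kind of call I tried (I am reconstructing it from memory, so the exact destination column names are just my choice):

using DataFrames

df = DataFrame(genre=["a,b,c", "a,b"])

# split each cell on commas and send the pieces to three fixed columns
select(df, :genre => ByRow(x -> split(x, ',')) => [:genre1, :genre2, :genre3])
# this errors on the second row, which produces only two values
# while three destination columns are expected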
And for reference, here's a more general version that also works when the maximum number of genres is different from 3 (this one doesn't use ByRow, since it needs the whole column to find the correct maximum number of genres):
using DataFrames

df = DataFrame(genre=["a,b,c", "a,b"])

function split_uniformly(v)
    s = split.(v, ',')                  # split each cell on commas
    n = maximum(length.(s))             # the widest row sets the number of columns
    [NamedTuple(Symbol.("genre", 1:n) .=> get.(Ref(genres), 1:n, missing))
     for genres in s]
end
julia> select(df, :genre => split_uniformly => AsTable)
2×3 DataFrame
 Row │ genre1     genre2     genre3
     │ SubStrin…  SubStrin…  SubStrin…?
─────┼───────────────────────────────────
   1 │ a          b          c
   2 │ a          b          missing
We made the decision that a constant number of columns must be produced on purpose (to catch bugs in the code). Do you think being flexible would be better, and if so, why?
Hi!
It is a very common scenario in the humanities to read in a spreadsheet file that has one or more columns with free-text answers to a questionnaire. For instance, for the question “which languages do you speak?” we get responses of variable length, e.g. some people answer “German, French”, others respond with “German, French, Greek”, while others answer just “English”. Ideally, the answers that are “stored” in a single column would be spread into separate columns for the second, third, fourth language, etc., while leaving missing values for those who only speak one language.
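Concretely, something like this sketch is what I would like to end up with, reusing the idea of split_uniformly from above (the answers are made up and the column names language1, language2, … are just my choice):

using DataFrames

answers = DataFrame(response=["German, French", "German, French, Greek", "English"])

function split_languages(v)
    s = [strip.(split(r, ',')) for r in v]   # split each answer and trim spaces
    n = maximum(length.(s))                  # the longest answer sets the width
    [NamedTuple(Symbol.("language", 1:n) .=> get.(Ref(langs), 1:n, missing))
     for langs in s]
end

select(answers, :response => split_languages => AsTable)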
I hope I was clear enough.
PS: By the way, thanks a lot for the effort you have been putting into improving DataFrames.jl and the tutorial this summer!
Would it be OK to keep a column of vectors? Then count the unique languages in the vectors and expand them out using a function with a fixed set of languages. That feels like a more natural and readable way to solve this.
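Something like this rough sketch (the column and language names are only examples):

using DataFrames

df = DataFrame(answer=["German, French", "German, French, Greek", "English"])

# keep the split answers as a column of vectors
df.languages = [strip.(split(a, ',')) for a in df.answer]

# expand against a fixed, known set of languages as indicator columns
known = ["English", "German", "French", "Greek"]
for lang in known
    df[!, lang] = [lang in v for v in df.languages]
end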
Yes - either what @xiaodai describes is natural, or keep a column of vectors (without expanding it into multiple columns). The problem with expansion into an undefined number of generic columns is that later it is not fully clear how such a data frame should be used.
Thanks to both you and @xiaodai! The solutions you suggest may indeed be more natural to follow. It's my R instincts that kicked in immediately to solve such a problem.
The other thing is that the ranking of the languages plays a role. People who prefer to write “English, German” are assumed to be different from people who respond with “German, English”. However, I suppose I could further manipulate the column of vectors to keep that in mind, as sketched below.
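For example, continuing the column-of-vectors sketch above, the position in the vector already encodes the rank, so the order of the answers is preserved:

# first-choice language per respondent (assumes the df.languages column from above)
df.first_language = first.(df.languages)

# rank of a given language in a response, or nothing if it was not listed
rank_of(lang, v) = findfirst(==(lang), v)
rank_of("German", df.languages[1])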
Thanks again!
I wonder whether, in a situation like df2 below, the expected result is the following rather than the one that derives from the previous solutions …
df2 = DataFrame(genre=["a,b,c", "a,b", "a,c", "b,d"])
## new split_uni
julia> select(df2, :genre => split_uniformly => AsTable)
4×4 DataFrame
 Row │ a           b           c           d
     │ SubStrin…?  SubStrin…?  SubStrin…?  SubStrin…?
─────┼─────────────────────────────────────────────────
   1 │ a           b           c           missing
   2 │ a           b           missing     missing
   3 │ a           missing     c           missing
   4 │ missing     b           missing     d
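If that is indeed the expected result, one possible implementation of such a “new split_uni” could look like this sketch (my own guess, giving every distinct genre its own column):

using DataFrames

function split_by_value(v)
    s = split.(v, ',')
    cols = sort(unique(reduce(vcat, s)))   # every distinct genre becomes a column
    [NamedTuple(Symbol.(cols) .=> [c in row ? c : missing for c in cols])
     for row in s]
end

select(df2, :genre => split_by_value => AsTable)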