I am trying to replicate some R code based on a movies dataset (available here) using Julia's data-wrangling utilities.
What I am trying to achieve is to separate/split a column into two or more columns depending on how many commas the cell value contains. If the cell value has two commas, it should fill three new columns; if it has only one comma, it should fill two new columns and the third column should get a missing value.
The desired result is given in R-tidyverse with the following code:
movies <- movies %>%
  separate(col = genre,
           into = c("genre1", "genre2", "genre3"),
           sep = ",")
I know that there has been a discussion on this issue and that @bkamins posted about it on his blog here. I tried to use ByRow in combination with select(), as you can see here, but it does not seem to work:
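This is roughly the kind of call I tried (I am reconstructing it from memory, so the exact destination column names are just my choice):

using DataFrames

df = DataFrame(genre=["a,b,c", "a,b"])

# split each cell on commas and send the pieces to three fixed columns
select(df, :genre => ByRow(x -> split(x, ',')) => [:genre1, :genre2, :genre3])
# this errors on the second row, which produces only two values
# while three destination columns are expected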
And for reference, here's a more general version that also works when the maximum number of genres is different from 3 (this one doesn't use ByRow, since it needs the whole column to find the correct maximum number of genres):
using DataFrames

df = DataFrame(genre=["a,b,c", "a,b"])

function split_uniformly(v)
    s = split.(v, ',')                  # split each cell on commas
    n = maximum(length.(s))             # the widest row sets the number of columns
    [NamedTuple(Symbol.("genre", 1:n) .=> get.(Ref(genres), 1:n, missing))
     for genres in s]
end
julia> select(df, :genre => split_uniformly => AsTable)
2×3 DataFrame
 Row │ genre1     genre2     genre3
     │ SubStrin…  SubStrin…  SubStrin…?
─────┼───────────────────────────────────
   1 │ a          b          c
   2 │ a          b          missing
We made the decision that a constant number of columns must be produced on purpose (to catch bugs in the code). Do you think being flexible would be better, and if so, why?
Hi!
It is a very common scenario in the humanities to read in a spreadsheet file that has one or more columns with free-text answers to a questionnaire. For instance, for the question “which languages do you speak?” we get responses of variable length, e.g. some people answer “German, French”, others respond with “German, French, Greek”, while others answer just “English”. Ideally, the answers that are “stored” in a single column would be spread into separate columns for the second, third, fourth language, etc., while leaving missing values for those who only speak one language.
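Concretely, something like this sketch is what I would like to end up with, reusing the idea of split_uniformly from above (the answers are made up and the column names language1, language2, … are just my choice):

using DataFrames

answers = DataFrame(response=["German, French", "German, French, Greek", "English"])

function split_languages(v)
    s = [strip.(split(r, ',')) for r in v]   # split each answer and trim spaces
    n = maximum(length.(s))                  # the longest answer sets the width
    [NamedTuple(Symbol.("language", 1:n) .=> get.(Ref(langs), 1:n, missing))
     for langs in s]
end

select(answers, :response => split_languages => AsTable)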
I hope I was clear enough.
PS: By the way, thanks a lot for the effort you have been putting into improving DataFrames.jl and the tutorial this summer!
Would it be OK to keep a column of vectors? Then count the unique languages in the vectors and expand them out using a function with a fixed set of languages. That feels like a more natural and readable way to solve this.
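Something like this rough sketch (the column and language names are only examples):

using DataFrames

df = DataFrame(answer=["German, French", "German, French, Greek", "English"])

# keep the split answers as a column of vectors
df.languages = [strip.(split(a, ',')) for a in df.answer]

# expand against a fixed, known set of languages as indicator columns
known = ["English", "German", "French", "Greek"]
for lang in known
    df[!, lang] = [lang in v for v in df.languages]
end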
Yes - either what @xiaodai describes is natural, or keep a column of vectors (without expanding it into multiple columns). The problem with expansion into an undefined number of generic columns is that later it is not fully clear how such a data frame should be used.
Thanks to both you and @xiaodai! The solutions you suggest may indeed be more natural to follow. It's my R instincts that kicked in immediately to solve such a problem.
The other thing is that the ranking of the languages plays a role. People who prefer to write “English, German” are assumed to be different from people who respond with “German, English”. However, I suppose I could further manipulate the column of vectors to keep that in mind, as sketched below.
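For example, continuing the column-of-vectors sketch above, the position in the vector already encodes the rank, so the order of the answers is preserved:

# first-choice language per respondent (assumes the df.languages column from above)
df.first_language = first.(df.languages)

# rank of a given language in a response, or nothing if it was not listed
rank_of(lang, v) = findfirst(==(lang), v)
rank_of("German", df.languages[1])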
Thanks again!
I wonder whether, in a situation like df2 below, the expected result is the following rather than the one that derives from the previous solutions …
df2 = DataFrame(genre=["a,b,c", "a,b", "a,c", "b,d"])
## new split_uni
julia> select(df2, :genre => split_uniformly => AsTable)
4×4 DataFrame
 Row │ a           b           c           d
     │ SubStrin…?  SubStrin…?  SubStrin…?  SubStrin…?
─────┼─────────────────────────────────────────────────
   1 │ a           b           c           missing
   2 │ a           b           missing     missing
   3 │ a           missing     c           missing
   4 │ missing     b           missing     d
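If that is indeed the expected result, one possible implementation of such a “new split_uni” could look like this sketch (my own guess, giving every distinct genre its own column):

using DataFrames

function split_by_value(v)
    s = split.(v, ',')
    cols = sort(unique(reduce(vcat, s)))   # every distinct genre becomes a column
    [NamedTuple(Symbol.(cols) .=> [c in row ? c : missing for c in cols])
     for row in s]
end

select(df2, :genre => split_by_value => AsTable)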