Hi. I am trying to create a function that replicates the unnest_tokens() function from the tidytext R package. The version I came up with is the following:
using DataFrames, TextAnalysis

function unnest_tokens3(df::DataFrame, input_col::Symbol, output_col::Symbol)
    tokenized = StringDocument.(df[!, input_col])
    df = select(df, Not(input_col))
    token_list = tokens.(tokenized)
    flat_token_list = vcat(token_list...)
    # Each original row must be repeated once per token it produced.
    repeat_lengths = length.(token_list)
    repeat_indices = Vector{Int}(undef, sum(repeat_lengths))
    counter = 1
    for i in eachindex(repeat_lengths)
        repeat_indices[counter:repeat_lengths[i] + (counter - 1)] .= i
        counter += repeat_lengths[i]
    end
    # Materialize a copy rather than a view: with repeated parent rows, a
    # view cannot hold a distinct token in each repeated row.
    new_df = df[repeat_indices, :]
    new_df[!, output_col] = flat_token_list
    return new_df
end
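For reference, here is an untimed sketch of a shorter variant that leans on DataFrames.jl's flatten instead of building repeat_indices by hand (unnest_tokens_flat is just a placeholder name; it assumes the same TextAnalysis.jl StringDocument/tokens pipeline):

# Alternative sketch: let DataFrames.flatten repeat the rows instead of
# filling repeat_indices manually. Not benchmarked.
function unnest_tokens_flat(df::DataFrame, input_col::Symbol, output_col::Symbol)
    # Tokenize each row's text into a vector of tokens.
    out = transform(df, input_col => ByRow(s -> tokens(StringDocument(s))) => output_col)
    select!(out, Not(input_col))
    # flatten repeats every other column once per token in output_col.
    return flatten(out, output_col)
end

flatten(out, output_col) does exactly what the manual loop above does: it repeats each remaining column once per token in that row.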
I could definitely do a bit better with respect to code optimization, but I am not sure there is a version that could beat the original R function. Or is there?
The data I used to check both versions is downloaded from a GitHub repo:

using CSV, Downloads
netflix_titles = CSV.read(Downloads.download("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv"), DataFrame);

The Julia function call is unnest_tokens3(netflix_titles, :description, :word); the corresponding R call is unnest_tokens(netflix_titles, word, description).
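As a quick shape check (not part of the benchmark, just what the code implies), the call should return one row per token, with the :description column replaced by a :word column:

out = unnest_tokens3(netflix_titles, :description, :word)
names(out)   # original columns minus "description", plus "word"
nrow(out)    # one row per token across all descriptions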
I @btimed both functions, tidytext's and mine, on the same data, and tidytext's version is more than 10 times faster:
julia> @btime R"unnest_tokens($netflix_titles, word, description)";
  90.973 ms (39119 allocations: 1023.95 KiB)

julia> @btime unnest_tokens3(netflix_titles, :description, :word);
  1.077 s (3685252 allocations: 199.77 MiB)
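One caveat about my own timing: BenchmarkTools recommends interpolating global variables with $, which I did for the R call but not for the Julia one. I would not expect it to close a 10x gap, but the fairer invocation would be:

julia> @btime unnest_tokens3($netflix_titles, :description, :word);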
Any ideas about what could be improved or changed, or anything that would help me understand why R's function is so much faster, would be greatly appreciated.
Thanks!