Creating tidytext's unnest_tokens() in Julia and speed

Alex_Tantos · June 19, 2023, 11:15pm

Hi.

I am trying to create a function that replicates the unnest_tokens() function in the tidytext R package. The version I came up with is the following:

function unnest_tokens3(df::DataFrame, input_col::Symbol, output_col::Symbol)
    tokenized = StringDocument.(df[!, input_col])
    df = select(df, Not(input_col))
    token_list = tokens.(tokenized)

    flat_token_list = vcat(token_list...)
    repeat_lengths = length.(token_list)

    repeat_indices = Vector{Int}(undef, sum(repeat_lengths))
    counter = 1
    for i in 1:length(repeat_lengths)
        repeat_indices[counter:repeat_lengths[i]+(counter-1)] .= i
        counter = repeat_lengths[i]+(counter)
    end

    new_df = @view df[repeat_indices, :]
    new_df[!, output_col] = flat_token_list

    return new_df
end

I could definitely do a bit better wrt code optimization, but I am not sure there is a version that could beat the original R function. Or is it?

The data I used to check both versions is downloaded from a github repo:

netflix_titles = CSV.read(Downloads.download("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv"), DataFrame);

And the function call is: unnest_tokens3(netflix_titles, :description, :word). The corresponding R function call is unnest_tokens(netflix_titles, word, description).

I @btimed both functions, tidytext’s and mine, on these same data and tidytext’s version is >10 times faster:

julia> @btime R"unnest_tokens($netflix_titles, word, description)";
  90.973 ms (39119 allocations: 1023.95 KiB)

julia> @btime unnest_tokens3(netflix_titles, :description, :word);
  1.077 s (3685252 allocations: 199.77 MiB)

Any idea as to what can be improved/changed or at least help me understand why R’s function is so much better would be greatly appreciated.

Thanks!

Oscar_Smith · June 19, 2023, 11:22pm

One easy improvement is reduce(vcat, token_list) instead of vcat(token_list...).
Other than that, you would need to profile to see where the time is being spent.

mrufsvold · June 20, 2023, 2:13am

Looks like the allocations lie mostly with TextAnalysis.jl (I’m assuming that’s your dependency for StringDocument and tokens.

julia> just_tokenize_old(input_col) = tokenize.(input_col)
julia> @btime just_tokenize_old.($(netflix_titles.description))
  1.046 s (3592788 allocations: 193.39 MiB)

Without finding or writing a more efficient tokenization algorithm, fiddling around with your code isn’t really going to make a dent. I tried using WordTokenizers.jl:

using WordTokenizers
just_tokenize_nltk(input_col) = nltk_word_tokenize.(input_col)
@btime just_tokenize_nltk.($(netflix_titles.description))
  183.493 ms (275786 allocations: 16.32 MiB)

function unnest_tokens4(df::DataFrame, input_col::Symbol, output_col::Symbol)
    token_list = nltk_word_tokenize.(df[!, input_col])
    df = select(df, Not(input_col))

    flat_token_list = reduce(vcat, token_list)
    repeat_lengths = length.(token_list)

    repeat_indices = Vector{Int}(undef, sum(repeat_lengths))
    counter = 1
    @inbounds for i in eachindex(repeat_lengths)
        repeat_indices[counter:repeat_lengths[i]+(counter-1)] .= i
        counter += repeat_lengths[i]
    end

    new_df = @view df[repeat_indices, :]
    new_df[!, output_col] = flat_token_list

    return new_df
end

@btime unnest_tokens4($netflix_titles, $(:description), $(:word))
  193.152 ms (352671 allocations: 22.01 MiB)

I can’t test the R solution on my computer, so I don’t know how it compares directly, but I got almost the exact same timing for unnest_tokens3, so I’d assume that you’ll still be about 2 times slower with Julia.

It’s still the case that tokenization is the main driver of the allocations here. I wish there was an API for a token Generator where you could preallocate an empty String[], give it a sizehint!, and then push! each successive token that is found in the input string. That way we wouldn’t have to allocate all these vectors of tokens. Better yet, the generator could return SubString{String} views of the input so we wouldn’t need new strings for every token.

Alex_Tantos · June 20, 2023, 6:16am

Thanks! I had tried reduce(vcat, token_list), but it did not seem to make much of a difference. As far as I could see and interpret the profiling output, the tokenization algorithm creates the biggest part of the delay and the allocations, as @mrufsvold also mentioned in his helpful response.

Alex_Tantos · June 20, 2023, 6:19am

Thank you very much for your helpful respose! Now, comparing the two I end up having only a ~2 times faster R function compared to the Julia function you improved and added in your response. I will not mark your response as Solution for now, just in case, somebody else would like to add something on the API you mention.

julia> @btime unnest_tokens4(netflix_titles, :description, :word);
  206.538 ms (352671 allocations: 22.01 MiB)

julia> @btime R"unnest_tokens($netflix_titles, word, description)";
  90.575 ms (39119 allocations: 1023.95 KiB)

Alex_Tantos · June 20, 2023, 7:58am

UPDATE 1 : If we choose to use the punctuation_space_tokenize() function over nltk_word_tokenize(), we observe a significant improvement in speed. This change makes the Julia function almost twice as fast compared to the R version! The specific function “Tokenizes by removing punctuation, unless it occurs inside of a word.” and this behavior precisely mirrors what the R unnest_tokens() function does.

UPDATE 2 : The use of @view in the new_df = @view df[repeat_indices, :] creates problems and has to go before the function works as expected.

So far, the Julia version of the funciton is as follows:

function unnest_tokens4(df::DataFrame, input_col::Symbol, output_col::Symbol)
    token_list = punctuation_space_tokenize.(df[!, input_col])
    df = select(df, Not(input_col))

    flat_token_list = reduce(vcat, token_list)
    repeat_lengths = length.(token_list)

    repeat_indices = Vector{Int}(undef, sum(repeat_lengths))
    counter = 1
    @inbounds for i in eachindex(repeat_lengths)
        repeat_indices[counter:repeat_lengths[i]+(counter-1)] .= i
        counter += repeat_lengths[i]
    end

    new_df = df[repeat_indices, :]
    new_df[!, output_col] = flat_token_list

    return new_df
end

And here the @btimes:

julia> @btime unnest_tokens4(netflix_titles, :description, :word);
  64.411 ms (162586 allocations: 40.57 MiB)

julia> @btime R"unnest_tokens($netflix_titles, word, description)";
  100.188 ms (39119 allocations: 1023.95 KiB)

However, I still believe there is room for improvement wrt the tokenization API that @mrufsvold mentions and is related to the other much slower (and possibly more complex) tokenization functions related to StringDouments that belong to the TextAnalysis package. Since, tokenization is crucial for text processing and NLP in general, I would be thrilled to see any opinion on the issue in the very next days, before I mark the topic as solved by @mrufsvold.

Topic		Replies	Views
Writing a fast nlp tokenizer in Julia Performance nlp	12	2387	February 2, 2021
Text Mining: Detect Strings: Word Lookup in a Large Corpus of Phrases Using a Large Dictionary Performance question	27	2199	December 15, 2021
Parse a string using multiple delimiters New to Julia	1	3436	July 22, 2017
Things that are easier in Julia than Python/R etc Community python , r	60	6999	October 17, 2021
A living post of Julia vs R's data manipulation tasks speeds Data data	21	7777	August 27, 2021

Creating tidytext's unnest_tokens() in Julia and speed

Related topics