Creating tidytext's unnest_tokens() in Julia and speed

Hi.

I am trying to create a function that replicates the unnest_tokens() function in the tidytext R package. The version I came up with is the following:

function unnest_tokens3(df::DataFrame, input_col::Symbol, output_col::Symbol)
    # Wrap each row's text in a StringDocument and drop the original text column
    tokenized = StringDocument.(df[!, input_col])
    df = select(df, Not(input_col))
    token_list = tokens.(tokenized)

    # Flatten all tokens and record how many tokens each row produced
    flat_token_list = vcat(token_list...)
    repeat_lengths = length.(token_list)

    # Repeat row index i once per token produced by row i
    repeat_indices = Vector{Int}(undef, sum(repeat_lengths))
    counter = 1
    for i in 1:length(repeat_lengths)
        repeat_indices[counter:repeat_lengths[i]+(counter-1)] .= i
        counter = repeat_lengths[i] + counter
    end

    new_df = @view df[repeat_indices, :]
    new_df[!, output_col] = flat_token_list

    return new_df
end

I could definitely do a bit better with regard to code optimization, but I am not sure there is a version that could beat the original R function. Or is there?

The data I used to check both versions is downloaded from a GitHub repo:

netflix_titles = CSV.read(Downloads.download("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv"), DataFrame);

And the function call is: unnest_tokens3(netflix_titles, :description, :word). The corresponding R function call is unnest_tokens(netflix_titles, word, description).
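
For reference, the snippets in this post roughly assume a setup along these lines (the package list is my reconstruction, not copied verbatim from my session):

# Packages assumed by the snippets in this post (reconstructed, not an exact copy of my session):
using CSV, DataFrames, Downloads   # reading the CSV into a DataFrame
using TextAnalysis                 # StringDocument and tokens
using BenchmarkTools               # @btime
using RCall                        # the R"..." string macro used to call tidytext from Julia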

I @btimed both functions, tidytext’s and mine, on the same data, and tidytext’s version is more than 10 times faster:

julia> @btime R"unnest_tokens($netflix_titles, word, description)";
  90.973 ms (39119 allocations: 1023.95 KiB)
julia> @btime unnest_tokens3(netflix_titles, :description, :word);
  1.077 s (3685252 allocations: 199.77 MiB)

Any ideas about what could be improved or changed, or any help understanding why R’s function is so much faster, would be greatly appreciated.

Thanks!

One easy improvement is reduce(vcat, token_list) instead of vcat(token_list...).
Other than that, you would need to profile to see where the time is being spent.
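
As a minimal sketch (assuming token_list is the vector of token vectors from your function):

# Splatting builds a fresh argument list and is slow for long collections:
flat_token_list = vcat(token_list...)

# reduce(vcat, ...) dispatches to a specialized method for vectors of vectors:
flat_token_list = reduce(vcat, token_list)

# To see where the rest of the time goes, profile one call, e.g.:
using Profile
@profile unnest_tokens3(netflix_titles, :description, :word)
Profile.print()   # or use a flame-graph viewer such as ProfileView.jl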


Looks like the allocations lie mostly with TextAnalysis.jl (I’m assuming that’s your dependency for StringDocument and tokens).

julia> just_tokenize_old(input_col) = tokenize.(input_col)
julia> @btime just_tokenize_old($(netflix_titles.description))
  1.046 s (3592788 allocations: 193.39 MiB)

Without finding or writing a more efficient tokenization algorithm, fiddling around with your code isn’t really going to make a dent. I tried using WordTokenizers.jl:

using WordTokenizers
just_tokenize_nltk(input_col) = nltk_word_tokenize.(input_col)
@btime just_tokenize_nltk($(netflix_titles.description))
  183.493 ms (275786 allocations: 16.32 MiB) 
function unnest_tokens4(df::DataFrame, input_col::Symbol, output_col::Symbol)
    token_list = nltk_word_tokenize.(df[!, input_col])
    df = select(df, Not(input_col))

    flat_token_list = reduce(vcat, token_list)
    repeat_lengths = length.(token_list)

    repeat_indices = Vector{Int}(undef, sum(repeat_lengths))
    counter = 1
    @inbounds for i in eachindex(repeat_lengths)
        repeat_indices[counter:repeat_lengths[i]+(counter-1)] .= i
        counter += repeat_lengths[i]
    end

    new_df = @view df[repeat_indices, :]
    new_df[!, output_col] = flat_token_list

    return new_df
end
@btime unnest_tokens4($netflix_titles, $(:description), $(:word))
  193.152 ms (352671 allocations: 22.01 MiB)

I can’t test the R solution on my computer, so I don’t know how it compares directly, but I got almost exactly the same timing for unnest_tokens3, so I’d assume the Julia version will still be about 2 times slower.

It’s still the case that tokenization is the main driver of the allocations here. I wish there was an API for a token Generator where you could preallocate an empty String[], give it a sizehint!, and then push! each successive token that is found in the input string. That way we wouldn’t have to allocate all these vectors of tokens. Better yet, the generator could return SubString{String} views of the input so we wouldn’t need new strings for every token.
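
To make the idea concrete, here is a rough sketch of what such an interface could look like, with a naive whitespace split standing in for a real tokenizer (tokenize_into! is a hypothetical name, not an existing WordTokenizers.jl function; eachsplit needs Julia ≥ 1.8):

# Hypothetical sketch, not an existing API: push SubString views of the input
# into a caller-supplied buffer instead of allocating a Vector{String} per document.
function tokenize_into!(out::Vector{SubString{String}}, s::String)
    sizehint!(out, length(out) + count(isspace, s) + 1)  # rough upper bound on token count
    for token in eachsplit(s)    # lazy whitespace split; yields SubString{String} views
        push!(out, token)
    end
    return out
end

# Usage: one flat, reusable token buffer for a whole column of documents.
out = SubString{String}[]
for doc in ["A quick brown fox.", "Jumps over the lazy dog."]
    tokenize_into!(out, doc)
end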


Thanks! I had tried reduce(vcat, token_list), but it did not seem to make much of a difference. As far as I can tell from the profiling output, the tokenization algorithm accounts for most of the time and the allocations, as @mrufsvold also mentioned in his helpful response.

Thank you very much for your helpful response! Comparing the two now, the R function is only about 2 times faster than the Julia function you improved in your response. I will not mark your response as the Solution for now, just in case somebody else would like to add something on the API you mention.

julia> @btime unnest_tokens4(netflix_titles, :description, :word);
  206.538 ms (352671 allocations: 22.01 MiB)

julia> @btime R"unnest_tokens($netflix_titles, word, description)";
  90.575 ms (39119 allocations: 1023.95 KiB)


UPDATE 1: If we use the punctuation_space_tokenize() function instead of nltk_word_tokenize(), we get a significant improvement in speed. This change makes the Julia function almost twice as fast as the R version! The function’s documentation says it “Tokenizes by removing punctuation, unless it occurs inside of a word”, and this behavior precisely mirrors what the R unnest_tokens() function does.
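
For a quick sense of the difference between the two tokenizers (the outputs described in the comments reflect the documented behavior; they are not copied from a session):

using WordTokenizers

s = "It's a fast-paced, fun show."

# nltk_word_tokenize splits punctuation off into separate tokens,
# so "," and "." end up as tokens of their own.
nltk_word_tokenize(s)

# punctuation_space_tokenize drops punctuation unless it occurs inside a word,
# so "It's" and "fast-paced" survive intact and no "," / "." tokens are produced,
# which is what tidytext's unnest_tokens() does by default.
punctuation_space_tokenize(s)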

UPDATE 2: The @view in new_df = @view df[repeat_indices, :] causes problems and has to be removed for the function to work as expected.

So far, the Julia version of the function is as follows:

function unnest_tokens4(df::DataFrame, input_col::Symbol, output_col::Symbol)
    # Tokenize each document, then drop the original text column
    token_list = punctuation_space_tokenize.(df[!, input_col])
    df = select(df, Not(input_col))

    # One flat vector of all tokens, plus how many tokens each row produced
    flat_token_list = reduce(vcat, token_list)
    repeat_lengths = length.(token_list)

    # Repeat row index i once per token produced by row i
    repeat_indices = Vector{Int}(undef, sum(repeat_lengths))
    counter = 1
    @inbounds for i in eachindex(repeat_lengths)
        repeat_indices[counter:repeat_lengths[i]+(counter-1)] .= i
        counter += repeat_lengths[i]
    end

    # Materialize the repeated rows (no @view; see UPDATE 2) and attach the tokens
    new_df = df[repeat_indices, :]
    new_df[!, output_col] = flat_token_list

    return new_df
end

And here are the @btime results:

julia> @btime unnest_tokens4(netflix_titles, :description, :word);
  64.411 ms (162586 allocations: 40.57 MiB)

julia> @btime R"unnest_tokens($netflix_titles, word, description)";
  100.188 ms (39119 allocations: 1023.95 KiB)

However, I still believe there is room for improvement with respect to the tokenization API that @mrufsvold mentions, as well as to the much slower (and possibly more complex) StringDocument-based tokenization functions in the TextAnalysis package. Since tokenization is crucial for text processing and NLP in general, I would be thrilled to see further opinions on the issue over the next few days, before I mark @mrufsvold’s response as the Solution.
