Writing a fast NLP tokenizer in Julia

You could use codeunits to treat any string as a vector of code units (bytes, for a UTF-8 String), so you don’t have to replace anything. You could then fill an array of the same length as your code units vector with token “membership” information. This should be reasonably efficient and avoids repeatedly inserting and deleting elements. A minimal sketch of that idea (the names token_id and merge_at! are just illustrative):
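```julia
# View the string as read-only code units (bytes) and keep a parallel
# array of token ids ("membership"), so a merge only rewrites the id array
# instead of mutating the string.
s = "low lower lowest"
units = codeunits(s)                        # byte view, no copying
token_id = collect(1:length(units))         # start with one token per code unit

# Merging the units at positions i and i+1 into one token just means
# giving position i+1 the same id as position i.
merge_at!(ids, i) = (ids[i+1] = ids[i]; ids)

merge_at!(token_id, 1)                      # units 1 and 2 now share a token
```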

Now, I don’t know the details of BPE: does it allow more than 2 adjacent characters to become a token, or does it only search for pairs?
In the latter case, iterating over all code units once should allow you to build a co-occurrence matrix pretty efficiently, from which you can compute the best pairs to join into a common token. A rough sketch of that counting pass (using a Dict counter as a stand-in for a full 256×256 matrix; this is an illustration, not a specific BPE implementation):
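```julia
# One pass over the code units counts every adjacent byte pair; the most
# frequent pair is the next merge candidate.
function pair_counts(s::AbstractString)
    units = codeunits(s)
    counts = Dict{Tuple{UInt8,UInt8},Int}()
    for i in 1:length(units)-1
        p = (units[i], units[i+1])
        counts[p] = get(counts, p, 0) + 1
    end
    return counts
end

counts = pair_counts("low lower lowest")
freq, best = findmax(counts)    # `best` is the most frequent adjacent byte pair
```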