Writing a fast NLP tokenizer in Julia

You could use codeunits to treat any string as a vector of code units (bytes, for a UTF-8 String), so you don’t have to replace anything. You could then fill an array of the same length as your code units vector with token “membership” information. This should be reasonably efficient and avoids repeatedly inserting and deleting elements. A minimal sketch of that idea (the names token_id and merge_at! are just illustrative):
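```julia
# View the string as read-only code units (bytes) and keep a parallel
# array of token ids ("membership"), so a merge only rewrites the id array
# instead of mutating the string.
s = "low lower lowest"
units = codeunits(s)                        # byte view, no copying
token_id = collect(1:length(units))         # start with one token per code unit

# Merging the units at positions i and i+1 into one token just means
# giving position i+1 the same id as position i.
merge_at!(ids, i) = (ids[i+1] = ids[i]; ids)

merge_at!(token_id, 1)                      # units 1 and 2 now share a token
```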

Now, I don’t know the details of BPE: does it allow more than 2 adjacent characters to become a token, or does it only search for pairs?
In the latter case, iterating over all code units once should allow you to build a co-occurrence matrix pretty efficiently, from which you can compute the best pairs to join into a common token. A rough sketch of that counting pass (using a Dict counter as a stand-in for a full 256×256 matrix; this is an illustration, not a specific BPE implementation):
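```julia
# One pass over the code units counts every adjacent byte pair; the most
# frequent pair is the next merge candidate.
function pair_counts(s::AbstractString)
    units = codeunits(s)
    counts = Dict{Tuple{UInt8,UInt8},Int}()
    for i in 1:length(units)-1
        p = (units[i], units[i+1])
        counts[p] = get(counts, p, 0) + 1
    end
    return counts
end

counts = pair_counts("low lower lowest")
freq, best = findmax(counts)    # `best` is the most frequent adjacent byte pair
```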