Writing a fast NLP tokenizer in Julia

You should probably take a look at

Both because it may already be what you need, and because it has implemented the data structures and API for fast parsing: the TokenBuffer API.
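
To give a rough feel for what a buffer-based, rule-driven tokeniser looks like, here is a minimal self-contained sketch. This is not the package's code: the buffer-plus-rules design is my assumption about the general approach, and every name in it (`SketchBuffer`, `finished`, `emit!`, `spaces!`, `punctuation!`, `character!`, `tokenise`) is made up for illustration, so check the real TokenBuffer API before relying on any of it.

```julia
# Hypothetical sketch of a buffer-based tokeniser; all names are illustrative.
mutable struct SketchBuffer
    input::Vector{Char}      # characters to scan
    idx::Int                 # position of the next unread character
    buffer::Vector{Char}     # characters of the token being built
    tokens::Vector{String}   # finished tokens
end

SketchBuffer(text::AbstractString) = SketchBuffer(collect(text), 1, Char[], String[])

# True once every input character has been consumed.
finished(b::SketchBuffer) = b.idx > length(b.input)

# Move the pending characters into the token list, if there are any.
function emit!(b::SketchBuffer)
    if !isempty(b.buffer)
        push!(b.tokens, String(b.buffer))
        empty!(b.buffer)
    end
    return b
end

# Rule: whitespace ends the current token and is skipped.
function spaces!(b::SketchBuffer)
    isspace(b.input[b.idx]) || return false
    emit!(b)
    b.idx += 1
    return true
end

# Rule: a punctuation character becomes a token of its own.
function punctuation!(b::SketchBuffer)
    ispunct(b.input[b.idx]) || return false
    emit!(b)
    push!(b.tokens, string(b.input[b.idx]))
    b.idx += 1
    return true
end

# Fallback rule: append the character to the token being built.
function character!(b::SketchBuffer)
    push!(b.buffer, b.input[b.idx])
    b.idx += 1
    return true
end

# Try the rules in order at each position; the first one that matches wins.
function tokenise(text::AbstractString)
    b = SketchBuffer(text)
    while !finished(b)
        spaces!(b) || punctuation!(b) || character!(b)
    end
    emit!(b)   # do not lose a trailing token
    return b.tokens
end
```

Calling `tokenise("Hello, world!")` splits the words from the comma and the exclamation mark. Each rule either consumes input and returns `true` or returns `false` without touching anything, so adding a new rule means writing one small function and slotting it into the `||` chain.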

You can check the paper.

It’s >4x faster than spaCy, and >6x faster than NLTK.

I suspect HuggingFace’s new tokenisers are faster still, but I have not evaluated them.

That said, it is mostly about rule-based tokenisers, especially as far as performance optimisation is concerned, so it might not be so relevant.