Writing a fast nlp tokenizer in Julia

oxinabox · January 30, 2021, 4:32pm

You should probably take a look at

Both because it may already be what you need, and because it has implemented the data structures and API for fast parsing, in the form of the TokenBuffer API.
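
To give a rough idea of what a buffer-based, rule-driven tokeniser looks like, here is a minimal sketch. It is not the package's actual TokenBuffer API: `SketchBuffer`, `finished`, `flush!`, the three rule functions, and `sketch_tokenize` are all hypothetical names I made up for illustration. The idea is just that each rule looks at the current character, returns `true` if it consumed it, and the main loop tries the rules in order.

```julia
# Hypothetical sketch: none of these names come from a real package.
mutable struct SketchBuffer
    input::Vector{Char}     # characters of the text being tokenised
    buffer::Vector{Char}    # characters of the token currently being built
    tokens::Vector{String}  # completed tokens
    idx::Int                # position of the next unread character
end

SketchBuffer(text::AbstractString) = SketchBuffer(collect(text), Char[], String[], 1)

finished(sb::SketchBuffer) = sb.idx > length(sb.input)

# Push the pending token (if any) onto the output and clear the buffer.
function flush!(sb::SketchBuffer)
    if !isempty(sb.buffer)
        push!(sb.tokens, String(sb.buffer))
        empty!(sb.buffer)
    end
    return sb
end

# Rule: whitespace ends the current token and is skipped.
function whitespace!(sb::SketchBuffer)
    isspace(sb.input[sb.idx]) || return false
    flush!(sb)
    sb.idx += 1
    return true
end

# Rule: punctuation ends the current token and becomes a token of its own.
function punctuation!(sb::SketchBuffer)
    c = sb.input[sb.idx]
    ispunct(c) || return false
    flush!(sb)
    push!(sb.tokens, string(c))
    sb.idx += 1
    return true
end

# Fallback rule: append the character to the token being built.
function character!(sb::SketchBuffer)
    push!(sb.buffer, sb.input[sb.idx])
    sb.idx += 1
    return true
end

function sketch_tokenize(text::AbstractString)
    sb = SketchBuffer(text)
    while !finished(sb)
        # Try each rule in order; the first one that matches consumes the character.
        whitespace!(sb) || punctuation!(sb) || character!(sb)
    end
    flush!(sb)  # emit whatever is left in the buffer
    return sb.tokens
end

sketch_tokenize("Hello, world!")  # ["Hello", ",", "world", "!"]
```

Working on a pre-collected `Vector{Char}` with a single integer cursor avoids repeated string indexing and substring allocation, which is one plausible reason this style of design can be fast. I am assuming that here rather than quoting the package's internals.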

You can check the paper.

It’s >4x faster than spaCy, and >6x faster than NLTK.

I suspect HuggingFace’s new tokenisers are faster still, but I have not evaluated them.

Though it is mostly rule-based tokenisers, especially as far as performance optimisation is concerned, so it might not be so relevant to you.
