KeemenaPreprocessing: text tokenization, normalization, and more

I'm really happy and excited to present KeemenaPreprocessing, a package for preprocessing text: normalization, filtering, and offsets at the byte, character, word, sentence, paragraph, and even document level, plus alignments between the different offset levels. Chunked streaming lets users with consumer-grade hardware process text datasets larger than memory. It produces a neat bundle that can easily be saved/exported with JLD2:

KeemenaPreprocessing.jl (GitHub - mantzaris/KeemenaPreprocessing.jl: Preprocessing for text data: cleaning, normalization, vectorization, tokenization and more)

  • Text Cleaning & Normalization
    Handles casing, punctuation, stopwords, unicode normalization, diacritic removal, and whitespace cleanup.

  • Tokenization at Multiple Levels
    Supports characters, words, subwords, sentences, and documents, with consistent data structures.

  • Vocabulary Building
    Frequency-based vocabulary extraction, pruning, and token-to-ID mapping.

  • Bundle System for Multi-Level Outputs
    Produces a unified “bundle” object that stores representations at different levels (characters, tokens, sentences, etc.) for the same text.

  • Offsets for Precise Alignment
    Maintains offsets so that positions in one representation (e.g., subwords) map back to another (e.g., original characters or words).
    Useful for tasks like highlighting spans or aligning embeddings.

  • Cross-Level Alignment
    Ensures that outputs across granularities (character → token → sentence → document) remain linked, allowing downstream models to use multiple levels of context simultaneously.

  • Efficient Streaming for Large Datasets
    Preprocesses text in streams without requiring all data in memory.


‘Keemena’ is adopted from the Greek word for ‘text’.

Also, if anyone is a JOSS editor, could you please volunteer to review the package? [PRE REVIEW]: KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP · Issue #8568 · openjournals/joss-reviews · GitHub


Update on KeemenaPreprocessing.jl:

I have now released v0.1.2, which adds subword support through KeemenaSubwords.jl.

So in addition to the earlier preprocessing, normalization, offsets, alignments, and streaming features, KeemenaPreprocessing.jl can now also act as a more complete Julia entry point for subword-based NLP.

What this adds at a high level:

  • subword tokenization from within KeemenaPreprocessing.jl
  • support for both tokenizer-native ids and bundle-reindexed subword vocabularies
  • streaming subword preprocessing for larger corpora
  • access to subword offsets, masks, token type ids, and metadata
  • support for built-in and local tokenizer sources
  • still allows explicit direct use of KeemenaSubwords.jl when finer control is needed
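To illustrate the new subword path, a minimal hypothetical sketch: the `tokenizer` keyword and the `:bpe` value are assumptions made for illustration, not the verified API; consult the README or KeemenaSubwords.jl documentation for the real option names.

```julia
# Sketch only: the `tokenizer` keyword and `:bpe` value are assumptions
# illustrating the feature list above, not the confirmed interface.
using KeemenaPreprocessing

docs = ["Streaming subword preprocessing example."]

# Subword tokenization from within KeemenaPreprocessing.jl, yielding a
# bundle that carries subword ids, offsets, masks, and token type ids.
bundle = preprocess_corpus(docs; tokenizer = :bpe)
```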

https://github.com/mantzaris/KeemenaPreprocessing.jl

Feedback is very welcome, especially from people working on tokenization or corpus preparation.
