Really happy and excited to present KeemenaPreprocessing.jl, a package for preprocessing text: normalization, filtering, and offsets at the byte, character, word, sentence, paragraph, and even document level. It also builds alignments between the different offset levels. Chunked streaming lets users with consumer-grade hardware process text datasets larger than memory. It produces a neat bundle that can be easily saved/exported with JLD2:
KeemenaPreprocessing.jl (GitHub - mantzaris/KeemenaPreprocessing.jl: Preprocessing for text data: cleaning, normalization, vectorization, tokenization and more)
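A minimal sketch of the workflow (the call below is simplified: `preprocess_corpus` stands in for the package's top-level entry point, so check the README for the exact name and options; the save/load lines are the standard JLD2.jl API):

```julia
using KeemenaPreprocessing
using JLD2

# Illustrative only: `preprocess_corpus` stands in for the package's
# top-level entry point; see the README for the exact signature.
docs = ["First document. It has two sentences.",
        "Second document, with some Unicode: café, naïve."]

bundle = preprocess_corpus(docs)

# Saving and reloading the bundle uses the standard JLD2.jl API.
jldsave("corpus_bundle.jld2"; bundle)
bundle2 = load("corpus_bundle.jld2", "bundle")
```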
- **Text Cleaning & Normalization**: handles casing, punctuation, stopwords, Unicode normalization, diacritic removal, and whitespace cleanup.
- **Tokenization at Multiple Levels**: supports characters, words, subwords, sentences, and documents, with consistent data structures.
- **Vocabulary Building**: frequency-based vocabulary extraction, pruning, and token-to-ID mapping.
- **Bundle System for Multi-Level Outputs**: produces a unified “bundle” object that stores representations at different levels (characters, tokens, sentences, etc.) for the same text.
- **Offsets for Precise Alignment**: maintains offsets so that positions in one representation (e.g., subwords) map back to another (e.g., original characters or words). Useful for tasks like highlighting spans or aligning embeddings (see the first sketch after this list).
- **Cross-Level Alignment**: ensures that outputs across granularities (character → token → sentence → document) remain linked, allowing downstream models to use multiple levels of context simultaneously.
- **Efficient Streaming for Large Datasets**: preprocesses text in streams without requiring all data in memory (see the second sketch after this list).
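To make the offset and streaming points concrete, here are two small sketches in plain Julia. They illustrate the underlying ideas rather than the package's actual API; `word_spans` and `process_in_chunks` are names made up for this post.

First, byte offsets vs. character offsets, and mapping word tokens back to spans in the original string:

```julia
# Plain-Julia illustration (not the package API): recover each word's
# byte-index span in the original string. With non-ASCII text, byte and
# character offsets differ, which is what the alignment maps account for.
function word_spans(text::AbstractString)
    spans = UnitRange{Int}[]
    pos = firstindex(text)
    for w in split(text)
        r = findnext(w, text, pos)    # byte-index range of this word
        push!(spans, r)
        pos = nextind(text, last(r))  # advance past the match safely
    end
    return spans
end

text = "Καλημέρα κόσμε"               # 14 characters, 27 bytes
@show ncodeunits(text) length(text)   # byte count vs. character count
@show word_spans(text)                # [1:15, 18:26] in byte indices
```

Second, the bounded-memory chunking idea behind the streaming mode:

```julia
# Plain-Julia illustration (not the package API): process a large file a
# fixed number of lines at a time, so memory use stays bounded no matter
# how big the file is.
function process_in_chunks(f, path::AbstractString; chunk_size::Int = 10_000)
    open(path, "r") do io
        chunk = String[]
        for line in eachline(io)
            push!(chunk, line)
            if length(chunk) == chunk_size
                f(chunk)              # e.g. tokenize, update vocab counts
                empty!(chunk)
            end
        end
        isempty(chunk) || f(chunk)    # flush the final partial chunk
    end
end

# Usage with a do-block, e.g. on a (hypothetical) corpus file:
# process_in_chunks("big_corpus.txt") do chunk
#     println(length(chunk))
# end
```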
The name ‘Keemena’ is adopted from the Greek word for ‘text’.
Also, if anyone is a JOSS editor, could you please volunteer to review the package? [PRE REVIEW]: KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP · Issue #8568 · openjournals/joss-reviews · GitHub