KeemenaPreprocessing: text tokenization, normalization, and more

I'm really happy and excited to present KeemenaPreprocessing, a package for preprocessing text: normalization, filtering, and offsets at the byte, character, word, sentence, paragraph, and even document level. It also builds alignments between the different offset levels. Chunked streaming lets users on consumer-grade hardware process text datasets larger than memory. The output is a neat bundle that can be easily saved/exported with JLD2:

KeemenaPreprocessing.jl (GitHub - mantzaris/KeemenaPreprocessing.jl: Preprocessing for text data: cleaning, normalization, vectorization, tokenization and more)

  • Text Cleaning & Normalization
    Handles casing, punctuation, stopwords, Unicode normalization, diacritic removal, and whitespace cleanup.

  • Tokenization at Multiple Levels
    Supports characters, words, subwords, sentences, and documents, with consistent data structures.

  • Vocabulary Building
    Frequency-based vocabulary extraction, pruning, and token-to-ID mapping.

  • Bundle System for Multi-Level Outputs
    Produces a unified “bundle” object that stores representations at different levels (characters, tokens, sentences, etc.) for the same text.

  • Offsets for Precise Alignment
    Maintains offsets so that positions in one representation (e.g., subwords) map back to another (e.g., original characters or words).
    Useful for tasks like highlighting spans or aligning embeddings.

  • Cross-Level Alignment
    Ensures that outputs across granularities (character → token → sentence → document) remain linked, allowing downstream models to use multiple levels of context simultaneously.

  • Efficient Streaming for Large Datasets
    Preprocesses text in streams without requiring all data in memory.
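To give a feel for how these pieces fit together, here is a hypothetical usage sketch. The function, type, and keyword names below are illustrative assumptions, not the package's actual API; see the README linked above for real usage.

```julia
using KeemenaPreprocessing  # names below are illustrative, not the real API

# Configure cleaning + tokenization (hypothetical options)
cfg = PipelineConfig(lowercase = true,
                     strip_diacritics = true,
                     tokenizer = :word)

# Stream a large corpus in chunks so memory use stays bounded
bundle = preprocess_corpus("corpus/*.txt", cfg; streaming = true)

# The bundle holds aligned representations at several levels, so a
# position in one level can be mapped back to another, e.g. a word
# index back to its character span in the source text (hypothetical):
# span = word_to_char_span(bundle, doc = 1, word = 5)

# Persist the bundle with JLD2 for later reuse
# using JLD2; jldsave("bundle.jld2"; bundle)
```

The design idea, as described above, is that one pass over the corpus produces every granularity at once, with cross-level offsets kept consistent, rather than re-tokenizing per task.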


‘Keemena’ is adopted from the Greek word for ‘text’.

Also, if anyone is a JOSS reviewer, please consider volunteering to review the package: [PRE REVIEW]: KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP · Issue #8568 · openjournals/joss-reviews · GitHub
