KeemenaPreprocessing: text tokenization, normalization, and more

I'm really happy and excited to present KeemenaPreprocessing, a package for preprocessing text: normalization, filtering, and offsets at the byte, character, word, sentence, paragraph, and even document level, plus alignments between the different offset levels. Chunked streaming lets users with consumer-grade hardware process text datasets larger than memory. It produces a neat bundle that can easily be saved/exported with JLD2:

KeemenaPreprocessing.jl (GitHub - mantzaris/KeemenaPreprocessing.jl: Preprocessing for text data: cleaning, normalization, vectorization, tokenization and more)

  • Text Cleaning & Normalization
    Handles casing, punctuation, stopwords, unicode normalization, diacritic removal, and whitespace cleanup.

  • Tokenization at Multiple Levels
    Supports characters, words, subwords, sentences, and documents, with consistent data structures.

  • Vocabulary Building
    Frequency-based vocabulary extraction, pruning, and token-to-ID mapping.

  • Bundle System for Multi-Level Outputs
    Produces a unified “bundle” object that stores representations at different levels (characters, tokens, sentences, etc.) for the same text.

  • Offsets for Precise Alignment
    Maintains offsets so that positions in one representation (e.g., subwords) map back to another (e.g., original characters or words).
    Useful for tasks like highlighting spans or aligning embeddings.

  • Cross-Level Alignment
    Ensures that outputs across granularities (character → token → sentence → document) remain linked, allowing downstream models to use multiple levels of context simultaneously.

  • Efficient Streaming for Large Datasets
    Preprocesses text in streams without requiring all data in memory.


‘Keemena’ is adopted from the Greek word for ‘text’.

Also, if anyone is a JOSS editor, could you please volunteer to review the package? [PRE REVIEW]: KeemenaPreprocessing.jl: Unicode-Robust Cleaning, Multi-Level Tokenisation & Streaming Offset Bundling for Julia NLP · Issue #8568 · openjournals/joss-reviews · GitHub


Update on KeemenaPreprocessing.jl:

I have now released v0.1.2, which adds subword support through KeemenaSubwords.jl.

So in addition to the earlier preprocessing, normalization, offsets, alignments, and streaming features, KeemenaPreprocessing.jl can now also act as a more complete Julia entry point for subword-based NLP.

What this adds at a high level:

  • subword tokenization from within KeemenaPreprocessing.jl
  • support for both tokenizer-native ids and bundle-reindexed subword vocabularies
  • streaming subword preprocessing for larger corpora
  • access to subword offsets, masks, token type ids, and metadata
  • support for built-in and local tokenizer sources
  • still allows explicit direct use of KeemenaSubwords.jl when finer control is needed
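To illustrate the new subword path, a minimal hypothetical sketch: the `tokenizer` keyword and the `:bpe` value are assumptions made for illustration, not the verified API; consult the README or KeemenaSubwords.jl documentation for the real option names.

```julia
# Sketch only: the `tokenizer` keyword and `:bpe` value are assumptions
# illustrating the feature list above, not the confirmed interface.
using KeemenaPreprocessing

docs = ["Streaming subword preprocessing example."]

# Subword tokenization from within KeemenaPreprocessing.jl, yielding a
# bundle that carries subword ids, offsets, masks, and token type ids.
bundle = preprocess_corpus(docs; tokenizer = :bpe)
```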

https://github.com/mantzaris/KeemenaPreprocessing.jl

Feedback is very welcome, especially from people working on tokenization or corpus preparation.
