[ANN] Montre.jl: bindings for a new corpus query engine for NLP

myersm0 · April 8, 2026, 1:32am

Introducing Montre.jl

montre (/mɔ̃tʁ/): from French montrer, “to show.” The Latin root is monstrare, “to point out, indicate.” ☜

Montre is a fast, embeddable corpus query engine that I wrote in Rust. This package, Montre.jl, provides Julia bindings for it, giving you CQL-like queries, concordances, and structured data extraction from the Julia REPL.

Who should try Montre?

Montre is for anyone working in NLP or corpus linguistics who has CoNLL-U files to analyze. CoNLL-U is the ubiquitous annotation format used by spaCy, Stanza, CoreNLP, and the Universal Dependencies project. Montre indexes CoNLL-U natively, and its data model was designed to map cleanly onto Universal Dependencies: every column is preserved as a named layer, multiword tokens and empty nodes retain their UD semantics, and the structural hierarchy — token, sentence, document — mirrors the organization of UD treebanks.

A quick tour

The examples below use a parallel French/English corpus of Guy de Maupassant’s short stories, shipped as a LazyArtifact:

using Montre
using LazyArtifacts

corpus = Montre.open(artifact"maupassant_corpus")

Corpus(1627916 tokens, 487 documents, 2 components)

The corpus contains about 300 stories in French, plus public-domain English translations of about 180 of them:

components(corpus)

2-element Vector{Component}:
 Component("maupassant-en", en, 589750 tokens)
 Component("maupassant-fr", fr, 1038166 tokens)

It also contains queryable alignments between the translated stories, specifically ~35k sentence-to-sentence pairs produced by Google’s LaBSE:

alignments(corpus)

1-element Vector{Alignment}:
 Alignment("labse", maupassant-fr → maupassant-en, 35003 edges)

Querying

At its simplest, a CQL query matches tokens by their annotations. The cql"..." string macro lets you write queries with single quotes — no escaping needed:

nouns = query(corpus, cql"[pos='NOUN']")

244184 hits across 487 documents

CQL patterns can include regex, boolean constraints, and layer combinations. A pattern works somewhat like a regular expression, but instead of matching only characters in raw text, you match efficiently over linguistic annotations as well. For example, this query searches for the colors noir, blanc, rouge, bleu, or vert where the part of speech (pos) is “ADJ”:

colors = query(
    corpus, 
    cql"[lemma=/^(noir|blanc|rouge|bleu|vert)$/ & pos='ADJ']"; 
    component = "maupassant-fr"
)

1530 hits across 267 documents
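To make "matching over annotations" concrete, here is a toy sketch in plain Julia. It is not how Montre is implemented and uses none of its API: the token list is hand-written for illustration, and `is_color_adj` is a hypothetical helper. It only shows that both conditions inside the brackets must hold on the same token.

```julia
# Toy illustration only: NOT Montre's implementation or API.
# Each token is a NamedTuple of annotation layers, like one CoNLL-U row.
tokens = [
    (form = "une",     lemma = "un",    pos = "DET"),
    (form = "robe",    lemma = "robe",  pos = "NOUN"),
    (form = "blanche", lemma = "blanc", pos = "ADJ"),
    (form = "et",      lemma = "et",    pos = "CCONJ"),
    (form = "le",      lemma = "le",    pos = "DET"),
    (form = "rouge",   lemma = "rouge", pos = "NOUN"),  # color word used as a noun
]

# Counterpart of [lemma=/^(noir|blanc|rouge|bleu|vert)$/ & pos='ADJ']:
# the regex on the lemma layer AND the equality on the pos layer
# are checked against the same token.
color_lemma = r"^(noir|blanc|rouge|bleu|vert)$"
is_color_adj(tok) = occursin(color_lemma, tok.lemma) && tok.pos == "ADJ"

hits = findall(is_color_adj, tokens)            # token positions that match
matched_forms = [tokens[i].form for i in hits]  # surface forms at those positions
println(matched_forms)
```

Note that "rouge" is skipped here because it is tagged NOUN, while the inflected form "blanche" is found through its lemma "blanc". A raw-text regex cannot make either distinction.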

Concordance

A concordance — keyword in context (KWIC) — is the classic output of corpus query tools. Montre produces them in one call:

concordance(colors; limit = 3)

Concordance (3 lines)
25francs.conllu    d' une flamme de cheveux ***rouges*** sur le sommet de le
25francs.conllu  allait sur la grande route ***blanche*** , à le pas lent
25francs.conllu        , vêtu d' un surplis ***blanc*** , et boitillant , entonner

You could produce something like this yourself with some wrangling, but Montre handles the context windows, document boundaries, and surface text reconstruction (multiword tokens, spacing) efficiently behind the scenes. Concordances also implement Tables.jl, so DataFrame(concordance(hits)) works directly.
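For a sense of what the "wrangling" involves, here is a minimal KWIC sketch over a plain vector of word forms. It is a toy, not Montre's implementation: `kwic` and its `width` keyword are hypothetical names, and it ignores document boundaries and surface text reconstruction, which are the parts Montre handles for you.

```julia
# Toy KWIC sketch in plain Julia; not Montre's implementation or API.
# `kwic` and `width` are hypothetical names chosen for this example.
function kwic(forms::Vector{String}, hit::Int; width::Int = 3)
    lo = max(firstindex(forms), hit - width)  # clamp the window at the start
    hi = min(lastindex(forms), hit + width)   # clamp the window at the end
    left  = join(forms[lo:hit-1], " ")        # empty string when hit is the first token
    right = join(forms[hit+1:hi], " ")        # empty string when hit is the last token
    return (left = left, match = forms[hit], right = right)
end

# Tokens taken from the second concordance line above.
forms = String.(split("allait sur la grande route blanche , à le pas lent"))

line = kwic(forms, 6; width = 3)
println(line.left, " ***", line.match, "*** ", line.right)
```

Even this small version needs explicit clamping at both ends of the token vector. A real corpus adds sentence and document boundaries on top of that.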

Labels and constraints

CQL labels mark subspans within a match, and global constraints express relationships between them. This query finds coordinated noun pairs of the form “X and Y” in the English component, storing the left- and right-hand sides of the results in named captures a and b respectively:

pairs = query(
    corpus, 
    CQL("a:[pos='NOUN'] [lemma='and'] b:[pos='NOUN']"); 
    component = "maupassant-en"
)

904 hits across 173 documents

Labels also enable global constraints — conditions that relate labeled positions to each other. This query finds a singular noun followed within 10 tokens by the plural of the same lemma:

echoes = query(
    corpus,
    CQL("a:[pos='NOUN' & feats.Number='Sing'] []{0,10} b:[pos='NOUN' & feats.Number='Plur'] :: a.lemma = b.lemma");
    component = "maupassant-en",
)

89 hits across 66 documents
Hit 1 (a_cremation.conllu)     custom as yet to our customs 
Hit 2 (a_cremation.conllu)     flame and burning with long blue flames 
Hit 3 (a_family.conllu)        lady in curls and flounces , one of those ladies

Extracting into DataFrames

extract bridges query results and DataFrames. You specify which layers to pull and how to reduce each span to a value:

using DataFrames

df = extract(
    pairs,
    DataFrame,
    (x -> only(x["a", :lemma])) => :left,
    (x -> only(x["b", :lemma])) => :right,
    :document,
)

From there, the data is just an ordinary DataFrame. For example, you can use DataFramesMeta.jl transformations to get the most common noun pairings in the English Maupassant component:

using Chain
using DataFramesMeta

@chain df begin
    @rtransform(:pair = :left * " and " * :right)
    groupby(:pair)
    @combine(:count = length(:pair))
    @orderby(-:count)
    first(10)
end

10×2 DataFrame
 Row │ pair                 count
     │ String               Int64
─────┼────────────────────────────
   1 │ day and night           16
   2 │ father and mother       14
   3 │ bread and butter         9
   4 │ husband and wife         8
   5 │ man and woman            8
   6 │ morning and evening      6
   7 │ plate and dish           5
   8 │ hand and knee            5
   9 │ arm and leg              5
  10 │ hand and foot            4

Parallel corpus support

Montre treats a parallel corpus as a single artifact with multiple components and named alignment relations between them — not two corpora glued together at query time. The Maupassant corpus above has French and English components with LaBSE sentence alignments. Basic alignment projection and coverage analysis are available now; I’m still working out the most natural shape for these operations in the Julia interface. More on this in a future release.
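As a rough mental model of what projection and coverage could mean over sentence alignments, here is a toy sketch in plain Julia. Everything in it is an assumption made for illustration: the edge list and sentence ids are invented, `project` and `coverage` are hypothetical helpers, and none of this reflects Montre's data model or its eventual Julia interface.

```julia
# Toy sketch of sentence alignment as a list of edges; not Montre's data model or API.
# Each edge maps a French sentence id to an English sentence id (ids are invented).
edges = [1 => 1, 2 => 2, 2 => 3, 4 => 5]

# Projection (one plausible reading): given source sentences, for example those
# containing query hits, collect the target sentences they are aligned to.
project(edges, sources) =
    sort(unique([last(e) for e in edges if first(e) in sources]))

# Coverage (one plausible reading): the share of source sentences
# that have at least one alignment edge.
coverage(edges, n_source) = length(unique(first.(edges))) / n_source

println(project(edges, [2, 3]))  # French sentence 3 has no edge, so it contributes nothing
println(coverage(edges, 4))      # 3 of the 4 French sentences are aligned
```

French sentence 2 maps to two English sentences here, which is why the sketch treats alignments as edges rather than as a one-to-one mapping.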

What’s next

Contrastive alignment queries: constraints that span across aligned components
A TUI for interactive exploration (with accessibility features including voice input for CQL)
Support for additional input formats (VRT, Stanza JSON, TEI XML)
Statistical and grouping operations

Montre is open source under Apache-2.0. Feedback, issues, and contributions are welcome. Please keep in mind that it is still in early development and the API is expected to change.

Registration status: Montre.jl should hit the General Registry on Thursday morning.
