Introducing Montre.jl
montre (/mɔ̃tʁ/): from French montrer, “to show.” The Latin root is monstrare, “to point out, indicate.” ☜
Montre is a fast, embeddable corpus query engine that I wrote in the Rust language. This package, Montre.jl, provides the Julia bindings for it, giving you CQL-like queries, concordances, and structured data extraction from the Julia REPL.
Who should try Montre?
Montre is for anyone working in NLP or corpus linguistics who has CoNLL-U files to analyze. CoNLL-U is the ubiquitous annotation format used by spaCy, Stanza, CoreNLP, and the Universal Dependencies project. Montre indexes CoNLL-U natively, and its data model was designed to map cleanly onto Universal Dependencies: every column is preserved as a named layer, multiword tokens and empty nodes retain their UD semantics, and the structural hierarchy — token, sentence, document — mirrors the organization of UD treebanks.
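To make "every column is preserved as a named layer" concrete, here is a minimal plain-Julia sketch of splitting one CoNLL-U token line into its ten named columns. It does not use Montre at all. The names `CONLLU_FIELDS`, `parse_conllu_line`, and `parse_feats` are hypothetical helpers invented for this illustration, and the column names follow the Universal Dependencies specification, so they may differ from Montre's own layer names (for example, the queries in this post use `pos`).

```julia
# The ten CoNLL-U columns, in the order defined by Universal Dependencies.
const CONLLU_FIELDS = (:id, :form, :lemma, :upos, :xpos, :feats, :head, :deprel, :deps, :misc)

# Split one tab-separated token line into a NamedTuple keyed by column name.
# (Hypothetical helper for illustration; not part of Montre.)
function parse_conllu_line(line::AbstractString)
    cols = split(line, '\t')
    length(cols) == 10 || throw(ArgumentError("expected 10 columns, got $(length(cols))"))
    return NamedTuple{CONLLU_FIELDS}(Tuple(String.(cols)))
end

# FEATS is a "|"-separated list of Key=Value pairs, or "_" when empty.
function parse_feats(feats::AbstractString)
    feats == "_" && return Dict{String,String}()
    return Dict(String(k) => String(v) for (k, v) in split.(split(feats, '|'), '='; limit = 2))
end

line = "4\tcheveux\tcheveu\tNOUN\t_\tGender=Masc|Number=Plur\t2\tnmod\t_\t_"
tok = parse_conllu_line(line)
feats = parse_feats(tok.feats)
```

Here `tok.lemma` is `"cheveu"` and `feats["Number"]` is `"Plur"`, which is the kind of value a constraint such as `feats.Number='Plur'` tests against.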
A quick tour
The examples below use a parallel French/English corpus of Guy de Maupassant’s short stories, shipped as a LazyArtifact:
using Montre
using LazyArtifacts
corpus = Montre.open(artifact"maupassant_corpus")
Corpus(1627916 tokens, 487 documents, 2 components)
The corpus contains about 300 stories in French, along with public-domain English translations of about 180 of them:
components(corpus)
2-element Vector{Component}:
Component("maupassant-en", en, 589750 tokens)
Component("maupassant-fr", fr, 1038166 tokens)
It also contains queryable alignments between the translated stories, specifically ~35k sentence-to-sentence pairs produced by Google’s LaBSE:
alignments(corpus)
1-element Vector{Alignment}:
Alignment("labse", maupassant-fr → maupassant-en, 35003 edges)
Querying
At its simplest, a CQL query matches tokens by their annotations. The cql"..." string macro lets you write queries with single quotes — no escaping needed:
nouns = query(corpus, cql"[pos='NOUN']")
244184 hits across 487 documents
CQL patterns can combine regular expressions, boolean constraints, and multiple annotation layers. A CQL engine works somewhat like a regex engine, but instead of matching characters in raw text, it matches sequences of tokens by their surface forms and linguistic annotations. For example, this query finds instances of the colors noir, blanc, rouge, bleu, or vert where the part of speech (pos) is “ADJ”:
colors = query(
corpus,
cql"[lemma=/^(noir|blanc|rouge|bleu|vert)$/ & pos='ADJ']";
component = "maupassant-fr"
)
1530 hits across 267 documents
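As a conceptual aid only, the same constraint can be written as a predicate over annotated tokens in plain Julia. This sketch is not how Montre evaluates queries (it works from an index rather than scanning tokens), and the `tokens` vector and `matches_color` function are made up for the illustration.

```julia
# A tiny hand-made token sequence; each token carries several annotation layers.
tokens = [
    (form = "cheveux", lemma = "cheveu", pos = "NOUN"),
    (form = "rouges",  lemma = "rouge",  pos = "ADJ"),
    (form = "le",      lemma = "le",     pos = "DET"),
    (form = "rouge",   lemma = "rouge",  pos = "NOUN"),  # the noun "rouge", not the adjective
]

# Equivalent in spirit to [lemma=/^(noir|blanc|rouge|bleu|vert)$/ & pos='ADJ'].
matches_color(tok) =
    occursin(r"^(noir|blanc|rouge|bleu|vert)$", tok.lemma) && tok.pos == "ADJ"

# Positions of the tokens that satisfy both conditions.
hits = findall(matches_color, tokens)
```

Only the second token matches: the fourth shares the lemma but fails the `pos` test, which is the point of combining layers in one constraint.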
Concordance
A concordance — keyword in context (KWIC) — is the classic output of corpus query tools. Montre produces them in one call:
concordance(colors; limit = 3)
Concordance (3 lines)
25francs.conllu d' une flamme de cheveux ***rouges*** sur le sommet de le
25francs.conllu allait sur la grande route ***blanche*** , à le pas lent
25francs.conllu , vêtu d' un surplis ***blanc*** , et boitillant , entonner
You could produce something like this yourself with some wrangling, but Montre handles the context windows, document boundaries, and surface text reconstruction (multiword tokens, spacing) efficiently behind the scenes. Concordances also implement Tables.jl, so DataFrame(concordance(hits)) works directly.
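For a sense of what the do-it-yourself version involves, here is a rough plain-Julia sketch of the context-window step alone. The `kwic` function and its `window` keyword are hypothetical names for this example. The sketch ignores document boundaries and surface text reconstruction, which are the parts Montre takes care of.

```julia
# Build one keyword-in-context line for the token at position `i`,
# with up to `window` tokens of context on each side.
function kwic(tokens::AbstractVector{<:AbstractString}, i::Integer; window::Integer = 3)
    left  = tokens[max(firstindex(tokens), i - window):i-1]
    right = tokens[i+1:min(lastindex(tokens), i + window)]
    parts = [join(left, " "), "***" * tokens[i] * "***", join(right, " ")]
    # Drop empty context so hits at the edges do not get stray spaces.
    return join(filter(!isempty, parts), " ")
end

tokens = split("une flamme de cheveux rouges sur le sommet")
line = kwic(tokens, 5; window = 2)
edge = kwic(tokens, 1; window = 2)
```

With these inputs, `line` is `"de cheveux ***rouges*** sur le"` and `edge` is `"***une*** flamme de"`.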
Labels and constraints
CQL labels mark subspans within a match, and global constraints express relationships between them. This query finds coordinated noun pairs of the form “X and Y” in the English component, storing the left- and right-hand sides of the results in named captures a and b respectively:
pairs = query(
corpus,
CQL("a:[pos='NOUN'] [lemma='and'] b:[pos='NOUN']");
component = "maupassant-en"
)
904 hits across 173 documents
Labels also enable global constraints: conditions that relate labeled positions to each other. This query finds a singular noun followed by the plural of the same lemma, with at most 10 tokens in between:
echoes = query(
corpus,
CQL("a:[pos='NOUN' & feats.Number='Sing'] []{0,10} b:[pos='NOUN' & feats.Number='Plur'] :: a.lemma = b.lemma");
component = "maupassant-en",
)
89 hits across 66 documents
Hit 1 (a_cremation.conllu) custom as yet to our customs
Hit 2 (a_cremation.conllu) flame and burning with long blue flames
Hit 3 (a_family.conllu) lady in curls and flounces , one of those ladies
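To spell out what the `:: a.lemma = b.lemma` constraint is checking, here is a brute-force plain-Julia sketch over a hand-made token list. The `find_echoes` function, the `max_gap` keyword, and the flat `number` field are assumptions made for this example. A real CQL engine may also report overlapping matches differently from this exhaustive loop.

```julia
# Hand-made tokens for "custom as yet to our customs".
tokens = [
    (form = "custom",  lemma = "custom", pos = "NOUN", number = "Sing"),
    (form = "as",      lemma = "as",     pos = "ADV",  number = ""),
    (form = "yet",     lemma = "yet",    pos = "ADV",  number = ""),
    (form = "to",      lemma = "to",     pos = "ADP",  number = ""),
    (form = "our",     lemma = "our",    pos = "PRON", number = ""),
    (form = "customs", lemma = "custom", pos = "NOUN", number = "Plur"),
]

# Find (a, b) position pairs where a is a singular noun, b is a plural noun
# with the same lemma, and at most `max_gap` tokens sit between them.
function find_echoes(tokens; max_gap::Integer = 10)
    hits = Tuple{Int,Int}[]
    for (i, a) in enumerate(tokens)
        (a.pos == "NOUN" && a.number == "Sing") || continue
        for j in i+1:min(i + max_gap + 1, length(tokens))
            b = tokens[j]
            if b.pos == "NOUN" && b.number == "Plur" && a.lemma == b.lemma
                push!(hits, (i, j))
            end
        end
    end
    return hits
end

hits = find_echoes(tokens)
```

The local conditions (`pos`, `number`) are tested on each token separately, while the lemma equality can only be checked once both labeled positions are known. That is why it sits in a separate global constraint in the query.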
Extracting into DataFrames
extract bridges query results and DataFrames. You specify which layers to pull and how to reduce each span to a value:
using DataFrames
df = extract(
pairs,
DataFrame,
(x -> only(x["a", :lemma])) => :left,
(x -> only(x["b", :lemma])) => :right,
:document,
)
From there, the data is just an ordinary DataFrame. For example, you can use DataFramesMeta.jl transformations to get the most common noun pairings in the English Maupassant component:
using Chain
using DataFramesMeta
@chain df begin
@rtransform(:pair = :left * " and " * :right)
groupby(:pair)
@combine(:count = length(:pair))
@orderby(-:count)
first(10)
end
10×2 DataFrame
 Row │ pair                 count
     │ String               Int64
─────┼────────────────────────────
   1 │ day and night           16
   2 │ father and mother       14
   3 │ bread and butter         9
   4 │ husband and wife         8
   5 │ man and woman            8
   6 │ morning and evening      6
   7 │ plate and dish           5
   8 │ hand and knee            5
   9 │ arm and leg              5
  10 │ hand and foot            4
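If you would rather not add the macro packages, roughly the same pipeline can be written with plain DataFrames.jl functions. The sketch below runs on a small made-up `toy` table with the same `left` and `right` columns, so it works without a corpus. The data in it is invented for the example.

```julia
using DataFrames

# Made-up stand-in for the extracted table of noun pairs.
toy = DataFrame(
    left  = ["day", "father", "day", "day"],
    right = ["night", "mother", "night", "night"],
)

# Build the "X and Y" string for each row.
with_pair = transform(toy, [:left, :right] => ByRow((l, r) -> l * " and " * r) => :pair)

# Count rows per pair, sort by descending count, keep the top 10.
counts = combine(groupby(with_pair, :pair), nrow => :count)
top = first(sort(counts, :count; rev = true), 10)
```

On the toy data, `top` has two rows: "day and night" with a count of 3, then "father and mother" with a count of 1.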
Parallel corpus support
Montre treats a parallel corpus as a single artifact with multiple components and named alignment relations between them — not two corpora glued together at query time. The Maupassant corpus above has French and English components with LaBSE sentence alignments. Basic alignment projection and coverage analysis are available now; I’m still working out the most natural shape for these operations in the Julia interface. More on this in a future release.
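As a conceptual sketch only, and not a preview of the eventual Julia interface, alignment projection and coverage can be pictured as operations over a list of edges between sentence ids. The `edges` data and the `project` and `coverage` functions below are hypothetical and written in plain Julia.

```julia
# Hypothetical alignment edges: (French sentence id, English sentence id).
edges = [(1, 1), (2, 2), (2, 3), (4, 5)]

# Projection: the English sentences aligned to any of the given French sentences.
project(edges, src_ids) = sort(unique([t for (s, t) in edges if s in src_ids]))

# Coverage: the share of source sentences that have at least one alignment edge.
coverage(edges, n_src) = length(unique(first.(edges))) / n_src

projected = project(edges, [2, 4])
cov = coverage(edges, 4)
```

Here `projected` is `[2, 3, 5]` because French sentence 2 aligns to two English sentences, and `cov` is `0.75` because French sentence 3 has no edge.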
What’s next
- Contrastive alignment queries: constraints that span across aligned components
- A TUI for interactive exploration (with accessibility features including voice input for CQL)
- Support for additional input formats (VRT, Stanza JSON, TEI XML)
- Statistical and grouping operations
Montre is open source under Apache-2.0. Feedback, issues, and contributions are welcome. Please keep in mind that it is still in early development and the API is expected to change.
Registration status: Montre.jl should hit the General Registry on Thursday morning.