Announcing UniversalDependencies.jl: A Julia representation of the Universal Dependencies data model for annotated linguistic data. (Status: Not yet registered. Currently on day 2 of the three-day General Registry waiting period.)
What is Universal Dependencies? From Wikipedia:
Universal Dependencies, frequently abbreviated as UD, is an international cooperative project to create a grammatical annotation framework and treebanks of the world’s languages. […] Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology.
Among other things, this package contains read/write utilities for UD’s official file format, CoNLL-U, as well as some basic utilities for rendering that content in several ways. For example, given a CoNLL-U file containing this sentence:
# sent_id = weblog-3
# text = Highly recommended!
1 Highly highly ADV RB Degree=Pos 2 advmod 2:advmod _
2 recommended recommend VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root 0:root SpaceAfter=No
3 ! ! PUNCT . _ 2 punct 2:punct _
You can load it into Julia with:
treebank = UD.load("path_to_my_treebank.conllu")
And then render it in three different ways (it’s the third sentence in the file, hence the [3] indexing):
julia> render(ArcStyle(), treebank[3])
# sent_id = weblog-3
# text = Highly recommended!
root
╭──punct───╮
╭──advmod──╮ │
Highly recommended !
ADV VERB PUNCT
julia> render(CompactStyle(), treebank[3])
# sent_id = weblog-3
# text = Highly recommended!
Highly recommended !
ADV VERB PUNCT
julia> render(TableStyle(), treebank[3])
# sent_id = weblog-3
# text = Highly recommended!
1 Highly highly ADV RB Degree=Pos 2 advmod 2:advmod _
2 recommended recommend VERB VBN Tense=Past|VerbForm=Part|Voice=Pass 0 root 0:root SpaceAfter=No
3 ! ! PUNCT . _ 2 punct 2:punct _
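For readers new to the format: every CoNLL-U token line carries ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), which are the ten columns in the table rendering above. The sketch below illustrates that layout in plain Julia. `parse_token_line` and `parse_feats` are hypothetical helpers written only for this illustration; they are not part of UniversalDependencies.jl and say nothing about how the package parses files internally.

```julia
# Hand-rolled illustration of the CoNLL-U token layout.
# These helpers are assumptions for this sketch, not the package's API.

# The ten CoNLL-U column names, in file order.
const FIELDS = (:id, :form, :lemma, :upos, :xpos, :feats, :head, :deprel, :deps, :misc)

# Split one tab-separated token line into a NamedTuple keyed by column name.
function parse_token_line(line::AbstractString)
    cols = split(line, '\t')
    length(cols) == length(FIELDS) ||
        error("expected 10 tab-separated fields, got $(length(cols))")
    return NamedTuple{FIELDS}(Tuple(String.(cols)))
end

# Expand a FEATS value such as "Tense=Past|VerbForm=Part" into a Dict.
# An underscore marks an empty value in CoNLL-U.
function parse_feats(s::AbstractString)
    s == "_" && return Dict{String,String}()
    items = (split(item, '='; limit=2) for item in split(s, '|'))
    return Dict(String(kv[1]) => String(kv[2]) for kv in items)
end

# Token 2 of the example sentence, rebuilt with explicit tabs.
line = join(["2", "recommended", "recommend", "VERB", "VBN",
             "Tense=Past|VerbForm=Part|Voice=Pass", "0", "root", "0:root",
             "SpaceAfter=No"], '\t')

tok = parse_token_line(line)
feats = parse_feats(tok.feats)

println(tok.form, " -> head ", tok.head, " (", tok.deprel, ")")
println("Voice = ", feats["Voice"])
```

A HEAD of 0 marks the root of the sentence, which is why `recommended` sits under `root` in the arc rendering.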
A treebank can also be trivially converted to a DataFrame, either for the whole CoNLL-U file or for just an individual sentence, as below:
julia> using DataFrames
julia> DataFrame(treebank[3])
3×10 DataFrame
Row │ id form lemma upos xpos feats head deprel deps misc ⋯
│ NodeRef String String String String Features NodeRef String Enhanced… Feat ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1 Highly highly ADV RB Degree=Pos 2 advmod 2:advmod _ ⋯
2 │ 2 recommended recommend VERB VBN Tense=Past|VerbForm=Part|Voice=P… 0 root 0:root Spac
3 │ 3 ! ! PUNCT . _ 2 punct 2:punct _
1 column omitted
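Once a sentence is in tabular form, ordinary DataFrames.jl indexing is enough for simple dependency queries. The sketch below uses a hand-built stand-in for `DataFrame(treebank[3])` with plain `Int` and `String` columns. That is an assumption made so the example runs without the package; the real columns use the package's own types (`NodeRef`, `Features`, and so on), which may need converting before comparisons like these work.

```julia
using DataFrames

# Hand-built stand-in for DataFrame(treebank[3]).
# Plain Int/String columns are an assumption for this sketch.
df = DataFrame(
    id     = [1, 2, 3],
    form   = ["Highly", "recommended", "!"],
    upos   = ["ADV", "VERB", "PUNCT"],
    head   = [2, 0, 2],
    deprel = ["advmod", "root", "punct"],
)

# The root is the single token whose head is 0.
root_id   = only(df[df.head .== 0, :id])
root_form = only(df[df.head .== 0, :form])

# Direct dependents of the root, with their relation labels.
dependents = df[df.head .== root_id, [:form, :deprel]]

println("root: ", root_form)
println(dependents)
```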
Also supported: editing and writing CoNLL-U to disk.
Coming soon:
- validation of UD rules
- comparison utilities (e.g. diff’ing of two CoNLL-U files)
- support for additional formats like JSON, TEI, CoNLL-U+
Related: For anyone interested in this domain, I also invite you to preview my corpus query engine, montre. Montre is written in Rust, but Julia bindings for it are under active development and should be coming very soon. Montre is designed to make it easy to do fast and powerful stats with UD data.