[ANN] UniversalDependencies.jl for representing annotated linguistic data

Announcing UniversalDependencies.jl: A Julia representation of the Universal Dependencies data model for annotated linguistic data. (Status: Not yet registered. Currently on day 2 of the three-day General Registry waiting period.)

What is Universal Dependencies? From Wikipedia:

Universal Dependencies, frequently abbreviated as UD, is an international cooperative project to create a grammatical annotation framework and treebanks of the world's languages. […] Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology.

Among other things, this package contains read/write utilities for UD's official file format, CoNLL-U, as well as some basic utilities for rendering that content in several ways. For example, given a CoNLL-U file containing this sentence:

# sent_id = weblog-3
# text = Highly recommended!
1  Highly   highly   ADV   RB Degree=Pos  2  advmod   2:advmod _
2  recommended recommend   VERB  VBN   Tense=Past|VerbForm=Part|Voice=Pass 0  root  0:root   SpaceAfter=No
3  !  !  PUNCT .  _  2  punct 2:punct  _
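
For readers new to the format: each token line in a real CoNLL-U file holds ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the sample above is shown with spaces for readability. The following is a minimal plain-Julia sketch of that layout, not this package's API. The helper names `parse_token_line` and `parse_feats` are hypothetical and exist only for illustration.

```julia
# Plain-Julia sketch, independent of UniversalDependencies.jl.
# The ten CoNLL-U columns, in file order.
const CONLLU_FIELDS = (:id, :form, :lemma, :upos, :xpos,
                       :feats, :head, :deprel, :deps, :misc)

# Hypothetical helper: split one token line into a NamedTuple of strings.
function parse_token_line(line::AbstractString)
    cols = split(line, '\t')
    length(cols) == length(CONLLU_FIELDS) ||
        throw(ArgumentError("expected 10 tab-separated columns, got $(length(cols))"))
    return NamedTuple{CONLLU_FIELDS}(Tuple(String.(cols)))
end

# Hypothetical helper: turn a FEATS string such as "Tense=Past|VerbForm=Part"
# into a Dict; a bare "_" means the token has no features.
function parse_feats(feats::AbstractString)
    result = Dict{String,String}()
    feats == "_" && return result
    for pair in split(feats, '|')
        key, value = split(pair, '='; limit=2)
        result[String(key)] = String(value)
    end
    return result
end

# Token 2 of the sample sentence, joined with real tabs.
line = join(["2", "recommended", "recommend", "VERB", "VBN",
             "Tense=Past|VerbForm=Part|Voice=Pass", "0", "root", "0:root",
             "SpaceAfter=No"], '\t')
tok = parse_token_line(line)
println(tok.lemma)                        # the LEMMA column
println(parse_feats(tok.feats)["Voice"])  # one morphological feature
```

The package's own types (such as the `Features` and `NodeRef` column types in the DataFrame output further down) are richer than the raw strings used in this sketch.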

With my package, you can load it into Julia like this:

treebank = UD.load("path_to_my_treebank.conllu")

And then render it in three different ways (it's the third sentence in the file, hence the [3] indexing):

julia> render(ArcStyle(), treebank[3])
# sent_id = weblog-3
# text = Highly recommended!
             root
              ╭──punct───╮
   ╭──advmod──╮          │
Highly   recommended   !
ADV      VERB          PUNCT

julia> render(CompactStyle(), treebank[3])
# sent_id = weblog-3
# text = Highly recommended!
Highly  recommended  !
ADV     VERB         PUNCT

julia> render(TableStyle(), treebank[3])
# sent_id = weblog-3
# text = Highly recommended!
1  Highly       highly     ADV    RB   Degree=Pos                           2  advmod  2:advmod  _
2  recommended  recommend  VERB   VBN  Tense=Past|VerbForm=Part|Voice=Pass  0  root    0:root    SpaceAfter=No
3  !            !          PUNCT  .    _                                    2  punct   2:punct   _
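
The arcs in the first rendering come straight from the HEAD and DEPREL columns of the table: each token points at the ID of its head, and a head of 0 marks the root. As a rough illustration in plain Julia (again not the package's API; `arcs` is a hypothetical helper, and the sketch assumes simple integer IDs with no multiword tokens or empty nodes):

```julia
# (form, head, deprel) for the example sentence. Token IDs are the
# positions 1..3 in this vector, and head == 0 marks the root.
tokens = [
    (form = "Highly",      head = 2, deprel = "advmod"),
    (form = "recommended", head = 0, deprel = "root"),
    (form = "!",           head = 2, deprel = "punct"),
]

# Hypothetical helper: describe each arc as "head --deprel--> dependent".
function arcs(tokens)
    out = String[]
    for tok in tokens
        head_form = tok.head == 0 ? "ROOT" : tokens[tok.head].form
        push!(out, "$head_form --$(tok.deprel)--> $(tok.form)")
    end
    return out
end

foreach(println, arcs(tokens))
# recommended --advmod--> Highly
# ROOT --root--> recommended
# recommended --punct--> !
```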

It can also be trivially converted to a DataFrame, either for the whole CoNLL-U file or for just an individual sentence, as below:

julia> using DataFrames
julia> DataFrame(treebank[3])
3×10 DataFrame
 Row │ id       form         lemma      upos    xpos    feats                              head     deprel  deps       misc ⋯
     │ NodeRef  String       String     String  String  Features                           NodeRef  String  Enhanced…  Feat ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 1        Highly       highly     ADV     RB      Degree=Pos                         2        advmod  2:advmod   _    ⋯
   2 │ 2        recommended  recommend  VERB    VBN     Tense=Past|VerbForm=Part|Voice=P…  0        root    0:root     Spac
   3 │ 3        !            !          PUNCT   .       _                                  2        punct   2:punct    _
                                                                                                             1 column omitted

Also supported: editing and writing CoNLL-U to disk.

Coming soon:

  • validation of UD rules
  • comparison utilities (e.g. diffing of two CoNLL-U files)
  • support for additional formats like JSON, TEI, CoNLL-U+

Related: For anyone interested in this domain, I also invite you to preview my corpus query engine, montre. Montre is written in Rust, but Julia bindings for it are under active development and should be coming very soon. Montre is designed to make it easy to run fast, powerful statistics over UD data.
