Hello, I’d like to do some basic NLP tasks (POS Tagging, lemmatization) in languages other than English. . Or, more specifically, German.
Currently, I am using SpaCy through PyCall with the german model. This is fine, however I need to tag lot sof items and this is pretty slow. I have not been able to paralellize it due to the GIL, however I think there might be ways to do this. So I guess I have the following options:
-
Keep using Python, try to use
Distributed
andpmap
as shown here: Run multiple python instances with pycall in different threads - #2 by cjdoris This might require some heavy restructuring in my code: I have very long chains of functions with the python calls intertwinned. I don’t know if I can just slap@everywhere
in front of every single function. -
Try to use Transformers.jl with some state of the art model. However, I am kind of clueless as to how do to that. There is an example to encode a sentence, but not sure how to get from there to somewhere practical.
-
Since I don’t really need state-of-the-art but just something workable, I could port a simple Python package that achieves these tasks. This is tedious but at least a known quantity.
Does anyone have any insights or recommendations? Some package I overlooked or a simple solution?
Thanks in advance and I hope the post is not too unfocused.