I am trying to build a corpus with a custom tokenizer in Julia, but I cannot figure out how. I tried TextAnalysis.jl's Corpus constructor, but it does not seem to let me specify a tokenizer. How can I build a corpus with a custom tokenizer in Julia?
If you have a function that turns a string into a vector of strings (the tokens), you can pass that token vector directly to a TokenDocument instead of passing the original string. Then you can build your Corpus from a vector of TokenDocuments.
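using TextAnalysis
# my_tokenizer_function is your own function: String -> Vector{String}
token_vector = my_tokenizer_function(a_document_string)
token_doc = TokenDocument(token_vector)  # stores the tokens as given
corpus = Corpus([token_doc])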
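For several documents, the same idea might look like the sketch below. The tokenizer `my_tokenizer` and the sample strings are hypothetical stand-ins for illustration (here it just lowercases and splits on anything that is not a letter or digit); swap in whatever tokenizer you actually use. It assumes TextAnalysis.jl is installed.

```julia
using TextAnalysis

# Hypothetical custom tokenizer: lowercase, then split on runs of
# characters that are not a-z or 0-9. String.(...) converts the
# SubStrings returned by split into a Vector{String}.
my_tokenizer(s::AbstractString) =
    String.(split(lowercase(s), r"[^a-z0-9]+"; keepempty = false))

# Sample input, made up for this example.
raw_texts = ["Hello, world! Hello again.", "Custom tokenizers in Julia."]

# One TokenDocument per input string, each built from pre-computed tokens.
docs = [TokenDocument(my_tokenizer(t)) for t in raw_texts]
corpus = Corpus(docs)

# Build the corpus-wide term counts from the tokens supplied above.
update_lexicon!(corpus)

println(tokens(docs[1]))
println(lexicon(corpus))
```

Because each document is created from a token vector, the tokens you see from `tokens(doc)` and in the lexicon should be the ones your own function produced.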