Creating a corpus with a custom tokenizer

Jack_N · June 17, 2022, 2:15am

I am trying to build a corpus with a custom tokenizer in Julia, but I cannot figure out how to do it. I tried using the Corpus constructor from TextAnalysis.jl to create the corpus, but it does not let me specify a tokenizer. How can I make a corpus with a custom tokenizer in Julia?

Thanks,

Jack

awasserman · June 17, 2022, 4:05am

If you have a function that turns a string into a vector of strings, you can pass its output token vector directly to a TokenDocument instead of passing the original string. Then you can build your Corpus from a vector of TokenDocuments.

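using TextAnalysis

# my_tokenizer_function and a_document_string are placeholders for your own tokenizer and text
token_vector = my_tokenizer_function(a_document_string)
token_doc = TokenDocument(token_vector)
corpus = Corpus([token_doc])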
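
For several documents, the same idea might look roughly like the sketch below. The tokenizer `my_tokenizer` and the sample strings are made up for illustration (they are not part of TextAnalysis.jl), and the sketch assumes a simple lowercase-and-split-on-word-characters rule is what you want; swap in whatever tokenizer you actually have.

```julia
using TextAnalysis

# Hypothetical custom tokenizer: lowercase the text and keep runs of word characters.
# Any function that returns a Vector{String} should work in its place.
my_tokenizer(text::AbstractString) =
    String[m.match for m in eachmatch(r"\w+", lowercase(text))]

# Made-up sample documents, just for the sketch.
docs = ["The quick brown fox", "Jumps over the lazy dog!"]

# Tokenize each string yourself, then wrap each token vector in a TokenDocument.
token_docs = [TokenDocument(my_tokenizer(d)) for d in docs]

# Build the corpus from the vector of TokenDocuments.
corpus = Corpus(token_docs)

# Inspect the tokens stored in the first document.
println(tokens(token_docs[1]))
```

Since the documents are already tokenized, the corpus stores your tokens as given rather than running the package's default tokenizer on the raw strings.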
