Hello, I’m currently trying to use TextAnalysis.jl for the first time, and I can’t figure out this problem.
I can tokenize a string like so:
using TextAnalysis
test = "Hello, I like apples and oranges. I like ice-cream."
test_doc = TokenDocument(test)
I can then access unigrams as well as bigrams from the token document:
tokens(test_doc)
ngrams(test_doc, 2)
What I can’t figure out is how to create a bigram (or general n-gram) document-term matrix from it. I know that I first have to convert it into a corpus object:
crps = Corpus([test_doc])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
dtm(m, :dense)
What I can’t tell from the TextAnalysis documentation is whether I can choose the n-gram order used to build the DTM. It seems reasonable that I could, since the information is already available in the tokenized document.
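One possible direction, given only as an unverified sketch: build the corpus from an `NGramDocument` instead of a `TokenDocument`. This assumes that `NGramDocument(text, 2)` stores bigram counts and that `update_lexicon!` and `DocumentTermMatrix` pick up whatever n-grams the document holds. Neither assumption is confirmed here, and the names `bigram_doc` and `bigram_counts` are only illustrative.

```julia
using TextAnalysis

test = "Hello, I like apples and oranges. I like ice-cream."

# Assumption: the second argument selects the n-gram order (2 = bigrams).
bigram_doc = NGramDocument(test, 2)

crps = Corpus([bigram_doc])
update_lexicon!(crps)            # lexicon keys should now be bigrams, if the assumption holds

m = DocumentTermMatrix(crps)
bigram_counts = dtm(m, :dense)   # one row per document, one column per term in m.terms
```

If this behaves as assumed, `m.terms` would list the bigrams that label the columns of `bigram_counts`.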