Generating an ngram document term matrix with TextAnalysis.jl

PeterTolochko · February 18, 2021, 7:21pm

Hello, I’m currently trying to use TextAnalysis.jl for the first time, and I can’t figure out this problem.

I can tokenize a string like so:

using TextAnalysis
test = "Hello, I like apples and oranges. I like ice-cream."
test_doc = TokenDocument(test)

And then, I can access unigrams as well as bigrams from the token object:

tokens(test_doc)
ngrams(test_doc, 2)

What I can’t figure out is how to create a bigram (or, more generally, ngram) document term matrix from it. I know that I first have to convert the document into a corpus object:

crps = Corpus([test_doc])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
dtm(m, :dense)

What I can’t tell from the TextAnalysis documentation is whether I can choose the ngram size used to build the DTM. It seems reasonable that I could, since the information is already there in the tokenized object.

joaomacalos · June 19, 2021, 6:28am

I recently wanted to do the same thing. I solved it by using NGramDocument() instead of TokenDocument():

using TextAnalysis
test = "Hello, I like apples and oranges. I like ice-cream."
test_doc = NGramDocument(test, 2)

crps = Corpus([test_doc])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
dtm(m, :dense)
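
To check that the columns really are bigrams, it can help to print each term next to its count. The snippet below is only a sketch: it assumes the DocumentTermMatrix object exposes its column labels through a `terms` field, which may differ between TextAnalysis versions, and the exact terms depend on the tokenizer the package uses.

```julia
using TextAnalysis

test = "Hello, I like apples and oranges. I like ice-cream."

# Build the document from bigrams (n = 2) rather than single tokens.
bigram_doc = NGramDocument(test, 2)

crps = Corpus([bigram_doc])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)

# Assumption: the column labels live in the `terms` field of the matrix object.
terms = m.terms
counts = dtm(m, :dense)   # one row per document, one column per term

# Print each bigram alongside its count in the first (and only) document.
for (j, term) in enumerate(terms)
    println(rpad(term, 20), counts[1, j])
end
```

Changing the 2 in NGramDocument(test, 2) to another n should give the matrix for that ngram size.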

PeterTolochko · June 21, 2021, 12:49pm

Hi, yep, I figured that out later too, but never replied to my own post.
Thanks for pointing out the solution!
