Push!() tokendocument into corpus

korilium · November 23, 2021, 10:06pm

Hello everybody,

I am trying to push TokenDocument objects into a Corpus from the TextAnalysis.jl package.

Here is the code:

y = Corpus([])
for i in data.row
 x = TokenDocument([m.match for m = eachmatch(r"\\[rnt](*SKIP)(*F)|\w+(?:['-,-:\/.(X X)]\w+)*",data.Omschrijving[i] )])
 push!(y, x)
end

The result is still an empty corpus, but one that lists X documents, like this:

A Corpus with 1730 documents:
 * 0 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

As you can see, I cannot get my documents into the corpus, either with push!() or by writing them out explicitly.
Does anyone have some ideas about resolving this?

Kind regards,

Korilium

goerch · November 23, 2021, 10:10pm

I’m not sure I’m able to reproduce this with the MWE. Could you simplify it, for example with some constant input from global variables?

korilium · November 23, 2021, 10:37pm

Hey goerch,

I already found the issue.
Apparently, TokenDocument cannot process the SubString{String} type that I get from iterating over eachmatch. Instead, it gives an empty TokenDocument and thereby an empty corpus.

Here is some code for reproducibility:

using TextAnalysis

a = "Vader: hahahahahaha, ja genoeg gelachen maar was wel lachen"
b = "Luke: Was helemaal niet grappig " 
c = "Vader: hou je klep!"
d = "Luke: Je ben niet grappig! "
e = "Vader: Ik ben niet alleen een komiek, maar ik ben ook je oma. je oma ja... hehe"
f = "Luke: Nee dat is niet waar. Ik heb geen OMA!   "
g = "Vader: Toch waar he. Ja hoor spring maar hoor"
h = "Luke: Ik hoop dat je zal branden in de hel"

vector = [a,b,c,d,e,f,g,h]

y = Corpus([])
for i in 1:length(vector)
 x = TokenDocument([m.match for m = eachmatch(r"\\[rnt](*SKIP)(*F)|\w+(?:['-,-:\/.(X X)]\w+)*",vector[i])])
 push!(y, x)
end

Solution:

y = Corpus([])
for i in 1:length(vector)
 x = StringDocument(vector[i])
 push!(y, x)
end
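
If the custom tokenization should be kept instead of switching to StringDocument, converting each match to a String first might be enough. Below is a minimal sketch that uses only Base. The names text, raw and tokens are made up for illustration, the regex r"\w+" is a simplified stand-in for the longer pattern above, and it is an assumption (not checked against a specific TextAnalysis version) that TokenDocument handles a Vector{String} as expected:

```julia
# Sketch using only Base; r"\w+" is a simplified stand-in for the pattern above.
text = "Vader: hou je klep!"

# eachmatch yields RegexMatch objects; their .match field is a SubString{String}.
raw = [m.match for m in eachmatch(r"\w+", text)]

# Convert every SubString into an independent String.
tokens = String.(raw)

println(eltype(raw))     # element type of the raw matches
println(eltype(tokens))  # element type after the conversion

# Assumption: with TextAnalysis loaded, `tokens` could then be passed on,
# e.g. push!(y, TokenDocument(tokens))
```

Inside the loop above, the same idea would amount to writing String(m.match) instead of m.match in the comprehension.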

goerch · November 23, 2021, 10:43pm

Good to know! Thank you for the feedback.
