push!() a TokenDocument into a Corpus

Hello everybody,

I am trying to push a TokenDocument into a Corpus from the TextAnalysis.jl package.

Here is the code:

y = Corpus([])
for i in data.row
    x = TokenDocument([m.match for m in eachmatch(r"\\[rnt](*SKIP)(*F)|\w+(?:['-,-:\/.(X X)]\w+)*", data.Omschrijving[i])])
    push!(y, x)
end

The result still looks like an empty corpus, even though it reports X documents:

A Corpus with 1730 documents:
 * 0 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

As you can see, I cannot enter all my documents into the corpus with push!() or by writing it out.
Does anyone have some ideas about resolving this?

Kind regards,

Korilium

I’m not sure I’m able to reproduce this with the MWE. Could you simplify it, for example with some constant input from global variables?

Hey georch,

I already found the issue.
Apparently, TokenDocument cannot process the SubString{String} type, which is what I get when iterating over eachmatch. Instead, it gives an empty TokenDocument and thereby an empty corpus.
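
To make the type issue concrete, here is a minimal sketch in Base Julia only. It shows that the `.match` field of each `RegexMatch` is a `SubString{String}`, and that broadcasting `String` over the vector yields owned `String` tokens. The pattern `r"\w+"` is a simplified stand-in for the regex above, and the names `substrings` and `tokens` are illustrative. Whether passing the converted vector to `TokenDocument` avoids the problem is an assumption that was not tested in this thread.

```julia
# Base Julia only; no TextAnalysis calls are made here.
text = "Vader: hou je klep!"

# eachmatch yields RegexMatch objects; their .match field is a SubString{String}.
# r"\w+" is a simplified stand-in for the longer pattern used in the post.
substrings = [m.match for m in eachmatch(r"\w+", text)]

# Broadcasting String over the vector copies each view into an owned String.
tokens = String.(substrings)

println(eltype(substrings))
println(eltype(tokens))
println(tokens)
```

If the conversion does help, the resulting `tokens` vector would be what gets handed to `TokenDocument` in place of the raw matches.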

Here is some code for reproducibility:

using TextAnalysis

a = "Vader: hahahahahaha, ja genoeg gelachen maar was wel lachen"
b = "Luke: Was helemaal niet grappig " 
c = "Vader: hou je klep!"
d = "Luke: Je ben niet grappig! "
e = "Vader: Ik ben niet alleen een komiek, maar ik ben ook je oma. je oma ja... hehe"
f = "Luke: Nee dat is niet waar. Ik heb geen OMA!   "
g = "Vader: Toch waar he. Ja hoor spring maar hoor"
h = "Luke: Ik hoop dat je zal branden in de hel"

vector = [a,b,c,d,e,f,g,h]

y = Corpus([])
for i in eachindex(vector)
    # m.match is a SubString{String}, so the token vector holds SubStrings
    x = TokenDocument([m.match for m in eachmatch(r"\\[rnt](*SKIP)(*F)|\w+(?:['-,-:\/.(X X)]\w+)*", vector[i])])
    push!(y, x)
end

Solution:

y = Corpus([])
for i in eachindex(vector)
    # Wrap the raw string directly instead of pre-tokenizing it
    x = StringDocument(vector[i])
    push!(y, x)
end
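
The summary above also reports a lexicon of 0 tokens. One thing that may be worth checking is whether the lexicon has been built yet: per the TextAnalysis.jl documentation, it is only populated once `update_lexicon!` is called. The following is a minimal sketch under that assumption. It builds the corpus from a typed vector of `StringDocument`s rather than from `Corpus([])`, and the sample strings and the name `crps` are illustrative.

```julia
using TextAnalysis

# Two of the sample sentences from the reproduction above.
texts = ["Vader: hou je klep!", "Luke: Ik hoop dat je zal branden in de hel"]

# Build the corpus in one go from a vector of StringDocuments.
crps = Corpus([StringDocument(t) for t in texts])

# The lexicon stays empty until it is explicitly updated.
update_lexicon!(crps)

println(length(crps))           # number of documents in the corpus
println(length(lexicon(crps)))  # number of distinct tokens after the update
```

This does not change the solution above. It only separates "the corpus holds no documents" from "the lexicon has not been computed yet", which can look the same in the printed summary.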

Good to know! Thank you for the feedback.