I am trying to push TokenDocument objects into a Corpus from the TextAnalysis.jl package.
Here is the code:
y = Corpus([])
for i in data.row
    x = TokenDocument([m.match for m in eachmatch(r"\\[rnt](*SKIP)(*F)|\w+(?:['-,-:\/.(X X)]\w+)*", data.Omschrijving[i])])
    push!(y, x)
end
The result is still an effectively empty corpus: it reports a number of documents, but none of them is counted and no tokens are registered, like this:
A Corpus with 1730 documents:
* 0 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
As you can see, I cannot get my documents into the corpus, either with push!() or by writing them out explicitly.
Does anyone have an idea how to resolve this?
I have already found the issue.
Apparently, TokenDocument cannot process the SubString{String} type, which is what I get from iterating over eachmatch. Instead, it gives an empty TokenDocument and therefore an empty corpus.
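To make the type issue described above visible, here is a minimal sketch using only Base Julia. It assumes a simplified pattern (`\w+`) in place of the regex from the question, and the variable names are purely illustrative.

```julia
# A simplified pattern (plain word characters) stands in for the regex from the question.
text = "Vader: hou je klep!"

# m.match is a view into the original string, so this yields SubString{String} elements.
raw_tokens = [m.match for m in eachmatch(r"\w+", text)]
println(eltype(raw_tokens))   # SubString{String}

# Wrapping each match in String() copies it into an independent String.
tokens = [String(m.match) for m in eachmatch(r"\w+", text)]
println(eltype(tokens))       # String
println(tokens)               # ["Vader", "hou", "je", "klep"]
```

Checking `eltype` like this is a quick way to see what is actually being handed to a constructor.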
Here is some code for reproducibility:
using TextAnalysis
a = "Vader: hahahahahaha, ja genoeg gelachen maar was wel lachen"
b = "Luke: Was helemaal niet grappig "
c = "Vader: hou je klep!"
d = "Luke: Je ben niet grappig! "
e = "Vader: Ik ben niet alleen een komiek, maar ik ben ook je oma. je oma ja... hehe"
f = "Luke: Nee dat is niet waar. Ik heb geen OMA! "
g = "Vader: Toch waar he. Ja hoor spring maar hoor"
h = "Luke: Ik hoop dat je zal branden in de hel"
vector = [a,b,c,d,e,f,g,h]
y = Corpus([])
for i in 1:length(vector)
    x = TokenDocument([m.match for m in eachmatch(r"\\[rnt](*SKIP)(*F)|\w+(?:['-,-:\/.(X X)]\w+)*", vector[i])])
    push!(y, x)
end
Solution:
y = Corpus([])
for i in 1:length(vector)
    # StringDocument takes the raw text directly, so no SubString tokens are involved
    x = StringDocument(vector[i])
    push!(y, x)
end
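If the regex-based tokenization should be kept instead of switching to StringDocument, one possible alternative is to convert each match to a String before building the TokenDocument. This is only a sketch and I have not verified it against every TextAnalysis.jl version. It assumes the SubString{String} element type really is the cause, and it uses a simplified stand-in pattern and a shortened `vector`. As far as I know, the corpus lexicon is not built automatically, so the sketch also calls `update_lexicon!` before inspecting it.

```julia
using TextAnalysis

# Shortened sample data; the full vector from the reproduction code works the same way.
vector = ["Vader: hou je klep!", "Luke: Je ben niet grappig! "]

# Simplified stand-in pattern (assumption); substitute the original regex as needed.
pattern = r"\w+"

# Convert every match to String so each TokenDocument receives a Vector{String}.
docs = [TokenDocument([String(m.match) for m in eachmatch(pattern, s)]) for s in vector]
y = Corpus(docs)

# The lexicon stays empty until it is updated explicitly.
update_lexicon!(y)
println(length(lexicon(y)))
```

Building the list of documents first and passing it to `Corpus` avoids starting from the untyped `Corpus([])`.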