I am trying to build a model for sensitivity analysis that keeps only the tokens shared by a training set and a test set of documents (their intersection). To do this, I convert both sets of documents into a Corpus and then filter out the words that do not appear in both.
The problem is the remove_words!() function. It takes a document or corpus and a list of words to remove, like so:
julia> str="The quick brown fox jumps over the lazy dog"
julia> sd=StringDocument(str);
julia> remove_words = ["fox", "over"]
julia> remove_words!(sd, remove_words)
julia> sd.text
"the quick brown jumps the lazy dog"
(This example is taken from the documentation.)
However, when I call it on my corpus with the full array of words to remove, the following error is thrown:
(TextHW2) julia> removal= symdiff(train_mat.terms, test_docs.terms )
20436-element Vector{String}:
"aaaa"
"aaaaaaaaaaaaaaand"
⋮
"zwischendurch"
"zzzzzzz"
(TextHW2) julia> remove_words!(train_docs, removal)
ERROR: PCRE compilation error: regular expression is too large at offset 161853
Stacktrace:
[1] error(s::String)
@ Base .\error.jl:35
[2] compile(pattern::String, options::UInt32)
@ Base.PCRE .\pcre.jl:155
[3] compile(regex::Regex)
@ Base .\regex.jl:82
[4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
@ Base .\regex.jl:47
[5] Regex
@ .\regex.jl:70 [inlined]
[6] mk_regex(regex_string::String)
@ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:31
[7] _combine_regex(regex_parts::Set{AbstractString})
@ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:547
[8] _build_regex(lang::Languages.English, flags::UInt32, patterns::Set{AbstractString}, words::Set{AbstractString})
@ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:542
[9] prepare!(crps::Corpus{StringDocument{String}}, flags::UInt32; skip_patterns::Set{AbstractString}, skip_words::Set{AbstractString})
@ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:414
[10] remove_words!(entity::Corpus{StringDocument{String}}, words::Vector{String})
@ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:227
[11] top-level scope
@ REPL[21]:1
This can mostly be avoided by iterating over the words to remove and removing them one at a time, but that workaround is not perfect. Does anyone know why this error is happening?
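For reference, here is a minimal sketch of a middle ground between one call with everything and one call per word. The stack trace shows remove_words! ending up in _combine_regex and Regex, which suggests the whole word list is compiled into a single pattern, so smaller batches may stay under the PCRE size limit. The helper remove_in_batches! and the batch_size default of 1000 are my own hypothetical choices, not part of TextAnalysis, and the right batch size is a guess.

```julia
# Hypothetical helper (not part of TextAnalysis): call `remove!` on `words`
# in batches so that no single call receives the whole list at once.
function remove_in_batches!(remove!, entity, words::AbstractVector{<:AbstractString};
                            batch_size::Integer = 1_000)  # assumed default, tune as needed
    for batch in Iterators.partition(words, batch_size)
        # `partition` yields views; `collect` turns each one into a plain Vector
        remove!(entity, collect(batch))
    end
    return entity
end
```

With the variables from the session above, the call would be remove_in_batches!(remove_words!, train_docs, removal). Passing the removal function as an argument keeps the helper independent of the package, so it can be tried with a stand-in function first.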