Text Analysis API Error

I am attempting to build a model for sensitivity analysis by keeping only the intersection of tokens between a test set and a training set of documents. To do this I am converting both sets of documents into a Corpus and filtering out chars.

The issue is with the remove_words!() function, which takes a document or corpus and a list of words to remove, like so:

julia> str="The quick brown fox jumps over the lazy dog"
julia> sd=StringDocument(str);
julia> remove_words = ["fox", "over"]
julia> remove_words!(sd, remove_words)
julia> sd.text
"the quick brown   jumps   the lazy dog"

from the documentation.

However, when I call it with the full array of words, this error is thrown:

(TextHW2) julia> removal= symdiff(train_mat.terms, test_docs.terms )
20436-element Vector{String}:
 "aaaa"
 "aaaaaaaaaaaaaaand"
 ⋮
 "zwischendurch"
 "zzzzzzz"


(TextHW2) julia> remove_words!(train_docs, removal)
ERROR: PCRE compilation error: regular expression is too large at offset 161853
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:35
  [2] compile(pattern::String, options::UInt32)
    @ Base.PCRE .\pcre.jl:155
  [3] compile(regex::Regex)
    @ Base .\regex.jl:82
  [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
    @ Base .\regex.jl:47
  [5] Regex
    @ .\regex.jl:70 [inlined]
  [6] mk_regex(regex_string::String)
    @ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:31
  [7] _combine_regex(regex_parts::Set{AbstractString})
    @ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:547
  [8] _build_regex(lang::Languages.English, flags::UInt32, patterns::Set{AbstractString}, words::Set{AbstractString})
    @ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:542
  [9] prepare!(crps::Corpus{StringDocument{String}}, flags::UInt32; skip_patterns::Set{AbstractString}, skip_words::Set{AbstractString})
    @ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:414
 [10] remove_words!(entity::Corpus{StringDocument{String}}, words::Vector{String})
    @ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:227
 [11] top-level scope
    @ REPL[21]:1

This can mostly be avoided by iterating over the words to remove and passing them in one at a time, but that is not a perfect solution. Does anyone know why this error is happening?

 regular expression is too large at offset 161853

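Basically, remove_words! combines all the words into one long regex (see _combine_regex in the stack trace), and with 20436 words that exceeds the size limit PCRE places on a compiled pattern.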
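To see how quickly such a pattern grows, here is a rough sketch using only Base. The vocabulary is a made-up stand-in, and the `\b(w1|w2|...)\b` shape is only an assumption about how the library joins the words, not its actual code. Whether compilation fails depends on the PCRE build, so the attempt is wrapped in try/catch rather than promising a particular result.

```julia
# Hypothetical stand-in vocabulary; the real `removal` vector had 20436 entries.
words = ["word$(i)" for i in 1:20_000]

# Assumption: every word ends up in a single alternation pattern.
pattern = "\\b(" * join(words, "|") * ")\\b"
println("pattern length: ", length(pattern), " characters")

# Regex(...) compiles the pattern right away, so a size error surfaces here.
try
    Regex(pattern)
    println("compiled fine on this PCRE build")
catch err
    println("compilation failed: ", err)
end
```

The error offset in the original post (161853) is of the same order as this pattern length, which fits the picture of one very long alternation.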


Thanks, I figured it out last night after realizing I could break the removal array into chunks like 1:3000 and so on. Unfortunately the method does not accept views, so I’ll have to live with allocating the chunks, haha. Anyway, thank you for the clarification!
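For anyone landing here later, the chunking idea could look roughly like the sketch below. `remove_in_chunks!` is a hypothetical helper (not part of TextAnalysis), and the chunk size of 3000 is just the figure mentioned above; whether a given size stays under the regex limit depends on how long the words are. The removal function is passed in as an argument so the sketch runs without the package, and `record!` is a stand-in that only logs chunk sizes.

```julia
# Hypothetical helper: apply a mutating removal function one chunk at a time.
# `remove!` is whatever does the real work, e.g. remove_words! from TextAnalysis.
function remove_in_chunks!(remove!, entity, words::AbstractVector{<:AbstractString};
                           chunk_size::Integer = 3000)
    for chunk in Iterators.partition(words, chunk_size)
        # partition yields views into `words`; collect copies each view into a
        # plain Vector, since the method reportedly does not accept views.
        remove!(entity, collect(chunk))
    end
    return entity
end

# Stand-in demonstration without TextAnalysis: record the size of each chunk.
sizes = Int[]
record!(log, chunk) = push!(log, length(chunk))
remove_in_chunks!(record!, sizes, ["w$(i)" for i in 1:7000])
println(sizes)  # prints [3000, 3000, 1000]
```

With TextAnalysis loaded, the call would presumably be remove_in_chunks!(remove_words!, train_docs, removal). Each chunk is still copied, so this does not avoid the allocations, it just hides the loop.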