Text Analysis API Error

nosewitz · October 5, 2022, 12:03am

I am attempting to build a model for sensitivity analysis by keeping only the intersection of tokens between a test set and a training set of documents. To do this, I am converting both sets of documents into a Corpus and filtering out characters.

The issue is with the remove_words!() function, which takes a document or corpus and a list of words to remove, like so:

julia> str="The quick brown fox jumps over the lazy dog"
julia> sd=StringDocument(str);
julia> remove_words = ["fox", "over"]
julia> remove_words!(sd, remove_words)
julia> sd.text
"the quick brown   jumps   the lazy dog"

from the documentation.

However, when I pass it the full array of words, this error is thrown:

(TextHW2) julia> removal= symdiff(train_mat.terms, test_docs.terms )
20436-element Vector{String}:
 "aaaa"
 "aaaaaaaaaaaaaaand"
 ⋮
 "zwischendurch"
 "zzzzzzz"


(TextHW2) julia> remove_words!(train_docs, removal)
ERROR: PCRE compilation error: regular expression is too large at offset 161853
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:35
  [2] compile(pattern::String, options::UInt32)
    @ Base.PCRE .\pcre.jl:155
  [3] compile(regex::Regex)
    @ Base .\regex.jl:82
  [4] Regex(pattern::String, compile_options::UInt32, match_options::UInt32)
    @ Base .\regex.jl:47
  [5] Regex
    @ .\regex.jl:70 [inlined]
  [6] mk_regex(regex_string::String)
    @ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:31
  [7] _combine_regex(regex_parts::Set{AbstractString})
    @ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:547
  [8] _build_regex(lang::Languages.English, flags::UInt32, patterns::Set{AbstractString}, words::Set{AbstractString})
    @ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:542
  [9] prepare!(crps::Corpus{StringDocument{String}}, flags::UInt32; skip_patterns::Set{AbstractString}, skip_words::Set{AbstractString})
    @ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:414
 [10] remove_words!(entity::Corpus{StringDocument{String}}, words::Vector{String})
    @ TextAnalysis C:\Users\aledo\.julia\packages\TextAnalysis\B0QxG\src\preprocessing.jl:227
 [11] top-level scope
    @ REPL[21]:1

This can mostly be avoided by iterating over the list and removing the words one at a time, but that is not ideal. Does anyone know why this error is happening?

jling · October 5, 2022, 3:58am

 regular expression is too large at offset 161853

Basically, all the words get put into one long regex, and that regex exceeded the size limit of the regex engine (PCRE).
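As a rough illustration of the scale involved, here is a small sketch. It assumes the words are joined into a single alternation with "|" (inferred from `_combine_regex` in the stack trace, not checked against the package source), and it uses made-up placeholder words rather than the real `removal` array.

```julia
# Placeholder vocabulary of roughly the same size as the `removal` array in the post.
words = ["word$(i)" for i in 1:20_000]

# Assumption: the package builds one alternation such as "word1|word2|...".
pattern = join(words, "|")

# The pattern string alone is far longer than the offset reported in the error.
println("pattern length: ", length(pattern), " characters")
```

With real words the exact length will differ, but any list of this size produces a pattern of well over a hundred thousand characters, which is in the range where the error above was raised.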

nosewitz · October 5, 2022, 1:36pm

Thanks, I figured it out last night after realizing I could break up the removal array into chunks (1:3000 and so on) and call remove_words! once per chunk. Unfortunately, the method does not accept views, so I’ll have to live with allocating a copy of each chunk, haha. Anyway, thank you for the clarification!
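For anyone landing here later, the chunking workaround might look something like the sketch below. `remove_in_chunks!` is a hypothetical helper, not part of TextAnalysis, and the default chunk size of 3000 is just the value mentioned above rather than a documented limit. The removal function is passed in as an argument so the helper itself only depends on Base.

```julia
# Hypothetical helper (not part of TextAnalysis): call `remove!` once per batch
# so that no single call has to build a regex from the whole word list.
function remove_in_chunks!(remove!, target, words::AbstractVector{<:AbstractString};
                           chunk_size::Integer = 3000)
    for chunk in Iterators.partition(words, chunk_size)
        # `partition` yields views into `words`; `collect` copies each one into
        # a plain Vector, since views are reported above as not being accepted.
        remove!(target, collect(chunk))
    end
    return target
end

# Intended use with the names from this thread (requires `using TextAnalysis`):
# remove_in_chunks!(remove_words!, train_docs, removal)
```

If 3000 words per call still triggers the error for a vocabulary with longer words, lowering `chunk_size` should help, at the cost of more passes over the corpus.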
