Text Mining: Detect Strings: Word Lookup in a Large Corpus of Phrases Using a Large Dictionary

Palli · December 7, 2021, 5:37pm

If inline strings work much faster (as they should), then a new (relatively simple) hybrid string type should give a similar speed-up for variable-length strings. It’s on my radar, but not yet implemented, either by me or by anyone else that I know of. Such a package would be external to begin with (no performance downside, just not known to everyone), but could be integrated into Julia later. There’s no need for Julia 2 (meaning a breaking syntax change) for this, or for almost any speed increase I can think of.

lawless-m · December 7, 2021, 5:43pm

I’ll have a look tomorrow; I’ve walked home from work now. I might need to spend my electrobike money on something

lawless-m · December 8, 2021, 2:35pm

I got myself ready to start, but I don’t know R, so I don’t know exactly what this code does

aourednik · December 10, 2021, 2:32pm

I will propose a first draft of a Julia translation. I might need some time to get it approximately right

lawless-m · December 10, 2021, 2:37pm

or even some pseudo code

aourednik · December 10, 2021, 5:52pm

The real world use case is XML-annotation of named entities in a corpus of texts. There are 2 parts:

1. Creating a hashtable

reference2_ngrams is probably not necessary, since creating just the hash table myPattern is enough. However, using such a hash table is imposed by the stringr::str_replace_all(texts, pattern) function in R; perhaps in Julia it would be easier to just find the pattern and add the prefix <l> and the postfix </l>

reference2_ngrams = a string array composed of 1 to 4 words per element, e.g. ["LLKSL PJHCV XLAJP CIIIK", "IPZHD BSUPK", "VBUBO", "YYCFO DQWQR ZGKYO LLSZA", ...]
myPattern = a named vector of strings of the same length as reference2_ngrams ["LLKSL PJHCV XLAJP CIIIK":"<l>LLKSL PJHCV XLAJP CIIIK</l>", "IPZHD BSUPK":"<l>IPZHD BSUPK</l>", "VBUBO":"<l>VBUBO</l>", ...] # this is basically a hash table of n-grams to find and their replacements

2. Replacing all subphrases by their replacements given by the hashtable

Applying this to the textdata from the examples further up in this Julia Discourse topic:

result = string_replace(textdata, myPattern.keys, myPattern.values)
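
The two parts above could be sketched in Julia roughly as follows. This is only one possible reading of the pseudocode, not a tested translation of the R code: the sample `textdata` and the helper name `annotate` are made up for illustration, and a single alternation regex is an assumption that may not scale well to a very large dictionary.

```julia
# Hypothetical sample data; the real reference list would be much larger.
reference2_ngrams = ["LLKSL PJHCV XLAJP CIIIK", "IPZHD BSUPK", "VBUBO"]

# Part 1: hash table mapping each n-gram to its tagged replacement.
myPattern = Dict(ng => "<l>$ng</l>" for ng in reference2_ngrams)

# Part 2: one alternation regex, longest n-grams first, so that a longer
# n-gram wins over a shorter one it contains. \Q...\E makes PCRE treat each
# key literally (assumes no key contains the sequence \E).
sorted_keys = sort(collect(keys(myPattern)); by = length, rev = true)
pattern = Regex("\\b(?:" * join(("\\Q" * k * "\\E" for k in sorted_keys), "|") * ")\\b")

# The replacement is a function of the matched substring: look it up in the table.
annotate(text) = replace(text, pattern => m -> myPattern[String(m)])

textdata = ["foo IPZHD BSUPK bar", "VBUBO and LLKSL PJHCV XLAJP CIIIK"]  # made-up texts
result = annotate.(textdata)
```

Since every replacement is just the match wrapped in `<l>` and `</l>`, the `Dict` could be dropped in favour of `m -> "<l>$m</l>"`, which is the simplification suggested in part 1.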

lawless-m · December 13, 2021, 1:43pm

My interpretation is

given

<xml>
 <texts>
  <text>llksl PJHCV xlajp CIIIK</text>
  <text>IPZHD bsupk</text>
 </texts>
</xml>

and

<xml>
 <refs>
  <ref>PJHCV</ref>
  <ref>CIIIK</ref>
  <ref>IPZHD</ref>
 </refs>
</xml>

output

<xml>
 <texts>
  <text>llksl <l>PJHCV</l> xlajp <l>CIIIK</l></text>
  <text><l>IPZHD</l> bsupk</text>
 </texts>
</xml>

?

aourednik · December 15, 2021, 12:39am

It could be, more simply:
given

["llksl PJHCV xlajp CIIIK","IPZHD bsupk"]

and

refs = ["PJHCV", "CIIIK","IPZHD"]

output

["llksl <l>PJHCV</l> xlajp <l>CIIIK</l>","<l>IPZHD</l> bsupk"]
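
For this simpler single-word case, a minimal Julia sketch could keep the references in a `Set` and test each token for membership. The names `texts`, `tag` and `annotate_words` are hypothetical, and the sketch assumes words are separated by whitespace with no attached punctuation; runs of whitespace are collapsed to single spaces.

```julia
texts = ["llksl PJHCV xlajp CIIIK", "IPZHD bsupk"]
refs = Set(["PJHCV", "CIIIK", "IPZHD"])

# Wrap a token in <l>...</l> if it is in the dictionary, else return it unchanged.
tag(word) = word in refs ? "<l>$word</l>" : word

# Split on whitespace, tag each token, and join the tokens back with single spaces.
annotate_words(text) = join((tag(w) for w in split(text)), " ")

annotated = annotate_words.(texts)
```

Membership in a `Set` is a hash lookup, so the cost per token should not grow with the size of the dictionary, which matters when both the corpus and the dictionary are large.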

I just have a database column of data and would like to produce a second column with annotated named entities.

Another use case would be:

input

textfile1.txt:

llksl PJHCV xlajp CIIIK

textfile2.txt:

IPZHD bsupk

output

xmlfile1.xml

<xml>
 <metadata>...</metadata>
 <text>
  llksl <l>PJHCV</l> xlajp <l>CIIIK</l>
 </text>
</xml>

xmlfile2.xml

<xml>
 <metadata>...</metadata>
 <text>
  <l>IPZHD</l> bsupk
 </text>
</xml>

But the central part is getting the original array of texts annotated with tags
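
The file-based variant might look like the sketch below. The function name `annotate_file` and the temporary directory are assumptions for illustration, the `<metadata>` content is left as the same placeholder as in the example, and the text is not XML-escaped, which real data containing `&` or `<` would need.

```julia
refs = Set(["PJHCV", "CIIIK", "IPZHD"])
tag(word) = word in refs ? "<l>$word</l>" : word
annotate_words(text) = join((tag(w) for w in split(text)), " ")

# Read one plain-text file and write its annotated content as XML.
function annotate_file(txtpath, xmlpath)
    body = annotate_words(read(txtpath, String))
    open(xmlpath, "w") do io
        println(io, "<xml>")
        println(io, " <metadata>...</metadata>")  # placeholder, content unknown
        println(io, " <text>")
        println(io, "  ", body)
        println(io, " </text>")
        println(io, "</xml>")
    end
end

# Usage in a temporary directory, with the first example file from above.
dir = mktempdir()
write(joinpath(dir, "textfile1.txt"), "llksl PJHCV xlajp CIIIK\n")
annotate_file(joinpath(dir, "textfile1.txt"), joinpath(dir, "xmlfile1.xml"))
xml = read(joinpath(dir, "xmlfile1.xml"), String)
```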
