Text Mining: Detect Strings: Word Lookup in a Large Corpus of Phrases Using a Large Dictionary

If inline strings work much faster (as they should), then a new, relatively simple hybrid string type should give a similar speed-up for variable-length strings. It’s on my radar but not yet implemented, neither by me nor by anyone else that I know of. Such a package would be external to begin with (no performance downside, just not known to everyone), but could be integrated into Julia later. There’s no need for Julia 2 (meaning a breaking syntax change) for this, or for almost any speed increase I can think of.

I’ll have a look tomorrow; I’ve walked home from work now. I might need to spend my electric bike money on something :smile:

I got myself ready to start, but I don’t know R, so I don’t know exactly what this code does.

I will propose a first draft of a Julia translation. I might need some time to get it approximately right.

or even some pseudo code :slight_smile:

The real world use case is XML-annotation of named entities in a corpus of texts. There are 2 parts:

1. Creating a hashtable

reference2_ngrams is probably not necessary, since creating just the hash table myPattern is enough. However, using such a hash table is imposed by the stringr::str_replace_all(texts, pattern) function in R; perhaps in Julia it would be easier to just find the pattern and add the prefix <l> and the postfix </l>.

reference2_ngrams = a string array composed of 1 to 4 words per element, e.g. ["LLKSL PJHCV XLAJP CIIIK", "IPZHD BSUPK", "VBUBO", "YYCFO DQWQR ZGKYO LLSZA", ...]
myPattern = a named vector of strings of the same length as reference2_ngrams ["LLKSL PJHCV XLAJP CIIIK":"<l>LLKSL PJHCV XLAJP CIIIK</l>", "IPZHD BSUPK":"<l>IPZHD BSUPK</l>", "VBUBO":"<l>VBUBO</l>", ...] # this is basically a hash table of n-grams to find and their replacements
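A possible Julia counterpart of this first part, as a minimal sketch: a `Dict` built from the n-grams, with each key mapped to itself wrapped in `<l>` tags. The variable names are taken from the description above, and the four sample n-grams are just the ones quoted there.

```julia
# Sample n-grams copied from the description above
reference2_ngrams = ["LLKSL PJHCV XLAJP CIIIK", "IPZHD BSUPK", "VBUBO", "YYCFO DQWQR ZGKYO LLSZA"]

# Hash table: n-gram to find => its tagged replacement
myPattern = Dict(ngram => "<l>$ngram</l>" for ngram in reference2_ngrams)
```

Since every replacement is derived mechanically from its key, the table is only needed if the replace function asks for explicit pairs.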

2. Replacing all subphrases by their replacements given by the hashtable

Applying this to the textdata from the examples further up this Julia discourse topic:

result = string_replace(textdata, myPattern.keys, myPattern.values)
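In Julia, the pseudocode above could perhaps be written with the built-in `replace`, which accepts several `pattern => replacement` pairs at once from Julia 1.7 on. This is only a sketch: the two `textdata` strings are made-up samples, and sorting the pairs longest-first is my own precaution so that a long n-gram is preferred over a shorter one starting at the same position (assuming ties are resolved in argument order, as I believe they are).

```julia
# Requires Julia 1.7 or later (multi-pair `replace`)
# Hypothetical sample texts, in the spirit of the examples in this topic
textdata = ["aaa VBUBO bbb IPZHD BSUPK", "LLKSL PJHCV XLAJP CIIIK ccc"]

myPattern = Dict(
    "LLKSL PJHCV XLAJP CIIIK" => "<l>LLKSL PJHCV XLAJP CIIIK</l>",
    "IPZHD BSUPK" => "<l>IPZHD BSUPK</l>",
    "VBUBO" => "<l>VBUBO</l>",
)

# A Dict has no defined order, so fix one: longest n-grams first
rules = sort(collect(myPattern); by = p -> length(first(p)), rev = true)

# All pairs are applied in a single left-to-right pass over each text
result = [replace(t, rules...) for t in textdata]
```

Splatting a very large dictionary into one `replace` call may not scale well, so for a big dictionary this is better seen as a baseline to benchmark against.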

My interpretation is

given

<xml>
 <texts>
  <text>llksl PJHCV xlajp CIIIK</text>
  <text>IPZHD bsupk</text>
 </texts>
</xml>

and

<xml>
 <refs>
  <ref>PJHCV</ref>
  <ref>CIIIK</ref>
  <ref>IPZHD</ref>
 </refs>
</xml>

output

<xml>
 <texts>
  <text>llksl <l>PJHCV</l> xlajp <l>CIIIK</l></text>
  <text><l>IPZHD</l> bsupk</text>
 </texts>
</xml>

?

It could be, more simply:
given

["llksl PJHCV xlajp CIIIK", "IPZHD bsupk"]

and

refs = ["PJHCV", "CIIIK","IPZHD"]

output

["llksl <l>PJHCV</l> xlajp <l>CIIIK</l>","<l>IPZHD</l> bsupk"]
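For this simpler array form, and when every reference is a single word as in `refs` above, one option is to skip pattern matching entirely: split each text on spaces and look each word up in a `Set`. The function name `annotate_words` is hypothetical, and the sketch assumes words are separated by single spaces with no punctuation attached.

```julia
# Tag every space-separated word that appears in `refs`
function annotate_words(text::AbstractString, refs::AbstractSet)
    words = split(text, " ")
    tagged = [(w in refs) ? "<l>$w</l>" : w for w in words]
    return join(tagged, " ")
end

texts = ["llksl PJHCV xlajp CIIIK", "IPZHD bsupk"]
refs = Set(["PJHCV", "CIIIK", "IPZHD"])

output = [annotate_words(t, refs) for t in texts]
```

Each lookup is a hash probe, so the cost per text does not grow with the size of the dictionary. Multi-word n-grams would need a different approach, for example checking windows of 1 to 4 consecutive words.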

I just have a database column of data and would like to produce a second column with annotated named entities.

Another use case would be:

input

textfile1.txt:

llksl PJHCV xlajp CIIIK

textfile2.txt:

IPZHD bsupk

output

xmlfile1.xml

<xml>
 <metadata>...</metadata>
 <text>
  llksl <l>PJHCV</l> xlajp <l>CIIIK</l>
 </text>
</xml>

xmlfile2.xml

<xml>
 <metadata>...</metadata>
 <text>
  <l>IPZHD</l> bsupk
 </text>
</xml>
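The file-based use case could be sketched as follows. The function names `xml_escape` and `annotate_file` are hypothetical, the lookup is again limited to single space-separated words, and the `<metadata>...</metadata>` line is written as a literal placeholder because its content is not specified here.

```julia
# Minimal escaping of characters that are special in XML text
xml_escape(s) = replace(replace(replace(s, "&" => "&amp;"), "<" => "&lt;"), ">" => "&gt;")

# Read a plain-text file, tag the words found in `refs`, write an XML file
function annotate_file(inpath, outpath, refs)
    text = strip(read(inpath, String))
    words = split(text, " ")
    tagged = join([(w in refs) ? "<l>$(xml_escape(w))</l>" : xml_escape(w) for w in words], " ")
    open(outpath, "w") do io
        println(io, "<xml>")
        println(io, " <metadata>...</metadata>")  # placeholder, content unknown
        println(io, " <text>")
        println(io, "  ", tagged)
        println(io, " </text>")
        println(io, "</xml>")
    end
    return outpath
end
```

Calling `annotate_file("textfile1.txt", "xmlfile1.xml", Set(["PJHCV", "CIIIK", "IPZHD"]))` would then produce a file shaped like xmlfile1.xml above.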

But the central part is getting the original array of texts annotated with tags.