If inline strings work much faster (as they should), then a new (relatively simple) hybrid string type should give a similar speed-up for variable-length strings. It’s on my radar but not yet implemented, by me or by anyone else that I know of. Such a package would be external to begin with (no performance downside, it would just not be known to everyone), but could be integrated into Julia later. There’s no need for Julia 2 (meaning a breaking syntax change) for this, or for almost any speed increase I can think of.
I’ll have a look tomorrow; I’ve walked home from work now. I might need to spend my electric bike money on something
I got myself ready to start, but I don’t know R, so I don’t know exactly what this code does.
I will propose a first draft of a Julia translation. I might need some time to get it approximately right.
or even some pseudo code
The real world use case is XML-annotation of named entities in a corpus of texts. There are 2 parts:
1. Creating a hashtable
reference2_ngrams is probably not necessary, since creating just the hash table myPattern is enough. However, using such a hashtable is imposed by the stringr::str_replace_all(texts, pattern) function in R; perhaps in Julia it would be easier to just find the pattern and add the prefix <l> and the postfix </l>.
reference2_ngrams = a string array composed of 1 to 4 words per element, e.g. ["LLKSL PJHCV XLAJP CIIIK" , "IPZHD BSUPK" , "VBUBO" , "YYCFO DQWQR ZGKYO LLSZA", ....]
myPattern = a named vector of strings of the same length as reference2_ngrams [ "LLKSL PJHCV XLAJP CIIIK":"<l>LLKSL PJHCV XLAJP CIIIK</l>", "IPZHD BSUPK":"<l>IPZHD BSUPK</l>", "VBUBO":"<l>VBUBO</l>" ...] # this is basically a hash table of n-grams to find and their replacements
2. Replacing all subphrases by their replacements given by the hashtable
Applying this to the textdata from the examples further up this Julia Discourse topic:
result = string_replace(textdata, myPattern.keys, myPattern.values)
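The two parts above could be sketched in Julia roughly as follows. This is only a sketch, assuming Julia 1.7 or later, where replace accepts several pattern => replacement pairs for a string. The helper name annotate_with_dict and the sample textdata are made up for illustration; the n-grams are the ones from the example above.

```julia
# Part 1: the hash table of n-grams and their tagged replacements.
reference2_ngrams = ["LLKSL PJHCV XLAJP CIIIK", "IPZHD BSUPK", "VBUBO"]
myPattern = Dict(ng => "<l>$ng</l>" for ng in reference2_ngrams)

# Part 2: apply every replacement to one text.
# The pairs are sorted longest key first, with the intent that a longer
# n-gram is preferred over a shorter one starting at the same position.
function annotate_with_dict(text::AbstractString, patterns::AbstractDict)
    isempty(patterns) && return String(text)
    ordered = sort!(collect(patterns); by = p -> length(first(p)), rev = true)
    return replace(text, ordered...)   # multi-pair replace needs Julia >= 1.7
end

# Hypothetical sample texts, not the real corpus.
textdata = ["xx LLKSL PJHCV XLAJP CIIIK yy", "VBUBO and IPZHD BSUPK"]
result = [annotate_with_dict(t, myPattern) for t in textdata]
```

Matching here is literal and case-sensitive, and the replacements themselves are not scanned again. With many thousands of n-grams, splatting that many pairs into one call may be slow, so a single regular expression might scale better.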
My interpretation is
given
<xml>
<texts>
<text>llksl PJHCV xlajp CIIIK</text>
<text>IPZHD bsupk</text>
</texts>
</xml>
and
<xml>
<refs>
<ref>PJHCV</ref>
<ref>CIIIK</ref>
<ref>IPZHD</ref>
</refs>
</xml>
output
<xml>
<texts>
<text>llksl <l>PJHCV</l> xlajp <l>CIIIK</l></text>
<text><l>IPZHD</l> bsupk</text>
</texts>
</xml>
?
It could be, more simply:
given
["llksl PJHCV xlajp CIIIK","IPZHD bsupk"]
and
refs = ["PJHCV", "CIIIK","IPZHD"]
output
["llksl <l>PJHCV</l> xlajp <l>CIIIK</l>","<l>IPZHD</l> bsupk"]
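For this simpler formulation, one possible Julia sketch builds a single regex alternation from refs and wraps each match, without an intermediate hash table. The function name annotate is hypothetical. The sketch assumes that every reference starts and ends with a word character (so the \b word boundaries make sense) and that matching is case-sensitive, as in the example.

```julia
# Wrap every occurrence of a reference in <l>...</l>.
function annotate(texts::AbstractVector{<:AbstractString},
                  refs::AbstractVector{<:AbstractString})
    isempty(refs) && return String.(texts)
    # Longest references first, so a longer n-gram is preferred over a shorter
    # one starting at the same position. \Q...\E quotes each reference literally.
    ordered = sort(refs; by = length, rev = true)
    alternatives = join(("\\Q" * r * "\\E" for r in ordered), "|")
    pattern = Regex("\\b(?:" * alternatives * ")\\b")
    # The function on the right-hand side receives each matched substring.
    return [replace(t, pattern => m -> "<l>" * m * "</l>") for t in texts]
end

texts = ["llksl PJHCV xlajp CIIIK", "IPZHD bsupk"]
refs = ["PJHCV", "CIIIK", "IPZHD"]
annotated = annotate(texts, refs)
# annotated == ["llksl <l>PJHCV</l> xlajp <l>CIIIK</l>", "<l>IPZHD</l> bsupk"]
```

Because the whole reference list becomes one pattern, each text is scanned once regardless of how many references there are.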
I just have a database column of data and would like to produce a second column with annotated named entities.
Another use case would be:
input
textfile1.txt:
llksl PJHCV xlajp CIIIK
textfile2.txt:
IPZHD bsupk
output
xmlfile1.xml
<xml>
<metadata>...</metadata>
<text>
llksl <l>PJHCV</l> xlajp <l>CIIIK</l>
</text>
</xml>
xmlfile2.xml
<xml>
<metadata>...</metadata>
<text>
<l>IPZHD</l> bsupk
</text>
</xml>
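The file-based variant could be sketched as below. The function names annotate_text and annotate_file are assumptions, as is the choice to pass the output path explicitly. The metadata element is written as the same placeholder as in the example, and the sketch does not escape XML special characters such as & or < in the input text.

```julia
# Tag each reference in one text (same regex idea as for the array case).
function annotate_text(text::AbstractString, refs)
    isempty(refs) && return String(text)
    ordered = sort(collect(refs); by = length, rev = true)
    alternatives = join(("\\Q" * r * "\\E" for r in ordered), "|")
    pattern = Regex("\\b(?:" * alternatives * ")\\b")
    return replace(text, pattern => m -> "<l>" * m * "</l>")
end

# Read a plain-text file and write the annotated text in the XML layout above.
# Note: no XML escaping of &, < or > is done here.
function annotate_file(inpath::AbstractString, outpath::AbstractString, refs)
    body = annotate_text(strip(read(inpath, String)), refs)
    open(outpath, "w") do io
        println(io, "<xml>")
        println(io, "<metadata>...</metadata>")  # placeholder, as in the example
        println(io, "<text>")
        println(io, body)
        println(io, "</text>")
        println(io, "</xml>")
    end
    return outpath
end
```

A call such as annotate_file("textfile1.txt", "xmlfile1.xml", ["PJHCV", "CIIIK", "IPZHD"]) would then produce the first output file; looping over a directory of text files is left out to keep the sketch short.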
But the central part is getting the original array of texts annotated with tags