Text Mining: Detect Strings: Word Lookup in a Large Corpus of Phrases Using a Large Dictionary

I am trying to port a multithreaded #RStats algorithm for detecting strings from a 200k-entry dictionary in a corpus of 1 million phrases (cf. Text Mining: Detect Strings: Very Fast Word Lookup in a Large Dictionary in R with data.table and matrixStats – Maps and Spaces).

So far, my R code (calling the C++-backed libraries data.table and matrixStats) is much faster. Can you help improve the Julia translation, i.e. optimize the last two lines of the code below?

## Imports
import Pkg; Pkg.add("DataFrames")
using Random, DataFrames, StatsBase

## Preparing random text sample and dictionary
mseq = 1:1000000
reference = String[]
textdata = DataFrame(
    id = mseq,
    textlong = "",
    occurs = 0
) 
Threads.@threads for i in mseq
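    # NB: push! from multiple threads is not thread-safe (see the replies below); kept as posted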
    push!(reference,randstring('A':'Z', 5))
    textdata[i,"textlong"] = join([randstring('A':'Z', 5) for j in 1:rand(1:10)], " ")
end
reference = unique(reference)
reference = reference[1:200000]

## Running the detection
### for each reference entry, generate a BitVector with one entry per row of textdata,
### then sum the vectors element-wise
textdata.occurs = [occursin.(x, textdata.textlong) for x in reference] |> sum
result = filter(row -> row.occurs > 0, textdata)

Suggested help by Szymon Bęczkowski:

I don’t know how to use it yet

I’ve moved this out of the “Optimization (Mathematical)” section to “Performance”.

You should start by reading the Julia performance tips:

https://docs.julialang.org/en/v1/manual/performance-tips/

I followed some of the performance tips, with no noticeable change from any of those I managed to apply. Either they have no effect or they don’t apply to my use-case.
It is very hard to work out how to do promise / await, as in JS; there is also no async-safe package for array-pushing like foreach in R.
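
For what it’s worth, the closest Julia analogue of JS’s promise / await is a Task: Threads.@spawn starts the work and returns a handle immediately, and fetch blocks until the result is ready. A minimal sketch with illustrative values, not code from my problem:

t = Threads.@spawn sum(rand(10_000))   # "promise": returns a Task handle right away
# ... other work can happen here ...
result = fetch(t)                      # "await": blocks until the Task completes, returns its value
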
I’m giving up and returning to R for now.
Perhaps in Julia 2 the code will be simpler to optimize and more high-level wrappers will be around. The version below is the best I’ve got so far.

import Pkg; Pkg.add("DataFrames")
Threads.nthreads()

## Preparing random text sample and dictionary
### This part takes less than a second
textdata_ids = 1:100000
textdata_textlong = String[] ### Types should be declared for performance
textdata_occurs = Int[]
reference = String[]
using Random
for i in textdata_ids
    push!(reference,randstring('A':'Z', 5))
    push!(textdata_textlong,join([randstring('A':'Z', 5) for j in 1:rand(1:10)]," "))
end
reference = unique(reference)
reference = reference[1:2000]

## Define the detection function and run it
### "Any code that is performance critical should be inside a function. Code inside functions tends to run much faster than top level code, due to how Julia's compiler works."
function detectwordinphrase(words::Vector{String},phrases::Vector{String})
    # occurrences = [occursin.(x,phrases) for x::String = words]
    # occurrences::Vector{BitVector} = [occursin.(x,phrases) for x::String = words]
    # occurrences = Vector{BitVector}(undef,length(words))
    occurrences = BitVector[]
    for i = 1:size(words,1) # This loop takes all of the time
            push!(occurrences,occursin.(words[i],phrases)) # LoadError: UndefRefError: access to undefined reference
        # occurrences[i] = occursin.(words[i],phrases)
    end
    # wait(occurrences)
    return sum(occurrences) # This takes no time at all
end;

##
# Running the detection:
@time textdata_occurs  = detectwordinphrase(reference,textdata_textlong)
# 10k phrases 2k words : 0.929850 seconds (40.01 M allocations: 2.544 GiB, 7.86% gc time)
# 100k phrases 2k words : 11.996917 seconds (400.01 M allocations: 25.363 GiB, 7.66% gc time)
# 100k phrases 20k words : 112.146993 seconds (4.00 G allocations: 253.637 GiB, 11.36% gc time)

using DataFrames
textdata=DataFrame(
    ids = textdata_ids, 
    textlong = textdata_textlong,
    occurs = textdata_occurs 
)
result = filter(row -> row.occurs > 0, textdata)

You’re looking for:
https://docs.julialang.org/en/v1/stdlib/Distributed/#Distributed.pmap

Start Julia with julia -p N where N is the number of processes you want to use (e.g., julia -p 8). Then perhaps something like:

(base) oscar@Oscars-MBP /tmp % ~/julia --project=df -p 8
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.2 (2021-07-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Distributed

julia> using Random

julia> function generate_data(N, M)
           textdata_ids = 1:N
           textdata_textlong = String[]
           reference = String[]
           for i in textdata_ids
               push!(reference, randstring('A':'Z', 5))
               push!(textdata_textlong, join([randstring('A':'Z', 5) for j in 1:rand(1:10)]," "))
           end
           reference = unique(reference)
           reference = reference[1:M]
           return reference, textdata_textlong
       end
generate_data (generic function with 1 method)

julia> function detectwordinphrase(words::Vector{String}, phrases::Vector{String})
           pool = CachingPool(workers())
           return let words = words
               pmap(pool, phrases) do phrase
                   return sum(occursin(word, phrase) for word in words)
               end
           end
       end
detectwordinphrase (generic function with 1 method)

julia> reference, textdata_textlong = generate_data(100_000, 20_000);

julia> @time textdata_occurs = detectwordinphrase(reference, textdata_textlong);
 38.472972 seconds (8.56 M allocations: 343.730 MiB, 0.38% gc time, 1.70% compilation time)

I thought I would take this on as a challenge, and there was a surprise in store.

It’s not exactly related to performance but … (well, it is on a single thread, because push! is slower; see below)

push! is not thread-safe

julia> Threads.nthreads()
40
julia> a = Vector{String}()
julia> Threads.@threads for i in 1:1_000_000
           push!(a, "$i")
          end
Segmentation fault   

and even if it doesn’t crash (which is likely, as it takes me about 1M pushes to make it crash), the result is undefined:

julia> a = Vector{String}()
julia> Threads.@threads for i in 1:1_000
         push!(a, "$i")
       end

julia> length(a)
978

A correct approach is to pre-allocate the vector. This is also a performance tip: even single-threaded, push! is much slower than writing into a pre-allocated vector.

julia> a = Vector{String}(undef, 1_000_000);

julia> Threads.@threads for i in 1:1_000_000
         a[i] = "$i"
       end 
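
When the final length is not known in advance, pre-allocation is not always an option; a lock around push! is a correct (if slower) thread-safe alternative. A minimal sketch, assuming element order does not matter:

a = String[]
lk = ReentrantLock()
Threads.@threads for i in 1:1_000_000
    s = "$i"               # do the real work outside the critical section
    lock(lk) do
        push!(a, s)        # only the push itself is serialized
    end
end
length(a)                  # now reliably 1_000_000 (order still unspecified)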

Many thanks!
I did run Julia with 4 threads. I’m using the Julia language support in VS Code and discovered it has a nice ‘Num Threads’ setting in the extension’s GUI parameters.

The function below was giving me errors last night; though, for some mysterious reason, it now runs, with the execution time halved:

function detectwordinphrase(words::Vector{String},phrases::Vector{String})
   occurrences = Vector{BitVector}(undef,length(words))
   Threads.@threads for i = 1:size(words,1) 
      occurrences[i] = occursin.(words[i],phrases)
   end
   return sum(occurrences) 
end;
@time textdata_occurs  = detectwordinphrase(reference,textdata_textlong)
# 100k phrases 2k words : 5.061899 seconds (400.01 M allocations: 25.363 GiB, 37.20% gc time)
# 100k phrases 20k words : 50.659987 seconds (4.00 G allocations: 253.655 GiB, 35.53% gc time, 0.12% compilation time)

Does putting the data generation in a function really help the detection function? It was already the quick part of the code, despite reference and textdata_textlong being globals. Are they no longer globals when using

reference, textdata_textlong = generate_data(100_000, 20_000);

?

Threads.@threads for... seems faster than pool = CachingPool(workers()) on my architecture (Apple Silicon M1 arm64)

## Detection 1
### "Any code that is performance critical should be inside a function. Code inside functions tends to run much faster than top level code, due to how Julia's compiler works."
function detectwordinphrase(words::Vector{String},phrases::Vector{String})
    # occurrences = [occursin.(x,phrases) for x::String = words]
    occurrences = Vector{BitVector}(undef,length(words))
    Threads.@threads for i = 1:size(words,1) # This loop takes all of the time
        occurrences[i] = occursin.(words[i],phrases)
    end
    return sum(occurrences) # This takes no time at all
end;
@time textdata_occurs  = detectwordinphrase(reference,textdata_textlong)
# Single threaded:
# 10k phrases 2k words : 0.929850 seconds (40.01 M allocations: 2.544 GiB, 7.86% gc time)
# 100k phrases 2k words : 11.996917 seconds (400.01 M allocations: 25.363 GiB, 7.66% gc time)
# 100k phrases 20k words : 112.146993 seconds (4.00 G allocations: 253.637 GiB, 11.36% gc time)
# Multithreaded 4 threads
# 100k phrases 2k words : 5.061899 seconds (400.01 M allocations: 25.363 GiB, 37.20% gc time)
# 100k phrases 20k words : 50.659987 seconds (4.00 G allocations: 253.655 GiB, 35.53% gc time)
# Multithreaded 8 Threads
# 100k phrases 20k words :  37.712681 seconds (4.00 G allocations: 253.655 GiB, 51.64% gc time, 0.16% compilation time)

## Detection 2:
using Distributed
function detectwordinphrase2(words::Vector{String}, phrases::Vector{String})
    pool = CachingPool(workers())
    return let words = words
        pmap(pool, phrases) do phrase
            return sum(occursin(word, phrase) for word in words)
        end
    end
end
@time textdata_occurs  = detectwordinphrase2(reference,textdata_textlong)
# Multithreaded 4 Threads
# 100k phrases 20k words : 88.694093 seconds (4.00 G allocations: 238.460 GiB, 21.67% gc time, 0.18% compilation time)
# Multithreaded 8 Threads
# 100k phrases 20k words :  90.496826 seconds (4.00 G allocations: 238.460 GiB, 21.99% gc time, 0.17% compilation time)
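
For what it’s worth, a middle ground is to keep Threads.@threads but loop over phrases and count matches directly, which avoids allocating a BitVector per word. A sketch; detectwordinphrase3 is a hypothetical name and not benchmarked here:

function detectwordinphrase3(words::Vector{String}, phrases::Vector{String})
    occurs = Vector{Int}(undef, length(phrases))
    Threads.@threads for i in eachindex(phrases)
        # a generator inside count avoids materializing any intermediate vector
        occurs[i] = count(occursin(word, phrases[i]) for word in words)
    end
    return occurs
end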

But still nowhere near <2 secs for 1M phrases and 200k words in the R version (Text Mining: Detect Strings: Very Fast Word Lookup in a Large Dictionary in R with data.table and matrixStats – Maps and Spaces)

One question: you’re using occursin, which finds “ea” in “bread”, but your words are all 5 chars long, so == is much quicker.

I appreciate your real problem probably doesn’t have fixed-length strings, but do you really need to search substrings?

I’ve been lazy in the construction of the random dataset. My real-world application has variable string lengths.
If == can handle them, I’ll take it.

The operation that made my R code fast was the as.matrix(splitedv) %>% rowAnys, i.e. operating on whole vectors (for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% dict)), then summing the columns of the matrix “row-wise”. rowAnys is actually boolean; it comes from the powerful matrixStats package. The %chin% part, equivalent to occursin, comes from data.table.

detectnameinstring <- function(x,dict) {
  splitedv <- tstrsplit(x,"[ ]") %>% as.data.table
  cols <- colnames(splitedv)
  for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% dict)
  return(as.matrix(splitedv) %>% rowAnys)
}

Can you try InlineStrings instead of plain strings?

I could, but the function needs to handle strings with variable lengths.
They only have fixed lengths in my random data.
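
For the record, InlineStrings does handle variable lengths: each of its types (String3, String7, String15, …) just caps the byte size, and InlineString(s) picks the smallest type that fits. A sketch of a membership lookup, assuming every word fits in 15 bytes:

using InlineStrings, Random

words  = [randstring('A':'Z', rand(3:8)) for _ in 1:1000]   # variable lengths
dict   = Set(String15.(words[1:200]))                       # fixed-size, stack-friendly keys
tokens = String15.(rand(words, 10_000))

hits = count(in(dict), tokens)                              # how many tokens are in the dictionary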

Out of interest, how long does R take to count the number of occurrences of 200_000 words in 1_000_000 phrases?

With == and 40 threads I’ve got it down to

julia> goA(1_000_000, 200_000)

502.911732 seconds (36.10 k allocations: 11.112 MiB, 0.01% compilation time)

R does the occurrence count of 200k words in 1M phrases in 11 seconds on an Apple M1 Pro, even without explicit multithreading (though data.table reports running 5 threads under the hood). To count instead of just detecting (true/false), one can simply replace rowAnys with rowSums.

Strangely enough, what takes R long is the preparation of the random dataset, which Julia handles in less than a second. R’s detection, though, is incomparably faster.

R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20.6.0 (64-bit)
> library("data.table")
data.table 1.14.2 using 5 threads (see ?getDTthreads).  Latest news: r-datatable.com
> library("magrittr")
> library("matrixStats")
> library("stringr")
> randomString <- function(n = 5000) {
+   a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
+   paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
+   return(a)
+ }
> text.data <- data.table(id = 1:1000000)
> system.time({
+   text.data[, text := {randomString(nrow(text.data))} ]
+ })
   user  system elapsed 
  1.718   0.038   1.756 
> system.time({
+   text.data[, textlong := sapply(id, function(x) randomString(round(runif(1,1,10))) %>% paste(.,collapse=" ")) ]
+ })
   user  system elapsed 
 70.678   0.858  71.513 
> reference  <- data.table(ref = sample(unique(text.data$text),200000))
> countnameinstring <- function(x,dict) {
+   splitedv <- tstrsplit(x,"[ ]") %>% as.data.table
+   cols <- colnames(splitedv)
+   for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% dict)
+   return(as.matrix(splitedv) %>% rowSums)
+ }
> 
> system.time({
+   text.data[ , dicmatchinphrase2 := countnameinstring(textlong,reference$ref)]
+ }) 
   user  system elapsed 
 11.935   0.079  12.010 

ouch

Let’s do things properly (I hope!)

1_000_000 phrases & 200_000 words in 0.7 secs, single-threaded

I noticed that the R code effectively does ==,
so I used a Set and in.

function fill(n)
    ## Preparing random text sample and dictionary
    searchterms = Vector{String}(undef, n)
    textdata = Vector{Vector{String}}(undef, n)
    Threads.@threads for i in 1:n
        searchterms[i] = randstring('A':'Z', 5)
        textdata[i] = [randstring('A':'Z', 5) for _ in 1:rand(1:10)]
    end
    textdata, unique(searchterms)
end

function processT(textdata, searchset)
    n = size(textdata, 1)
    occurs = Vector{Int}(undef, n)
    Threads.@threads for i in 1:n
        occurs[i] = map(t-> t in searchset ? 1 : 0, textdata[i]) |> sum
    end
    occurs
end

function processS(textdata, searchset)
    n = size(textdata, 1)
    occurs = Vector{Int}(undef, n)
    for i in 1:n
        occurs[i] = map(t-> t in searchset ? 1 : 0, textdata[i]) |> sum
    end
    occurs
end

@time textdata, searchterms = fill(1_000_000);
  1.287460 seconds (14.00 M allocations: 845.350 MiB, 58.86% gc time)

@time searchset = Set(searchterms[1:200_000]);
  0.022542 seconds (10 allocations: 6.026 MiB)

@time result = processT(textdata, searchset); # 40 threads
  0.033092 seconds (1.00 M allocations: 106.820 MiB)

@time result = processS(textdata, searchset); # single threaded
  0.722141 seconds (1.00 M allocations: 106.801 MiB)
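
Side note: the 1.00 M allocations in processT and processS come from the temporary array that map builds for every phrase; count with a predicate avoids it entirely. A sketch (processC is a hypothetical name, behaviour otherwise identical to processT):

function processC(textdata, searchset)
    occurs = Vector{Int}(undef, length(textdata))
    Threads.@threads for i in eachindex(textdata)
        # count with a predicate allocates no intermediate array
        occurs[i] = count(in(searchset), textdata[i])
    end
    occurs
end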

Very nice!

Even the single-threaded Julia version runs faster than the 8-threaded R call to my previous countnameinstring R function!

> countnameinstringparallel <- function(x,dict,rowsplit) {
+   if (length(x)<rowsplit) {
+     return(countnameinstring(x,dict))
+   } else {
+     myseq <- seq(0,length(x),rowsplit)
+     if( length(x) %% rowsplit > 0) myseq <- c(myseq,length(x))
+     cl <- parallel::makeCluster(8) 
+     doParallel::registerDoParallel(cl)
+     foreach(
+       i=1:(length(myseq)-1),
+       .packages = c("stringr","data.table","matrixStats"),
+       .export = c("dict","countnameinstring")
+     ) %dopar% {
+       countnameinstring(x[(myseq[i]+1):(myseq[i+1])],dict)
+     } %>% unlist
+   }
+ }
> 
> system.time({
+   text.data[ , dicmatchinphrase := countnameinstringparallel(textlong,reference$ref,10000)]
+ }) 
   user  system elapsed 
  0.840   0.286   8.538 

The Julia version on your M1 Pro might be even quicker than my creaky old Xeon 🙂

We’ve seen benchmarks where the M1 is 2x-3x faster, even with all of my 40 cores against the M1’s 8.

Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

😀
Riding an electric bike in the city, I don’t have a car, so I bought myself one of these:

julia version 1.7.0
Model Identifier: MacBookPro18,1
Chip: Apple M1 Pro
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 32 GB
System Version: macOS 12.0.1 (21A559)
Kernel Version: Darwin 21.1.0

With your code:

@time textdata, searchterms = fill(1_000_000);
# 0.543755 seconds (14.22 M allocations: 858.079 MiB, 34.46% gc time, 5.28% compilation time)
@time searchset = Set(searchterms[1:200_000]);
# 0.010282 seconds (10 allocations: 6.026 MiB)
@time result = processT(textdata, searchset); # 8 threads
# 0.079716 seconds (1.12 M allocations: 113.163 MiB, 31.10% compilation time)
@time result = processS(textdata, searchset); # single threaded
# 0.439186 seconds (1.07 M allocations: 110.820 MiB, 3.45% compilation time)

The next nut to crack would be string replacement, i.e. porting this one: it replaces parts of the text (XML annotation) whenever an n-gram from a 200k dictionary is found in those 1 million texts.

Should I start a new topic or leave it here, since it works on the same random data?

# prepare a second, longer dictionary with ngrams (1-gram to 4-gram) ----
reference2_ngrams <- text.data$textlong %>% sapply(.,function(x){word(x,sample(1:2,1),sample(1:4,1))}) %>% .[!is.na(.) & str_length(.)>0 ] %>% unique %>% .[1:200000]

# replacement pattern
myPattern <- paste0("<l>",reference2_ngrams,"</l>") %>% setNames(.,reference2_ngrams)

system.time({
	text.data[1:100,textAnnotated := str_replace_all(
	  textlong,
	  fixed(myPattern)
	)]
})
#    user  system elapsed 
#  8.456   0.468   8.940
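
Not from the thread, but a rough sketch of how the same annotation step could start out in Julia, simplified (relative to the R version’s 1- to 4-grams) to matching single space-separated tokens; annotate and its arguments are illustrative names:

function annotate(phrase::AbstractString, dict)
    # wrap every token found in the dictionary with <l>…</l>
    tokens = map(split(phrase, ' ')) do token
        token in dict ? "<l>" * token * "</l>" : String(token)
    end
    return join(tokens, ' ')
end

# e.g. with the random data from above:
# annotated = annotate.(textdata_textlong, Ref(Set(reference)))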