So far, my R code (calling the C/C++-backed libraries data.table and matrixStats) is much faster. Can you help improve the Julia translation? I.e., optimize the last two lines of the code below.
## Imports
import Pkg; Pkg.add("DataFrames")
using Random, DataFrames, StatsBase
## Preparing random text sample and dictionary
mseq = 1:1000000
reference = String[]
textdata = DataFrame(
    id = mseq,
    textlong = "",
    occurs = 0
)
Threads.@threads for i in mseq
    push!(reference, randstring('A':'Z', 5))
    textdata[i, "textlong"] = join([randstring('A':'Z', 5) for j in 1:rand(1:10)], " ")
end
reference = unique(reference)
reference = reference[1:200000]
## Running the detection
### for each reference entry, generate a bit vector of length = nrow(textdata),
### then sum the vectors
textdata.occurs = [occursin.(x, textdata.textlong) for x in reference] |> sum
result = filter(row -> row.occurs > 0, textdata)
I followed some of the performance tips, with no noticeable change from any of those I managed to apply. Either they have no effect or they don't apply to my use case.
I find it very hard to work out how to do promise/await as in JS; there also seems to be no thread-safe package for array-pushing like foreach in R.
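For what it's worth, the closest parallel I found is Tasks, though I could not see how to apply it here (a sketch):
t = Threads.@spawn sum(rand(10^6)) # starts work in the background, like creating a promise
result = fetch(t)                  # blocks until the result is ready, like await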
I’m giving up and returning to R for now.
Perhaps in Julia 2 the code will be simpler to optimize and more high-level wrappers will be available. The version below is the best I've got so far.
import Pkg; Pkg.add("DataFrames")
Threads.nthreads()
## Preparing random text sample and dictionary
### This part only takes less than a second
textdata_ids = 1:100000
textdata_textlong = String[] ### Types should be declared for performance
textdata_occurs = Int[]
reference = String[]
using Random
for i in textdata_ids
    push!(reference, randstring('A':'Z', 5))
    push!(textdata_textlong, join([randstring('A':'Z', 5) for j in 1:rand(1:10)], " "))
end
reference = unique(reference)
reference = reference[1:2000]
## Define the detection function and run it
### "Any code that is performance critical should be inside a function. Code inside functions tends to run much faster than top level code, due to how Julia's compiler works."
function detectwordinphrase(words::Vector{String}, phrases::Vector{String})
    # occurrences = [occursin.(x, phrases) for x::String = words]
    # occurrences::Vector{BitVector} = [occursin.(x, phrases) for x::String = words]
    # occurrences = Vector{BitVector}(undef, length(words))
    occurrences = BitVector[]
    for i = 1:size(words, 1) # This loop takes all of the time
        push!(occurrences, occursin.(words[i], phrases)) # LoadError: UndefRefError: access to undefined reference
        # occurrences[i] = occursin.(words[i], phrases)
    end
    # wait(occurrences)
    return sum(occurrences) # This takes no time at all
end;
##
# Running the detection:
@time textdata_occurs = detectwordinphrase(reference,textdata_textlong)
# 10k phrases 2k words : 0.929850 seconds (40.01 M allocations: 2.544 GiB, 7.86% gc time)
# 100k phrases 2k words : 11.996917 seconds (400.01 M allocations: 25.363 GiB, 7.66% gc time)
# 100k phrases 20k words : 112.146993 seconds (4.00 G allocations: 253.637 GiB, 11.36% gc time)
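For timings that exclude compilation, BenchmarkTools could be used (a sketch, assuming the package is installed; the $ interpolation keeps the globals from penalising the benchmark):
using BenchmarkTools
@btime detectwordinphrase($reference, $textdata_textlong);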
using DataFrames
textdata = DataFrame(
    ids = textdata_ids,
    textlong = textdata_textlong,
    occurs = textdata_occurs
)
result = filter(row -> row.occurs > 0, textdata)
Start Julia with julia -p N where N is the number of processes you want to use (e.g., julia -p 8). Then perhaps something like:
(base) oscar@Oscars-MBP /tmp % ~/julia --project=df -p 8
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.2 (2021-07-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> using Distributed
julia> using Random
julia> function generate_data(N, M)
           textdata_ids = 1:N
           textdata_textlong = String[]
           reference = String[]
           for i in textdata_ids
               push!(reference, randstring('A':'Z', 5))
               push!(textdata_textlong, join([randstring('A':'Z', 5) for j in 1:rand(1:10)], " "))
           end
           reference = unique(reference)
           reference = reference[1:M]
           return reference, textdata_textlong
       end
generate_data (generic function with 1 method)
julia> function detectwordinphrase(words::Vector{String}, phrases::Vector{String})
           pool = CachingPool(workers())
           return let words = words
               pmap(pool, phrases) do phrase
                   return sum(occursin(word, phrase) for word in words)
               end
           end
       end
detectwordinphrase (generic function with 1 method)
julia> reference, textdata_textlong = generate_data(100_000, 20_000);
julia> @time textdata_occurs = detectwordinphrase(reference, textdata_textlong);
38.472972 seconds (8.56 M allocations: 343.730 MiB, 0.38% gc time, 1.70% compilation time)
I thought I would take this on as a challenge, and there was a surprise in store.
It's not exactly related to performance, but … (well, it is for single-threaded code, because push! is slower; see below).
push! is not thread-safe:
julia> Threads.nthreads()
40
julia> a = Vector{String}()
julia> Threads.@threads for i in 1:1_000_000
           push!(a, "$i")
       end
Segmentation fault
And even if it doesn't crash (which is likely, as it takes me about 1M pushes to make it crash), the result is undefined:
julia> a = Vector{String}()
julia> Threads.@threads for i in 1:1_000
           push!(a, "$i")
       end
julia> length(a)
978
A correct approach is to pre-allocate the vector. This is also a performance tip, as push! is much slower single-threaded than pre-allocation:
julia> a = Vector{String}(undef, 1_000_000);
julia> Threads.@threads for i in 1:1_000_000
           a[i] = "$i"
       end
julia>
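If push! semantics were really needed, a lock would also make the loop safe, though it serialises the pushes and scrambles their order (a sketch; pre-allocation remains the better option here):
a = Vector{String}()
lck = ReentrantLock()
Threads.@threads for i in 1:1_000_000
    lock(lck) do
        push!(a, "$i") # only one thread pushes at a time
    end
end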
Many thanks!
I did run Julia with 4 threads. I'm using the Julia language support in VS Code and discovered it has a nice 'Num Threads' setting in the extension's GUI parameters.
The function below was giving me errors last night; though, for a mysterious reason, it now runs, with the execution time halved:
function detectwordinphrase(words::Vector{String}, phrases::Vector{String})
    occurrences = Vector{BitVector}(undef, length(words))
    Threads.@threads for i = 1:size(words, 1)
        occurrences[i] = occursin.(words[i], phrases)
    end
    return sum(occurrences)
end;
@time textdata_occurs = detectwordinphrase(reference, textdata_textlong)
# 100k phrases 2k words : 5.061899 seconds (400.01 M allocations: 25.363 GiB, 37.20% gc time)
# 100k phrases 20k words : 50.659987 seconds (4.00 G allocations: 253.655 GiB, 35.53% gc time, 0.12% compilation time)
Does putting the data generation in a function really help the detection function? It was already the quick part of the code, despite reference and textdata_textlong being globals. Are they no longer globals once returned from generate_data?
Threads.@threads for... seems faster than pool = CachingPool(workers()) on my architecture (Apple Silicon M1 arm64), presumably because pmap distributes the work across worker processes, which pay data-transfer costs, while threads share memory.
## Detection 1
### "Any code that is performance critical should be inside a function. Code inside functions tends to run much faster than top level code, due to how Julia's compiler works."
function detectwordinphrase(words::Vector{String}, phrases::Vector{String})
    # occurrences = [occursin.(x, phrases) for x::String = words]
    occurrences = Vector{BitVector}(undef, length(words))
    Threads.@threads for i = 1:size(words, 1) # This loop takes all of the time
        occurrences[i] = occursin.(words[i], phrases)
    end
    return sum(occurrences) # This takes no time at all
end;
@time textdata_occurs = detectwordinphrase(reference, textdata_textlong)
# Single threaded:
# 10k phrases 2k words : 0.929850 seconds (40.01 M allocations: 2.544 GiB, 7.86% gc time)
# 100k phrases 2k words : 11.996917 seconds (400.01 M allocations: 25.363 GiB, 7.66% gc time)
# 100k phrases 20k words : 112.146993 seconds (4.00 G allocations: 253.637 GiB, 11.36% gc time)
# Multithreaded 4 threads
# 100k phrases 2k words : 5.061899 seconds (400.01 M allocations: 25.363 GiB, 37.20% gc time)
# 100k phrases 20k words : 50.659987 seconds (4.00 G allocations: 253.655 GiB, 35.53% gc time)
# Multithreaded 8 Threads
# 100k phrases 20k words : 37.712681 seconds (4.00 G allocations: 253.655 GiB, 51.64% gc time, 0.16% compilation time)
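A variant I have not timed (a sketch, so the allocation savings are an assumption): looping over phrases instead of words avoids materialising one BitVector per word, since count never builds the intermediate array.
function detectwordinphrase3(words::Vector{String}, phrases::Vector{String})
    occurs = Vector{Int}(undef, length(phrases))
    Threads.@threads for j in eachindex(phrases)
        # count how many dictionary words occur in this phrase, with no temporaries
        occurs[j] = count(word -> occursin(word, phrases[j]), words)
    end
    return occurs
end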
## Detection 2:
using Distributed
function detectwordinphrase2(words::Vector{String}, phrases::Vector{String})
    pool = CachingPool(workers())
    return let words = words
        pmap(pool, phrases) do phrase
            return sum(occursin(word, phrase) for word in words)
        end
    end
end
@time textdata_occurs = detectwordinphrase2(reference, textdata_textlong)
# Multithreaded 4 Threads
# 100k phrases 20k words : 88.694093 seconds (4.00 G allocations: 238.460 GiB, 21.67% gc time, 0.18% compilation time)
# Multithreaded 8 Threads
# 100k phrases 20k words : 90.496826 seconds (4.00 G allocations: 238.460 GiB, 21.99% gc time, 0.17% compilation time)
I've been lazy in the construction of the random dataset; my real-world application has variable string lengths. If == can handle them, I'll take it.
The operation that made my R code fast was as.matrix(splitedv) %>% rowAnys, i.e. operating on whole vectors (for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% dict)), then summing the columns of the matrix "row-wise". rowAnys actually returns booleans; it comes from the powerful matrixStats package. The %chin% part, the counterpart of occursin here, comes from data.table.
detectnameinstring <- function(x,dict) {
splitedv <- tstrsplit(x,"[ ]") %>% as.data.table
cols <- colnames(splitedv)
for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% dict)
return(as.matrix(splitedv) %>% rowAnys)
}
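A direct Julia port of that idea might look like this (a sketch, assuming whole-token matches after splitting on spaces are enough, unlike occursin's substring search):
function detectnameinphrase(phrases::Vector{String}, dict::Vector{String})
    dictset = Set(dict) # hashed membership test, the analogue of %chin%
    # any(...) plays the role of rowAnys: true if any token is in the dictionary
    return [any(in(dictset), split(p, ' ')) for p in phrases]
end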
R does the occurrence count of 200k words in 1M phrases in 11 seconds on an Apple M1 Pro, even without explicit multithreading (though data.table reports running 5 threads under the hood). To count instead of just detecting (true/false), one can simply replace rowAnys with rowSums.
Strangely enough, what takes R long is the preparation of the random dataset, which Julia handles in less than a second. But the detection is incomparably faster:
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20.6.0 (64-bit)
> library("data.table")
data.table 1.14.2 using 5 threads (see ?getDTthreads). Latest news: r-datatable.com
> library("magrittr")
> library("matrixStats")
> library("stringr")
> randomString <- function(n = 5000) {
+ a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
+ paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
+ return(a)
+ }
> text.data <- data.table(id = 1:1000000)
> system.time({
+ text.data[, text := {randomString(nrow(text.data))} ]
+ })
user system elapsed
1.718 0.038 1.756
> system.time({
+ text.data[, textlong := sapply(id, function(x) randomString(round(runif(1,1,10))) %>% paste(.,collapse=" ")) ]
+ })
user system elapsed
70.678 0.858 71.513
> reference <- data.table(ref = sample(unique(text.data$text),200000))
> countnameinstring <- function(x,dict) {
+ splitedv <- tstrsplit(x,"[ ]") %>% as.data.table
+ cols <- colnames(splitedv)
+ for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% dict)
+ return(as.matrix(splitedv) %>% rowSums)
+ }
>
> system.time({
+ text.data[ , dicmatchinphrase2 := countnameinstring(textlong,reference$ref)]
+ })
user system elapsed
11.935 0.079 12.010
using Random # for randstring
function fill(n)
    ## Preparing random text sample and dictionary
    searchterms = Vector{String}(undef, n)
    textdata = Vector{Vector{String}}(undef, n)
    Threads.@threads for i in 1:n
        searchterms[i] = randstring('A':'Z', 5)
        textdata[i] = [randstring('A':'Z', 5) for _ in 1:rand(1:10)]
    end
    textdata, unique(searchterms)
end
function processT(textdata, searchset)
    n = size(textdata, 1)
    occurs = Vector{Int}(undef, n)
    Threads.@threads for i in 1:n
        occurs[i] = map(t -> t in searchset ? 1 : 0, textdata[i]) |> sum
    end
    occurs
end
function processS(textdata, searchset)
    n = size(textdata, 1)
    occurs = Vector{Int}(undef, n)
    for i in 1:n
        occurs[i] = map(t -> t in searchset ? 1 : 0, textdata[i]) |> sum
    end
    occurs
end
@time textdata, searchterms = fill(1_000_000);
1.287460 seconds (14.00 M allocations: 845.350 MiB, 58.86% gc time)
@time searchset = Set(searchterms[1:200_000]);
0.022542 seconds (10 allocations: 6.026 MiB)
@time result = processT(textdata, searchset); # 40 threads
0.033092 seconds (1.00 M allocations: 106.820 MiB)
@time result = processS(textdata, searchset); # single threaded
0.722141 seconds (1.00 M allocations: 106.801 MiB)
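One further tweak that might shave off the remaining allocations (an untimed assumption on my side): count with a membership predicate skips the temporary 0/1 vector that map builds for every row.
function processC(textdata, searchset)
    n = length(textdata)
    occurs = Vector{Int}(undef, n)
    Threads.@threads for i in 1:n
        occurs[i] = count(in(searchset), textdata[i]) # no intermediate array
    end
    occurs
end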
Riding an electric bike in the city, I don't have a car, so I bought myself one of those:
julia version 1.7.0
Model Identifier: MacBookPro18,1
Chip: Apple M1 Pro
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 32 GB
System Version: macOS 12.0.1 (21A559)
Kernel Version: Darwin 21.1.0
With your code:
@time textdata, searchterms = fill(1_000_000);
# 0.543755 seconds (14.22 M allocations: 858.079 MiB, 34.46% gc time, 5.28% compilation time)
@time searchset = Set(searchterms[1:200_000]);
# 0.010282 seconds (10 allocations: 6.026 MiB)
@time result = processT(textdata, searchset); # 8 threads
# 0.079716 seconds (1.12 M allocations: 113.163 MiB, 31.10% compilation time)
@time result = processS(textdata, searchset); # single threaded
# 0.439186 seconds (1.07 M allocations: 110.820 MiB, 3.45% compilation time)
The next nut to crack would be string replacement, i.e., to port this one: it replaces parts of the text (XML annotation) whenever an n-gram from a 200k dictionary is found in those 1 million texts.
Should I start a new topic or leave it here, since it works on the same random data?
# prepare a second, longer dictionary with ngrams (1-gram to 4-gram) ----
reference2_ngrams <- text.data$textlong %>% sapply(.,function(x){word(x,sample(1:2,1),sample(1:4,1))}) %>% .[!is.na(.) & str_length(.)>0 ] %>% unique %>% .[1:200000]
# replacement pattern
myPattern <- paste0("<l>",reference2_ngrams,"</l>") %>% setNames(.,reference2_ngrams)
system.time({
text.data[1:100,textAnnotated := str_replace_all(
textlong,
fixed(myPattern)
)]
})
# user system elapsed
# 8.456 0.468 8.940
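As a starting point on the Julia side, a deliberately naive port could look like this (a sketch: the variable names and the plain-substring semantics, matching stringr's fixed(), are assumptions, and looping 200k patterns per text will be slow; it is a correctness baseline only):
function annotate(text::String, ngrams::Vector{String})
    for ng in ngrams # one pass over the text per n-gram: simple but slow
        text = replace(text, ng => "<l>" * ng * "</l>")
    end
    return text
end
# usage, with assumed names: annotated = [annotate(t, reference2_ngrams) for t in texts]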