So far, my R code (calling the C/C++-backed libraries data.table and matrixStats) is much faster. Can you help improve the Julia translation? I.e., optimize the last two lines of the code below.
## Imports
import Pkg; Pkg.add("DataFrames")
using Random, DataFrames, StatsBase
## Preparing random text sample and dictionary
mseq = 1:1000000
reference = String[]
textdata = DataFrame(
    id = mseq,
    textlong = "",
    occurs = 0
)
Threads.@threads for i in mseq
    push!(reference, randstring('A':'Z', 5))
    textdata[i, "textlong"] = join([randstring('A':'Z', 5) for j in 1:rand(1:10)], " ")
end
reference = unique(reference)
reference = reference[1:200000]
## Running the detection
### for each reference entry, generate a bit vector of length = nrow(textdata),
### then sum the vectors
textdata.occurs = [occursin.(x, textdata.textlong) for x in reference] |> sum
result = filter(row -> row.occurs > 0, textdata)
I followed some of the performance tips, with no noticeable change from any of those I managed to apply. Either they have no effect or they don't apply to my use case.
I find it very hard to work out how to do promise/await as in JS; there also seems to be no thread-safe package for array-pushing like foreach in R.
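For what it's worth, the closest parallel I found is Tasks, though I could not see how to apply it here (a sketch):
t = Threads.@spawn sum(rand(10^6)) # starts work in the background, like creating a promise
result = fetch(t)                  # blocks until the result is ready, like await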
I’m giving up and returning to R for now.
Perhaps in Julia 2 the code will be simpler to optimize and more high-level wrappers will be available. The version below is the best I've got so far.
import Pkg; Pkg.add("DataFrames")
Threads.nthreads()
## Preparing random text sample and dictionary
### This part only takes less than a second
textdata_ids = 1:100000
textdata_textlong = String[] ### Types should be declared for performance
textdata_occurs = Int[]
reference = String[]
using Random
for i in textdata_ids
    push!(reference, randstring('A':'Z', 5))
    push!(textdata_textlong, join([randstring('A':'Z', 5) for j in 1:rand(1:10)], " "))
end
reference = unique(reference)
reference = reference[1:2000]
## Define the detection function and run it
### "Any code that is performance critical should be inside a function. Code inside functions tends to run much faster than top level code, due to how Julia's compiler works."
function detectwordinphrase(words::Vector{String}, phrases::Vector{String})
    # occurrences = [occursin.(x, phrases) for x::String = words]
    # occurrences::Vector{BitVector} = [occursin.(x, phrases) for x::String = words]
    # occurrences = Vector{BitVector}(undef, length(words))
    occurrences = BitVector[]
    for i = 1:size(words, 1) # This loop takes all of the time
        push!(occurrences, occursin.(words[i], phrases)) # LoadError: UndefRefError: access to undefined reference
        # occurrences[i] = occursin.(words[i], phrases)
    end
    # wait(occurrences)
    return sum(occurrences) # This takes no time at all
end;
##
# Running the detection:
@time textdata_occurs = detectwordinphrase(reference,textdata_textlong)
# 10k phrases 2k words : 0.929850 seconds (40.01 M allocations: 2.544 GiB, 7.86% gc time)
# 100k phrases 2k words : 11.996917 seconds (400.01 M allocations: 25.363 GiB, 7.66% gc time)
# 100k phrases 20k words : 112.146993 seconds (4.00 G allocations: 253.637 GiB, 11.36% gc time)
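For timings that exclude compilation, BenchmarkTools could be used (a sketch, assuming the package is installed; the $ interpolation keeps the globals from penalising the benchmark):
using BenchmarkTools
@btime detectwordinphrase($reference, $textdata_textlong);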
using DataFrames
textdata = DataFrame(
    ids = textdata_ids,
    textlong = textdata_textlong,
    occurs = textdata_occurs
)
result = filter(row -> row.occurs > 0, textdata)
Start Julia with julia -p N where N is the number of processes you want to use (e.g., julia -p 8). Then perhaps something like:
(base) oscar@Oscars-MBP /tmp % ~/julia --project=df -p 8
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.2 (2021-07-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> using Distributed
julia> using Random
julia> function generate_data(N, M)
           textdata_ids = 1:N
           textdata_textlong = String[]
           reference = String[]
           for i in textdata_ids
               push!(reference, randstring('A':'Z', 5))
               push!(textdata_textlong, join([randstring('A':'Z', 5) for j in 1:rand(1:10)], " "))
           end
           reference = unique(reference)
           reference = reference[1:M]
           return reference, textdata_textlong
       end
generate_data (generic function with 1 method)
julia> function detectwordinphrase(words::Vector{String}, phrases::Vector{String})
           pool = CachingPool(workers())
           return let words = words
               pmap(pool, phrases) do phrase
                   return sum(occursin(word, phrase) for word in words)
               end
           end
       end
detectwordinphrase (generic function with 1 method)
julia> reference, textdata_textlong = generate_data(100_000, 20_000);
julia> @time textdata_occurs = detectwordinphrase(reference, textdata_textlong);
38.472972 seconds (8.56 M allocations: 343.730 MiB, 0.38% gc time, 1.70% compilation time)
I thought I would take this on as a challenge, and there was a surprise in store.
It's not exactly related to performance, but … (well, it is for single-threaded code, because push! is slower; see below).
push! is not thread-safe:
julia> Threads.nthreads()
40
julia> a = Vector{String}()
julia> Threads.@threads for i in 1:1_000_000
           push!(a, "$i")
       end
Segmentation fault
And even if it doesn't crash (which is likely, as it takes me about 1M pushes to make it crash), the result is undefined:
julia> a = Vector{String}()
julia> Threads.@threads for i in 1:1_000
           push!(a, "$i")
       end
julia> length(a)
978
A correct approach is to pre-allocate the vector. This is also a performance tip, as push! is much slower single-threaded than pre-allocation:
julia> a = Vector{String}(undef, 1_000_000);
julia> Threads.@threads for i in 1:1_000_000
           a[i] = "$i"
       end
julia>
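If push! semantics were really needed, a lock would also make the loop safe, though it serialises the pushes and scrambles their order (a sketch; pre-allocation remains the better option here):
a = Vector{String}()
lck = ReentrantLock()
Threads.@threads for i in 1:1_000_000
    lock(lck) do
        push!(a, "$i") # only one thread pushes at a time
    end
end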
Many thanks!
I did run Julia with 4 threads. I'm using the Julia language support in VS Code and discovered it has a nice 'Num Threads' setting in the extension's GUI parameters.
The function below was giving me errors last night; though, for a mysterious reason, it now runs, with the execution time halved:
function detectwordinphrase(words::Vector{String}, phrases::Vector{String})
    occurrences = Vector{BitVector}(undef, length(words))
    Threads.@threads for i = 1:size(words, 1)
        occurrences[i] = occursin.(words[i], phrases)
    end
    return sum(occurrences)
end;
@time textdata_occurs = detectwordinphrase(reference, textdata_textlong)
# 100k phrases 2k words : 5.061899 seconds (400.01 M allocations: 25.363 GiB, 37.20% gc time)
# 100k phrases 20k words : 50.659987 seconds (4.00 G allocations: 253.655 GiB, 35.53% gc time, 0.12% compilation time)
Does putting the data generation in a function really help the detection function? It was already the quick part of the code, despite reference and textdata_textlong being globals. Are they no longer globals once returned from generate_data?
Threads.@threads for... seems faster than pool = CachingPool(workers()) on my architecture (Apple Silicon M1 arm64), presumably because pmap distributes the work across worker processes, which pay data-transfer costs, while threads share memory.
## Detection 1
### "Any code that is performance critical should be inside a function. Code inside functions tends to run much faster than top level code, due to how Julia's compiler works."
function detectwordinphrase(words::Vector{String}, phrases::Vector{String})
    # occurrences = [occursin.(x, phrases) for x::String = words]
    occurrences = Vector{BitVector}(undef, length(words))
    Threads.@threads for i = 1:size(words, 1) # This loop takes all of the time
        occurrences[i] = occursin.(words[i], phrases)
    end
    return sum(occurrences) # This takes no time at all
end;
@time textdata_occurs = detectwordinphrase(reference, textdata_textlong)
# Single threaded:
# 10k phrases 2k words : 0.929850 seconds (40.01 M allocations: 2.544 GiB, 7.86% gc time)
# 100k phrases 2k words : 11.996917 seconds (400.01 M allocations: 25.363 GiB, 7.66% gc time)
# 100k phrases 20k words : 112.146993 seconds (4.00 G allocations: 253.637 GiB, 11.36% gc time)
# Multithreaded 4 threads
# 100k phrases 2k words : 5.061899 seconds (400.01 M allocations: 25.363 GiB, 37.20% gc time)
# 100k phrases 20k words : 50.659987 seconds (4.00 G allocations: 253.655 GiB, 35.53% gc time)
# Multithreaded 8 Threads
# 100k phrases 20k words : 37.712681 seconds (4.00 G allocations: 253.655 GiB, 51.64% gc time, 0.16% compilation time)
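A variant I have not timed (a sketch, so the allocation savings are an assumption): looping over phrases instead of words avoids materialising one BitVector per word, since count never builds the intermediate array.
function detectwordinphrase3(words::Vector{String}, phrases::Vector{String})
    occurs = Vector{Int}(undef, length(phrases))
    Threads.@threads for j in eachindex(phrases)
        # count how many dictionary words occur in this phrase, with no temporaries
        occurs[j] = count(word -> occursin(word, phrases[j]), words)
    end
    return occurs
end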
## Detection 2:
using Distributed
function detectwordinphrase2(words::Vector{String}, phrases::Vector{String})
    pool = CachingPool(workers())
    return let words = words
        pmap(pool, phrases) do phrase
            return sum(occursin(word, phrase) for word in words)
        end
    end
end
@time textdata_occurs = detectwordinphrase2(reference, textdata_textlong)
# Multithreaded 4 Threads
# 100k phrases 20k words : 88.694093 seconds (4.00 G allocations: 238.460 GiB, 21.67% gc time, 0.18% compilation time)
# Multithreaded 8 Threads
# 100k phrases 20k words : 90.496826 seconds (4.00 G allocations: 238.460 GiB, 21.99% gc time, 0.17% compilation time)
I've been lazy in the construction of the random dataset; my real-world application has variable string lengths. If == can handle them, I'll take it.
The operation that made my R code fast was as.matrix(splitedv) %>% rowAnys, i.e. operating on whole vectors (for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% dict)), then summing the columns of the matrix "row-wise". rowAnys actually returns booleans; it comes from the powerful matrixStats package. The %chin% part, the counterpart of occursin here, comes from data.table.
detectnameinstring <- function(x,dict) {
splitedv <- tstrsplit(x,"[ ]") %>% as.data.table
cols <- colnames(splitedv)
for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% dict)
return(as.matrix(splitedv) %>% rowAnys)
}
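A direct Julia port of that idea might look like this (a sketch, assuming whole-token matches after splitting on spaces are enough, unlike occursin's substring search):
function detectnameinphrase(phrases::Vector{String}, dict::Vector{String})
    dictset = Set(dict) # hashed membership test, the analogue of %chin%
    # any(...) plays the role of rowAnys: true if any token is in the dictionary
    return [any(in(dictset), split(p, ' ')) for p in phrases]
end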
R does the occurrence count of 200k words in 1M phrases in 11 seconds on an Apple M1 Pro, even without explicit multithreading (though data.table reports running 5 threads under the hood). To count instead of just detecting (true/false), one can simply replace rowAnys with rowSums.
Strangely enough, what takes R long is the preparation of the random dataset, which Julia handles in less than a second. But the detection is incomparably faster:
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20.6.0 (64-bit)
> library("data.table")
data.table 1.14.2 using 5 threads (see ?getDTthreads). Latest news: r-datatable.com
> library("magrittr")
> library("matrixStats")
> library("stringr")
> randomString <- function(n = 5000) {
+ a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
+ paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
+ return(a)
+ }
> text.data <- data.table(id = 1:1000000)
> system.time({
+ text.data[, text := {randomString(nrow(text.data))} ]
+ })
user system elapsed
1.718 0.038 1.756
> system.time({
+ text.data[, textlong := sapply(id, function(x) randomString(round(runif(1,1,10))) %>% paste(.,collapse=" ")) ]
+ })
user system elapsed
70.678 0.858 71.513
> reference <- data.table(ref = sample(unique(text.data$text),200000))
> countnameinstring <- function(x,dict) {
+ splitedv <- tstrsplit(x,"[ ]") %>% as.data.table
+ cols <- colnames(splitedv)
+ for (j in cols) set(splitedv, j = j, value = splitedv[[j]] %chin% dict)
+ return(as.matrix(splitedv) %>% rowSums)
+ }
>
> system.time({
+ text.data[ , dicmatchinphrase2 := countnameinstring(textlong,reference$ref)]
+ })
user system elapsed
11.935 0.079 12.010
using Random # for randstring
function fill(n)
    ## Preparing random text sample and dictionary
    searchterms = Vector{String}(undef, n)
    textdata = Vector{Vector{String}}(undef, n)
    Threads.@threads for i in 1:n
        searchterms[i] = randstring('A':'Z', 5)
        textdata[i] = [randstring('A':'Z', 5) for _ in 1:rand(1:10)]
    end
    textdata, unique(searchterms)
end
function processT(textdata, searchset)
    n = size(textdata, 1)
    occurs = Vector{Int}(undef, n)
    Threads.@threads for i in 1:n
        occurs[i] = map(t -> t in searchset ? 1 : 0, textdata[i]) |> sum
    end
    occurs
end
function processS(textdata, searchset)
    n = size(textdata, 1)
    occurs = Vector{Int}(undef, n)
    for i in 1:n
        occurs[i] = map(t -> t in searchset ? 1 : 0, textdata[i]) |> sum
    end
    occurs
end
@time textdata, searchterms = fill(1_000_000);
1.287460 seconds (14.00 M allocations: 845.350 MiB, 58.86% gc time)
@time searchset = Set(searchterms[1:200_000]);
0.022542 seconds (10 allocations: 6.026 MiB)
@time result = processT(textdata, searchset); # 40 threads
0.033092 seconds (1.00 M allocations: 106.820 MiB)
@time result = processS(textdata, searchset); # single threaded
0.722141 seconds (1.00 M allocations: 106.801 MiB)
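One further tweak that might shave off the remaining allocations (an untimed assumption on my side): count with a membership predicate skips the temporary 0/1 vector that map builds for every row.
function processC(textdata, searchset)
    n = length(textdata)
    occurs = Vector{Int}(undef, n)
    Threads.@threads for i in 1:n
        occurs[i] = count(in(searchset), textdata[i]) # no intermediate array
    end
    occurs
end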
Riding an electric bike in the city, I don't have a car, so I bought myself one of those:
julia version 1.7.0
Model Identifier: MacBookPro18,1
Chip: Apple M1 Pro
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 32 GB
System Version: macOS 12.0.1 (21A559)
Kernel Version: Darwin 21.1.0
With your code:
@time textdata, searchterms = fill(1_000_000);
# 0.543755 seconds (14.22 M allocations: 858.079 MiB, 34.46% gc time, 5.28% compilation time)
@time searchset = Set(searchterms[1:200_000]);
# 0.010282 seconds (10 allocations: 6.026 MiB)
@time result = processT(textdata, searchset); # 8 threads
# 0.079716 seconds (1.12 M allocations: 113.163 MiB, 31.10% compilation time)
@time result = processS(textdata, searchset); # single threaded
# 0.439186 seconds (1.07 M allocations: 110.820 MiB, 3.45% compilation time)
The next nut to crack would be string replacement, i.e., to port this one: it replaces parts of the text (XML annotation) whenever an n-gram from a 200k dictionary is found in those 1 million texts.
Should I start a new topic or leave it here, since it works on the same random data?
# prepare a second, longer dictionary with ngrams (1-gram to 4-gram) ----
reference2_ngrams <- text.data$textlong %>% sapply(.,function(x){word(x,sample(1:2,1),sample(1:4,1))}) %>% .[!is.na(.) & str_length(.)>0 ] %>% unique %>% .[1:200000]
# replacement pattern
myPattern <- paste0("<l>",reference2_ngrams,"</l>") %>% setNames(.,reference2_ngrams)
system.time({
text.data[1:100,textAnnotated := str_replace_all(
textlong,
fixed(myPattern)
)]
})
# user system elapsed
# 8.456 0.468 8.940
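As a starting point on the Julia side, a deliberately naive port could look like this (a sketch: the variable names and the plain-substring semantics, matching stringr's fixed(), are assumptions, and looping 200k patterns per text will be slow; it is a correctness baseline only):
function annotate(text::String, ngrams::Vector{String})
    for ng in ngrams # one pass over the text per n-gram: simple but slow
        text = replace(text, ng => "<l>" * ng * "</l>")
    end
    return text
end
# usage, with assumed names: annotated = [annotate(t, reference2_ngrams) for t in texts]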