Regular Expressions and Threads

Regexps seem to be thread-safe.

using BenchmarkTools  # provides @btime
using ThreadTools     # provides tmap
const urlregexp = r"^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{2,3}(\.[^:\/\s\.]{2,3})?)(:\d+)?($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$";
is_url(s::AbstractString) = occursin(urlregexp, s)
urls = readlines("urls")

# Index range of block `tid` when 1:n is split into `nt` nearly equal contiguous blocks.
function getrange(n, tid = Threads.threadid(), nt = Threads.nthreads())
    d, r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    to = from + d - 1 + (tid ≤ r ? 1 : 0)
    from:to
end

@btime map(is_url, urls);
# 343.633 ms (3 allocations: 914.53 KiB)
@btime tmap(is_url, urls);
# 1.656 s (9362901 allocations: 721.62 MiB)
@btime reduce(vcat, tmap(i -> map(is_url, urls[i]), [getrange(length(urls), i) for i in 1:Threads.nthreads()]));
# 70.566 ms (111 allocations: 8.94 MiB)

Here urls is a list of approximately 1M strings.

The conclusion is that Regexps are indeed thread-safe. Also, tmap spinning up a separate task for each item is wasteful here (1.656 s versus 343.633 ms for plain map), while the “hand-coded” variant, which spins up one task per block, does much better (70.566 ms).
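
For anyone who wants the block-per-task idea without ThreadTools, here is a minimal sketch using only Base. The names chunked_map, matches_url and simple_url are made up for illustration, and the short regex is a stand-in for the long one above, not a replacement for it. I have not benchmarked this version, so treat it as a sketch of the pattern rather than a measured result.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical stand-in for the URL regex in the post; any Regex works here.
const simple_url = r"^(https?|ftp)://[^\s/]+(/\S*)?$"
matches_url(s::AbstractString) = occursin(simple_url, s)

# Apply `f` to `xs` with one task per contiguous block instead of one per item.
function chunked_map(f, xs::AbstractVector; ntasks::Integer = nthreads())
    isempty(xs) && return map(f, xs)
    blocksize = cld(length(xs), ntasks)   # at most `ntasks` blocks
    tasks = map(Iterators.partition(xs, blocksize)) do block
        @spawn map(f, block)              # each task handles a whole block
    end
    return reduce(vcat, fetch.(tasks))    # blocks come back in input order
end

sample = ["https://julialang.org/", "not a url", "ftp://example.com/file.txt"]
chunked_map(matches_url, sample) == map(matches_url, sample)
```

Iterators.partition does the same job as getrange here, except that the last block is the short one instead of the remainder being spread over the first blocks.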
