Regular Expressions and Threads

In our Flux ML project, we need to process a large volume of data with a single model. We do this by dividing the minibatch into chunks, calculating the gradient on each chunk in a separate thread, and reducing the results (is this called model parallelism?). For convenience, this is wrapped in this small project, which parallelizes the construction of minibatches as well. We have successfully tested this scenario, and it worked reasonably well.
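To make the chunk-and-reduce scheme concrete, here is a minimal sketch. It does not use Flux; `grad_sample`, `chunk_grad`, and `threaded_grad` are hypothetical names, and the "model" is a single scalar weight with a squared-error loss, chosen only so the example is self-contained. It assumes a non-empty minibatch. (Splitting the batch across workers like this is commonly called data parallelism rather than model parallelism.)

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical stand-in for a per-sample gradient: d/dw of (w*x - y)^2.
grad_sample(w, x, y) = 2 * (w * x - y) * x

# Sum of the per-sample gradients over one chunk of the minibatch.
chunk_grad(w, xs, ys) = sum(grad_sample(w, x, y) for (x, y) in zip(xs, ys))

function threaded_grad(w, xs, ys; nchunks = nthreads())
    n = length(xs)
    chunklen = cld(n, nchunks)
    # One task per chunk; every task only reads the shared inputs.
    tasks = map(Iterators.partition(1:n, chunklen)) do idx
        @spawn chunk_grad(w, view(xs, idx), view(ys, idx))
    end
    # Reduce the partial gradients into the gradient of the whole minibatch.
    return sum(fetch, tasks)
end
```

Because the partial sums are added in a different order than in a serial loop, the result can differ from the serial gradient by floating-point rounding, so compare with `≈` rather than `==`.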

As the project evolved, our models gained special string nodes containing URLs and file paths, which are expanded when the model is applied. When we use this expansion, multi-threading stops working and everything is effectively calculated on a single thread. So the obvious question is: are regular expressions compatible with threading? I use Julia 1.3.0-rc3.0.

Would Automa be compatible with multi-threading?

Having had a quick look at the Base code, regular expressions should be thread-safe to create and use in Julia 1.3, but regex compilation is serialized behind a global lock. I assume there shouldn’t be any real contention if you’re just trying to use regexes to match strings, but that’s a question about the PCRE2 library. You could read deeper into the source, or just do the simple thing: write a simple regex benchmark to see how matching throughput changes with the number of threads.
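A minimal sketch of such a benchmark might look like the following. The pattern, the generated inputs, and the names `count_serial` and `count_threaded` are all hypothetical, picked only for illustration. Run it with different values of `JULIA_NUM_THREADS` and compare the timings; the check that both counts agree is a cheap sanity test that concurrent matching gives the same answers.

```julia
using Base.Threads: @threads, nthreads

# Hypothetical pattern and inputs, chosen only for illustration.
word_re = r"^[a-z]+\d+$"
inputs = [isodd(i) ? "item$i" : "item-$i" for i in 1:200_000]

# Serial baseline: number of strings that match.
count_serial(re, strs) = count(s -> occursin(re, s), strs)

# Threaded version: all threads share one compiled regex.
function count_threaded(re, strs)
    # Vector{Bool} rather than BitVector, so each iteration
    # writes to its own byte and threads do not interfere.
    hits = Vector{Bool}(undef, length(strs))
    @threads for i in eachindex(strs)
        hits[i] = occursin(re, strs[i])
    end
    return count(hits)
end

# The first calls also trigger compilation, so time the second ones.
n_serial = count_serial(word_re, inputs)
n_threaded = count_threaded(word_re, inputs)
t_serial = @elapsed count_serial(word_re, inputs)
t_threaded = @elapsed count_threaded(word_re, inputs)

println("threads = ", nthreads(), ", serial = ", t_serial, " s, threaded = ", t_threaded, " s")
```

For more careful measurements, `@btime` from BenchmarkTools is a better choice than a single `@elapsed`.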

I’m not sure about Automa, but since it is a pure-Julia code generator, I would guess the generated code is thread-safe.

Thanks, I have asked my colleague (a seasoned Pythonist and a Julia newbie) to take a look at it. I will post our conclusions.

Regexps seem to be thread-safe.

using BenchmarkTools  # provides @btime
using ThreadTools     # provides tmap
const urlregexp = r"^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{2,3}(\.[^:\/\s\.]{2,3})?)(:\d+)?($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$";
is_url(s::AbstractString) = occursin(urlregexp, s)
urls = readlines("urls")

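# Index range of the block handled by thread `tid` when `n` items are split across `nt` threads.
function getrange(n, tid = Threads.threadid(), nt = Threads.nthreads())
    d, r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    to = from + d - 1 + (tid ≤ r ? 1 : 0)
    return from:to
end
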
@btime map(is_url, urls);
# 343.633 ms (3 allocations: 914.53 KiB)
@btime tmap(is_url, urls);
# 1.656 s (9362901 allocations: 721.62 MiB)
@btime reduce(vcat, tmap(i -> map(is_url, urls[i]), [getrange(length(urls), i) for i in 1:Threads.nthreads()]));
# 70.566 ms (111 allocations: 8.94 MiB)

Here, urls was a list of approximately 1M strings.

The conclusion is that regexps are indeed thread-safe. Also, tmap spawning a separate task for each item is rather wasteful here (slower than the serial map, with millions of allocations), and the “hand-coded” variant, which spawns one task per block, does much better.
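The block-wise variant can also be written without ThreadTools. The following is a sketch only: `blockmap` is a hypothetical helper name, not part of any package, and it assumes `f` is safe to call concurrently from several tasks.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical helper: apply `f` to every element of `xs`, using one task
# per contiguous block instead of one task per element.
function blockmap(f, xs::AbstractVector; nblocks = nthreads())
    isempty(xs) && return map(f, xs)
    blocklen = cld(length(xs), nblocks)
    # Each task maps `f` over its own block of the input.
    tasks = map(Iterators.partition(xs, blocklen)) do block
        @spawn map(f, block)
    end
    # Fetch the per-block results in order and concatenate them.
    return reduce(vcat, map(fetch, tasks))
end
```

With the definitions from the benchmark above, `blockmap(is_url, urls)` should return the same vector as `map(is_url, urls)`, while creating only about as many tasks as there are threads.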

