Regular Expressions and Threads

Hi All,

In our Flux ML project, we need to process a large volume of data with a single model. We do this by dividing the minibatch into chunks, calculating the gradient on each chunk in a separate thread, and reducing the results (is this called model parallelism?). For convenience, this is wrapped in a small project,
https://github.com/pevnak/TrainTools.jl, which parallelizes the construction of minibatches as well. We have successfully tested this scenario and it worked reasonably well.

As the project has evolved, our models now contain special string nodes holding URLs and file paths, which are expanded when the model is applied. When we use this expansion, multithreading stops working and everything is effectively computed on a single thread. Since the expansion relies on regular expressions, the obvious question is whether regular expressions are compatible with threading. I use Julia 1.3.0-rc3.0.

Would https://github.com/BioJulia/Automa.jl be compatible with multi-threading?

Thanks in advance for any answers.

Having had a quick look at the Base code, regular expressions should be thread-safe to create and use in Julia 1.3, but regex compilation is serialized behind a global lock. I assume there shouldn’t be any real contention if you’re just using regexes to match strings, but that’s a question about the PCRE2 library. You could read deeper into the source, or just do the simple thing: write a small regex benchmark to see how matching throughput changes with the number of threads.
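A benchmark along those lines might look like the sketch below. It uses only Base, and the pattern, the synthetic inputs, and the function names count_serial and count_threaded are all made up for illustration; substitute the real regex and data. Start Julia with different values of JULIA_NUM_THREADS and compare the two timings.

```julia
using Base.Threads

# Hypothetical pattern and synthetic inputs, chosen only for illustration.
pattern = r"^https?://[^/\s]+(/\S*)?$"
inputs = [isodd(i) ? "https://example.com/page/$i" : "not a url $i" for i in 1:200_000]

# Serial baseline: one occursin call per string.
count_serial(re, xs) = count(x -> occursin(re, x), xs)

# Threaded variant: one contiguous block of indices per thread,
# each block writing its match count into its own slot.
function count_threaded(re, xs)
    nchunks = nthreads()
    partial = zeros(Int, nchunks)
    @threads for c in 1:nchunks
        lo = div((c - 1) * length(xs), nchunks) + 1
        hi = div(c * length(xs), nchunks)
        n = 0
        for i in lo:hi
            n += occursin(re, xs[i]) ? 1 : 0
        end
        partial[c] = n
    end
    return sum(partial)
end

# Run both once first so that compilation time is not measured.
count_serial(pattern, inputs)
count_threaded(pattern, inputs)

t_serial = @elapsed count_serial(pattern, inputs)
t_threaded = @elapsed count_threaded(pattern, inputs)
println("threads = ", nthreads(), ", serial = ", t_serial, " s, threaded = ", t_threaded, " s")
```

If the threaded time does not drop as threads are added, that would point to contention somewhere in the matching path.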

I’m not sure about Automa but as a pure Julia code generator I would guess the generated code is thread safe.

Thanks, I have asked my colleague (a seasoned Pythonista and a Julia newbie) to take a look at it. I will post our conclusions.

Regexps seem to be thread-safe.

using BenchmarkTools  # provides @btime
using ThreadTools     # provides tmap
const urlregexp = r"^((http[s]?|ftp):\/\/)?\/?([^\/\.]+\.)*?([^\/\.]+\.[^:\/\s\.]{2,3}(\.[^:\/\s\.]{2,3})?)(:\d+)?($|\/)([^#?\s]+)?(.*?)?(#[\w\-]+)?$";
is_url(s::AbstractString) = occursin(urlregexp, s)
urls = readlines("urls")

function getrange(n, tid = Threads.threadid(), nt = Threads.nthreads())
    d , r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    to = from + d - 1 + (tid ≤ r ? 1 : 0)
    from:to
end

@btime map(is_url, urls);
# 343.633 ms (3 allocations: 914.53 KiB)
@btime tmap(is_url, urls);
# 1.656 s (9362901 allocations: 721.62 MiB)
@btime reduce(vcat, tmap(i -> map(is_url, urls[i]), [getrange(length(urls), i) for i in 1:Threads.nthreads()]));
# 70.566 ms (111 allocations: 8.94 MiB)

urls was a list of approximately 1M strings.

The conclusion is that Regexps are indeed thread-safe. Also, tmap spawning a separate task for each item is wasteful here, and the “hand-coded” variant, which spawns one task per block, is about five times faster than the serial map in the timings above.
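For anyone who wants the block-per-task approach without ThreadTools, here is a minimal sketch using only Base (Julia 1.3 or later). chunked_map is a hypothetical helper, not part of ThreadTools or TrainTools, and the short regex and sample strings are placeholders for the real ones.

```julia
using Base.Threads

# Hypothetical helper: apply `f` to contiguous blocks of `xs`, one task per
# block, and concatenate the per-block results in their original order.
function chunked_map(f, xs; ntasks = nthreads())
    isempty(xs) && return map(f, xs)
    blocksize = cld(length(xs), ntasks)  # ceiling division, so at most ntasks blocks
    tasks = [Threads.@spawn map(f, block) for block in Iterators.partition(xs, blocksize)]
    return reduce(vcat, fetch.(tasks))
end

# Placeholder regex and data, for illustration only.
simple_url = r"^https?://"
is_simple_url(s::AbstractString) = occursin(simple_url, s)

samples = ["https://julialang.org", "plain text", "http://example.com/a", "ftp-like"]
flags = chunked_map(is_simple_url, samples)
println(flags)
```

Each task handles a whole block, so the scheduling overhead is paid once per block instead of once per string.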
