Hello,
I am currently working on a project where I need to match regular expressions against different sequences (strings).
On the one hand, these sequences can vary in size but I can easily fit them on all workers separately (no memory issue here).
On the other hand, I am trying to find the matches of many (~10⁶) regular expressions (“regs” in the example).
While the code is probably not perfect, it runs well with map.
As the number of regexes to match will increase, I will need to parallelize the search.
The problems appear when switching from map to pmap:
- there seems to be an increase in memory, which I cannot explain, when calling the pmap line multiple times
- the time required by pmap to return the results varies (between ~4.4s and 7.7s on my laptop)
- the system monitor indicates that one of the 8 workers takes much longer to return than the others
A simpler version of the code would be something like:
using Distributed: pmap, @everywhere
using Random
seq = randstring(MersenneTwister(3), 'a':'z', 300) # 1 string for the example
@everywhere seq = $seq
@everywhere function getmatches(rs::Array{Regex,1})
    matches = []
    for r in rs
        push!(matches, collect(eachmatch(r, seq, overlap=true)))
    end
    matches
end
regs = repeat([Regex.(string.(collect('a':'z')))],800)
pmap(getmatches,regs,batch_size = 100) # running on 8 cores
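The timing and memory behaviour described above could be measured across repeated calls with something like the following minimal sketch. The helper `measure_pmap` and the explicit `addprocs(8)` are assumptions added for illustration and are not part of my project. Note that `Sys.maxrss` reports the peak resident memory of a process, so it can show growth between calls but not memory being released.

```julia
using Distributed
using Random

# Assumption: 8 local workers, to mirror the 8 cores mentioned above.
nprocs() == 1 && addprocs(8)

seq = randstring(MersenneTwister(3), 'a':'z', 300)
@everywhere seq = $seq

@everywhere function getmatches(rs::Array{Regex,1})
    matches = []
    for r in rs
        push!(matches, collect(eachmatch(r, seq, overlap=true)))
    end
    matches
end

regs = repeat([Regex.(string.(collect('a':'z')))], 800)

# Hypothetical helper: call pmap n times and record, for each call,
# the wall time and the largest peak resident memory among the workers (MiB).
function measure_pmap(f, data; n = 5, batch_size = 100)
    times = Float64[]
    peak_mib = Float64[]
    for _ in 1:n
        t = @elapsed pmap(f, data; batch_size = batch_size)
        rss = maximum(remotecall_fetch(Sys.maxrss, w) for w in workers())
        push!(times, t)
        push!(peak_mib, rss / 2^20)
    end
    return times, peak_mib
end

times, peak_mib = measure_pmap(getmatches, regs)
for i in eachindex(times)
    println("run ", i, ": ", round(times[i], digits = 2), " s, peak worker RSS ",
            round(peak_mib[i], digits = 1), " MiB")
end
```

If the peak memory keeps climbing from one run to the next, that would match the first observation; a large spread in the times would match the second.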
This is the version of Julia I am currently using:
Julia Version 1.1.0
Commit 80516ca202* (2019-01-21 21:24 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core™ i7-6820HQ CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
I am not sure if it's a known issue with pmap or if I am doing something blatantly wrong, so any help or suggestions from Julia experts are welcome!
Thanks in advance,
David