unexpected pmap behaviour

Hello,

I am currently working on a project where I need to match regular expressions against different sequences (strings).
These sequences can vary in size, but each of them fits easily on every worker (no memory issue here).
The difficulty is that I am trying to find the matches of many (~10⁶) regular expressions (“regs” in the example).
While the code is probably not perfect, it runs well with map.
As the number of regexes to match will increase, I will need to parallelize the search.
The problems appear when switching from map to pmap:

  • memory usage seems to increase, for reasons I cannot explain, when the pmap line is called multiple times;
  • the time pmap takes to return the results varies (between ~4.4 s and ~7.7 s on my laptop);
  • the system monitor indicates that 1 (out of 8) workers takes a lot longer to return than the others.

A simplified version of the code would be something like:

using Distributed: pmap, @everywhere
using Random

seq = randstring(MersenneTwister(3), 'a':'z', 300) # 1 string for the example

@everywhere seq = $seq

@everywhere function getmatches(rs::Array{Regex,1})
    matches = []
    for r=rs
        push!(matches, collect(eachmatch(r,seq,overlap=true)))
    end
    matches
end

regs = repeat([Regex.(string.(collect('a':'z')))],800)

pmap(getmatches,regs,batch_size = 100) # running on 8 cores
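
To make the observations above (run-to-run timing variation and memory growth across repeated calls) easier to reproduce, a minimal measurement sketch might look like the following. It assumes local workers are added with addprocs when none exist (2 here; 8 would mirror my setup), and it uses Sys.maxrss, which reports the peak resident memory of a process rather than its current usage, so it can only show growth. The names ncalls, timings and report_rss are hypothetical and only introduced for this illustration.

```julia
using Distributed
using Random

# Assumption: add local workers only if Julia was not started with `-p N`.
nprocs() == 1 && addprocs(2)

seq = randstring(MersenneTwister(3), 'a':'z', 300)
@everywhere seq = $seq   # copy the sequence to every process

# Same function as in the example above, defined on all processes.
@everywhere function getmatches(rs::Array{Regex,1})
    matches = []
    for r = rs
        push!(matches, collect(eachmatch(r, seq, overlap=true)))
    end
    matches
end

# Hypothetical helper: peak resident set size (in MiB) of each worker.
report_rss() = [round(remotecall_fetch(Sys.maxrss, w) / 2^20, digits=1) for w in workers()]

regs = repeat([Regex.(string.(collect('a':'z')))], 800)

pmap(getmatches, regs, batch_size=100)   # warm-up call, so compilation is not timed

ncalls = 3
timings = Float64[]
for i in 1:ncalls
    t = @elapsed pmap(getmatches, regs, batch_size=100)
    push!(timings, t)
    println("call ", i, ": ", round(t, digits=2), " s, peak RSS per worker (MiB): ", report_rss())
end
```

Running `@everywhere GC.gc()` between iterations would be one way to check whether the growth is only garbage that has not been collected yet.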

This is the version of Julia I am currently using:
Julia Version 1.1.0
Commit 80516ca202* (2019-01-21 21:24 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel® Core™ i7-6820HQ CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

I am not sure if it's a known issue with pmap or if I am doing something blatantly wrong, so any help or suggestions from Julia experts are welcome!

Thanks in advance,

David
