Hi everyone - I have some questions about using `pmap`. I'm trying to use `pmap` to run some embarrassingly parallel code on Amazon EC2. I've been running into a problem where each worker keeps using more and more memory, to the point where one of the workers terminates, and the others then stop as well. I'd like to understand how this happens and what I can do to better manage the execution.
Here is an example. I start a `t2.large` instance (AMI: JuliaPro 0.6.2.1_mkl RHEL 7.4 (ami-01e5a57b)). I then run Julia:
$ JuliaPro-0.6.2.1/julia
Running `top` at this stage gives me the following:
10359 ec2-user 20 0 113128 1400 1204 S 0.0 0.0 0:00.00 julia
10365 ec2-user 20 0 479876 146188 48328 S 0.0 1.8 0:00.40 julia
I create the workers and dummy functions for `pmap`:
julia> addprocs(4)
4-element Array{Int64,1}:
2
3
4
5
julia> @everywhere f = function(x)
A = rand(x*10000,x*10000)
return 0
end
julia> err_fn = function(e)
@show e
@show myid()
end
(::#7) (generic function with 1 method)
julia> @show nprocs()
nprocs() = 5
5
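As an aside, a quick back-of-the-envelope calculation (my own arithmetic, nothing official) suggests the larger inputs here can't possibly fit, since a `t2.large` only has 8 GiB of RAM - so the errors themselves are not the surprise; what I don't understand is the memory that stays allocated afterwards:

```julia
# Rough size of the matrix f allocates for input x:
# (x*10_000)^2 Float64 elements at 8 bytes each, in GiB.
matsize_gib(x) = (x * 10_000)^2 * 8 / 2^30

matsize_gib(1)  # ≈ 0.745 GiB
matsize_gib(4)  # ≈ 11.9 GiB — already more than the t2.large's 8 GiB
```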
Running `top` now shows the Julia processes of the workers:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10359 ec2-user 20 0 113128 1400 1204 S 0.0 0.0 0:00.00 julia
10365 ec2-user 20 0 648556 172100 49648 S 0.0 2.1 0:01.72 julia
10368 ec2-user 20 0 479380 145884 48312 S 0.0 1.8 0:00.44 julia
10369 ec2-user 20 0 578408 147696 48540 S 0.0 1.8 0:00.59 julia
10371 ec2-user 20 0 578312 150800 48512 S 0.0 1.9 0:00.59 julia
10373 ec2-user 20 0 578864 149972 48512 S 0.0 1.9 0:00.57 julia
When I now run `pmap`, some of the workers throw `OutOfMemoryError()`, one worker eventually terminates, and the call fails:
julia> out = pmap(f, 1:15, on_error = err_fn)
From worker 5: e = OutOfMemoryError()
From worker 5: myid() = 5
From worker 5: e = OutOfMemoryError()
From worker 5: myid() = 5
From worker 5: e = OutOfMemoryError()
From worker 5: myid() = 5
From worker 5: e = OutOfMemoryError()
From worker 5: myid() = 5
From worker 5: e = OutOfMemoryError()
From worker 5: myid() = 5
From worker 2: e = OutOfMemoryError()
From worker 2: myid() = 2
From worker 5: e = OutOfMemoryError()
From worker 5: myid() = 5
From worker 5: e = OutOfMemoryError()
From worker 5: myid() = 5
From worker 2: e = OutOfMemoryError()
From worker 2: myid() = 2
From worker 5: e = OutOfMemoryError()
From worker 5: myid() = 5
From worker 2: e = OutOfMemoryError()
From worker 2: myid() = 2
From worker 5: e = OutOfMemoryError()
From worker 5: myid() = 5
Worker 4 terminated.
ERROR: ProcessExitedException()ERROR (unhandled task failure): EOFError: read end of file
Stacktrace:
[1] #571 at ./asyncmap.jl:178 [inlined]
[2] foreach(::Base.##571#573, ::Array{Any,1}) at ./abstractarray.jl:1733
[3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./asyncmap.jl:178
[4] wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Base.Distributed.##204#207{WorkerPool}, ::Function, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./asyncmap.jl:154
[5] #async_usemap#556(::Function, ::Void, ::Function, ::Base.Distributed.##188#190, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./asyncmap.jl:103
[6] (::Base.#kw##async_usemap)(::Array{Any,1}, ::Base.#async_usemap, ::Function, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./<missing>:0
[7] (::Base.#kw##asyncmap)(::Array{Any,1}, ::Base.#asyncmap, ::Function, ::UnitRange{Int64}) at ./<missing>:0
[8] #pmap#203(::Bool, ::Int64, ::Function, ::Array{Any,1}, ::Void, ::Function, ::WorkerPool, ::Function, ::UnitRange{Int64}) at ./distributed/pmap.jl:126
[9] (::Base.Distributed.#kw##pmap)(::Array{Any,1}, ::Base.Distributed.#pmap, ::WorkerPool, ::Function, ::UnitRange{Int64}) at ./<missing>:0
[10] #pmap#213(::Array{Any,1}, ::Function, ::Function, ::UnitRange{Int64}) at ./distributed/pmap.jl:156
[11] (::Base.Distributed.#kw##pmap)(::Array{Any,1}, ::Base.Distributed.#pmap, ::Function, ::UnitRange{Int64}) at ./<missing>:0
julia> @show nprocs()
nprocs() = 4
4
Checking `top` now (the bolded worker process is holding a lot of memory):
10359 ec2-user 20 0 113128 196 0 S 0.0 0.0 0:00.00 julia
10365 ec2-user 20 0 713632 170536 15616 S 0.0 2.1 0:03.91 julia
10368 ec2-user 20 0 612844 102656 760 S 0.0 1.3 0:01.33 julia
**10369 ec2-user 20 0 3770456 3.082g 828 S 0.0 40.4 0:02.05 julia**
10373 ec2-user 20 0 711352 106108 764 S 0.0 1.3 0:01.38 julia
When I run `@everywhere gc()`, things return to normal:
10359 ec2-user 20 0 113128 196 0 S 0.0 0.0 0:00.00 julia
10365 ec2-user 20 0 713764 154368 15864 S 0.0 1.9 0:04.09 julia
10368 ec2-user 20 0 612844 113224 6304 S 0.0 1.4 0:01.44 julia
10369 ec2-user 20 0 645452 112880 6584 S 0.0 1.4 0:02.12 julia
10373 ec2-user 20 0 711352 111652 6308 S 0.0 1.4 0:01.49 julia
My questions are:

1. Is there a recommended way to run `pmap` that is robust to these types of failures?
2. This is not shown in the example, but I have noticed that when I run my own code, the memory usage of each worker seems to keep increasing, despite my trying to invoke the garbage collector by setting the array to 0 and calling `@everywhere gc()`. I do not encounter this when running the code locally. Is this normal? Is there a way to completely clear/reset a worker each time it executes the function given to `pmap`?
3. (More generally) Why are there two Julia processes when I start Julia on EC2?
4. (Related to 2) Does Julia tend to use more memory on EC2 than on "normal" machines?
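To clarify what I have in mind for the first two questions: the only workaround I've come up with so far is to process the inputs in chunks and recycle the workers between chunks, roughly like the sketch below (0.6 syntax; `pmap_in_chunks` and the chunk size are names/values I made up for illustration). Is there a better pattern?

```julia
# Sketch of a workaround: run pmap over one chunk of inputs at a time,
# then tear down the workers and start fresh ones, so that any memory
# they are still holding is released along with the old processes.
function pmap_in_chunks(f, xs; chunk_size = 4, n = 4)
    results = Any[]
    for chunk in Base.Iterators.partition(xs, chunk_size)
        append!(results, pmap(f, chunk, on_error = e -> e))
        rmprocs(workers())   # kill the current workers...
        addprocs(n)          # ...and replace them with fresh processes
        # NOTE: f (and anything else set up with @everywhere) would
        # need to be re-defined on the new workers at this point.
    end
    return results
end
```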
I apologize if these are stupid questions - thank you in advance for any suggestions or information you can provide.