Worker process memory leaks in Julia v1.1.0

parallel
#1

Hi, community!

I'm facing a strange problem. It is probably a memory leak in the worker processes.

I start the instance with a command like this:

/opt/julia-1.1.0/bin/julia --project=/opt/seismo-api-gm-concave-module.jl --load=/opt/seismo-api-gm-concave-module.jl/conf/seismo-api-gm-concave-module.conf --procs=3 /opt/seismo-api-gm-concave-module.jl/runservice.jl

Environment:

julia version 1.1.0

Linux gmm-1 4.4.0-141-generic #167-Ubuntu SMP Wed Dec 5 10:40:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu 16.04.5 LTS

After startup I see 4 processes: 1 main process and 3 workers. The workers are started automatically with this command (as shown by ps auxf):

/opt/julia-1.1.0/bin/julia -Cnative -J/opt/julia-1.1.0/lib/julia/sys.so -g1 --bind-to 127.0.0.1 --worker

The main process initiates job processing from a RemoteChannel, as described in the docs:

const jobs = RemoteChannel(()->Channel{Any}(10000));
@everywhere function do_work(jobs) # define work function everywhere
  while true
    event_set = take!(jobs)
    try
      result = pga_concave_hull_proc(event_set["event_id"],
                                     event_set["model"],
                                     event_set["config"])
      @info "POSTing data OK $(event_set["event_id"]) at $(now()) on worker $(myid()) and pid $(getpid())" result
    catch err
      @warn err
      @warn "ERROR occurred while processing ** $(event_set["event_id"]) ** at $(now()) on worker $(myid()) and pid $(getpid())"
    finally
      @info "perform garbidge collection" GC.gc() InteractiveUtils.varinfo()
    end
  end
end
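
For context, here is a minimal sketch of how a worker loop like this is typically started and fed from the main process, following the RemoteChannel pattern from the Distributed docs. The names do_work_demo and results, the addprocs(2) call (standing in for --procs=3), and the dummy payload are assumptions for illustration only, not the service's actual code:

```julia
using Distributed

addprocs(2)  # assumption: stands in for --procs=3 on the command line

# Both channels live on the main process; workers reach them remotely.
jobs = RemoteChannel(() -> Channel{Any}(100))
results = RemoteChannel(() -> Channel{Any}(100))  # hypothetical, for the demo only

@everywhere function do_work_demo(jobs, results)
    while true
        event_set = take!(jobs)  # blocks until a job is available
        # Hypothetical stand-in for pga_concave_hull_proc: echo the id and worker.
        put!(results, (event_set["event_id"], myid()))
    end
end

# Start the loop on every worker; remote_do returns without waiting.
for p in workers()
    remote_do(do_work_demo, p, jobs, results)
end

# Feed a few dummy jobs from the main process.
for id in 1:4
    put!(jobs, Dict("event_id" => id))
end

# Collect one result per job.
done = [take!(results) for _ in 1:4]
```

Each worker blocks in take!(jobs) between jobs, so the loops stay alive for the lifetime of the worker process. Any memory a worker retains therefore accumulates across jobs.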

At startup each worker takes only about 1.5% of memory: modules, global constants, and some functions propagated to the workers. Here is the varinfo() output on a worker, called from inside the do_work(jobs) function:

Feb 14 13:08:20 gmm-1 julia[6750]: ┌ Info: perform garbidge collection
Feb 14 13:08:20 gmm-1 julia[6750]: │   GC.gc() = nothing
Feb 14 13:08:20 gmm-1 julia[6750]: │   InteractiveUtils.varinfo() =
Feb 14 13:08:20 gmm-1 julia[6750]: │    name                                   size summary
Feb 14 13:08:20 gmm-1 julia[6750]: │    ––––––––––––––––––––––––––––––––– ––––––––– ––––––––––––––––––––––––––––––––––––––
Feb 14 13:08:20 gmm-1 julia[6750]: │    API_TOKEN                         1.150 KiB String
Feb 14 13:08:20 gmm-1 julia[6750]: │    Base                                        Module
Feb 14 13:08:20 gmm-1 julia[6750]: │    Core                                        Module
Feb 14 13:08:20 gmm-1 julia[6750]: │    DB                                 59 bytes SQLite.DB
Feb 14 13:08:20 gmm-1 julia[6750]: │    Distributed                       1.243 MiB Module
Feb 14 13:08:20 gmm-1 julia[6750]: │    GET_REPORT_URL                     51 bytes String
Feb 14 13:08:20 gmm-1 julia[6750]: │    HEADERS_BASE                      1.666 KiB Dict{String,String} with 2 entries
Feb 14 13:08:20 gmm-1 julia[6750]: │    IP_ADDRESS                         19 bytes String
Feb 14 13:08:20 gmm-1 julia[6750]: │    MAX_LAT_GRID                        8 bytes Int64
Feb 14 13:08:20 gmm-1 julia[6750]: │    MAX_LON_GRID                        8 bytes Int64
Feb 14 13:08:20 gmm-1 julia[6750]: │    Main                                        Module
Feb 14 13:08:20 gmm-1 julia[6750]: │    PGA_INTERVALS                     112 bytes 9-element Array{Float64,1}
Feb 14 13:08:20 gmm-1 julia[6750]: │    PORT                                8 bytes Int64
Feb 14 13:08:20 gmm-1 julia[6750]: │    POST_PGA_CONCAVE_URL               74 bytes String
Feb 14 13:08:20 gmm-1 julia[6750]: │    concave_hull_with_log               0 bytes typeof(concave_hull_with_log)
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_crustal_pga         139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_crustal_pga_2       139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_crustal_pgv         139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_crustal_psa_03      139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_crustal_psa_10      139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_crustal_psa_30      139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_interplate_pga      139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_interplate_pgv      139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_interplate_psa_03   139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_interplate_psa_10   139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_interplate_psa_30   139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_intraplate_pga      139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_intraplate_pga_asid 139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_intraplate_pgv      139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_intraplate_psa_03   139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_intraplate_psa_10   139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    config_mf2013_intraplate_psa_30   139 bytes Params_mf2013
Feb 14 13:08:20 gmm-1 julia[6750]: │    do_work                             0 bytes typeof(do_work)
Feb 14 13:08:20 gmm-1 julia[6750]: │    get_grid                            0 bytes typeof(get_grid)
Feb 14 13:08:20 gmm-1 julia[6750]: │    json_concave_data                   0 bytes typeof(json_concave_data)
Feb 14 13:08:20 gmm-1 julia[6750]: │    parse_loc_values                    0 bytes typeof(parse_loc_values)
Feb 14 13:08:20 gmm-1 julia[6750]: │    parse_report_version_timestamp      0 bytes typeof(parse_report_version_timestamp)
Feb 14 13:08:20 gmm-1 julia[6750]: │    pga_concave_hull_proc               0 bytes typeof(pga_concave_hull_proc)
Feb 14 13:08:20 gmm-1 julia[6750]: │    post_pga_concave                    0 bytes typeof(post_pga_concave)
Feb 14 13:08:20 gmm-1 julia[6750]: └    project_dir                        44 bytes String

This varinfo() output does not change even after 1000 jobs, because all the work is performed inside functions and no global objects are created apart from the initial global constants. So the varinfo() output is identical no matter how many jobs have run.

But the memory usage of each worker process keeps growing. This is what I observe:
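
A small sketch of why the listing can stay constant: varinfo() reports global bindings of a module (Main by default), so anything held only by local variables never appears in it. The function name hold_locally and the 10 MiB buffer are made up for the illustration:

```julia
using InteractiveUtils

# Hypothetical function: allocates a buffer that only a local variable refers to.
function hold_locally()
    buf = zeros(UInt8, 10 * 1024^2)  # 10 MiB, alive while this function runs
    # Render the same table that the @info call in do_work logs.
    listing = sprint(show, MIME("text/plain"), varinfo())
    return listing, sizeof(buf)      # buf is still referenced here
end

listing, nbytes = hold_locally()
println("local buffer size in bytes: ", nbytes)
println("listing mentions buf: ", occursin(r"\bbuf\b", listing))
```

The buffer is live while varinfo() runs, yet the table has no row for it. The same presumably holds for memory allocated by C libraries, which varinfo() has no way to see either.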

  • one process was killed by the Linux OOM killer.
  • the two remaining workers take ~70% of system memory:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
juliaus+  6750  0.0  1.5 1115920 256216 ?      Ssl  Feb13   0:31 /opt/julia-1.1.0/bin/julia --project=/opt/seismo-api-gm-concave-module.jl --load=/opt/seismo-api-gm-concave-mod
juliaus+  6756  2.0 36.4 7476288 5984652 ?     Ssl  Feb13  91:55  \_ /opt/julia-1.1.0/bin/julia -Cnative -J/opt/julia-1.1.0/lib/julia/sys.so -g1 --bind-to 127.0.0.1 --worker
juliaus+  6759  2.0 36.3 7317696 5978164 ?     Ssl  Feb13  91:46  \_ /opt/julia-1.1.0/bin/julia -Cnative -J/opt/julia-1.1.0/lib/julia/sys.so -g1 --bind-to 127.0.0.1 --worker

Why do the workers take so much memory after many jobs? And why doesn't varinfo() show the objects that take up so much memory?
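
One way to narrow this down might be to log process-level numbers next to varinfo() in the finally block. Below is a rough sketch, assuming Sys.maxrss() (peak resident set size, so it only ever grows) and the unexported Base.gc_live_bytes() are available in the Julia version in use; I have not verified the latter on 1.1.0. The helper name memory_snapshot is hypothetical:

```julia
# Hypothetical helper: process-level memory numbers for the current process.
function memory_snapshot()
    return (pid = getpid(),
            maxrss_mib = Sys.maxrss() / 2^20,            # peak RSS reported by the OS
            gc_live_mib = Base.gc_live_bytes() / 2^20)   # bytes the Julia GC considers live
end

before = memory_snapshot()
data = [rand(1000) for _ in 1:1000]   # roughly 8 MB of Julia-managed arrays
after = memory_snapshot()

data = nothing                        # drop the only reference
GC.gc()
collected = memory_snapshot()

@info "memory snapshots" before after collected
```

If gc_live_mib stays roughly flat from job to job while the RSS seen by ps keeps climbing, the growth is probably outside the Julia-managed heap (for example in a C library) or comes from memory the process has not returned to the OS. If gc_live_mib itself grows, some Julia object is still reachable somewhere.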

#2

I found this issue: https://github.com/JuliaLang/julia/issues/28887

I don’t know what a lazy worker is, but it looks like a similar problem.