Strange deadlock in 1.7.0-beta3 didn't happen in 1.5.x

I have an Agents.jl simulation that worked fine last summer… I’m back to looking at the code, and it’s deadlocking and sitting there using 0% of cpu.

It’s a script that uses Threads.@spawn to run some threads each of which reads a simulation description from a channel, runs the simulation, and then put! some summary statistics to a statistics channel. In a separate single thread a function reads from the stats channel and writes stats to a SQLite database.

As I said, all of this worked fine last summer on some version like 1.5.2 or something along those lines. Today in 1.7.0-beta3 (or beta2) it gets to the point where the worker threads have each read a simulation description (and print out what it is) and where the SQLite writer says it’s about to grab the next DB result… but at that point, everything deadlocks and 0% CPU usage.

It doesn’t appear that the ABM runs at all. Here is the code the threads are running. One thread is spawned to run sqliteinserter, and 5 threads are spawned to run runasim:


function sqliteinserter(dfchannel,dbfile)
    
    db = SQLite.DB(dbfile);
    SQLite.drop!(db,"sims";ifexists=true)
    SQLite.drop!(db,"stats";ifexists=true)
    SQLite.createtable!(db,"sims",Tables.Schema((:simid,:n,:density,:infprob,:testevery,:poolsize,
                                                 :quarduration,:costoftest,:limitdet),
                                                (Int,Int,Float64,Float64,Int,Int,Int,Float64,Float64)));
    
    SQLite.createtable!(db,"stats",Tables.Schema((:simid,:day,:ninfected,:nquarantined,:tottestcost),
                                                 (Int,Int,Int,Int,Float64)));
    
    while true
        println("about to grab next results for DB.")
        tup = take!(dfchannel)
        if tup === nothing
            return
        end
        iter,adf,mdf,pars = tup;
        println("Completed iteration $iter, adding to SQLite database");
        cmd = "insert into sims values(" * Printf.@sprintf("%d,%d,%f,%f,%d,%d,%d,%f,%f);",pars...)
        DBInterface.execute(db,cmd);
        
        DBInterface.execute(db,"drop table if exists tempstats;")
        SQLite.load!(mdf,db,"tempstats",temp=true,ifnotexists=true)
        
        DBInterface.execute(db,"insert into stats select $iter,day,ninfected,nquarantined,tottestcost from tempstats;")
        
    end
end

function runasim(simchannel,dfchannel)
    while true
        tup = take!(simchannel)
        println("Got new sim $tup")
        if tup === nothing
            println("Done with thread $(Threads.threadid())")
            return
        end
        iter,sim = tup;
        testmodel = covid_model(;sim...)
        println("About to run model!")
        adf,mdf = run!(testmodel,step_covperson!,stepcovmodel!,365; adata=[],mdata=[:day,:ninfected,:nquarantined,:tottestcost])
        println("Finished sim $tup")
        put!(dfchannel,(iter,adf,mdf,(simid=iter,sim...)))
    end
    
end

I see printed messages like:

Got new sim (1, (n = 10000, density = 7.957747154594767, infprob = 0.008, testevery = 14, poolsize = 1, quarduration = 14, costoftest = 120.0, limitdet = 1000.0))

but I never see

About to run model!

So at some point after it takes the message from its channel, and before it gets to the println for that message… everything halts.

I’m pretty sure that’s not me doing something stupid, and the tasks aren’t printing the exit message. I don’t think I’m doing anything wrong here, so it feels like there’s a bug. Also it all worked a year ago.
WDYT?

How exactly are the threads started? Seems to be missing from your code, I suspect that’s where the deadlock is coming from.

Another reason could be that there’s an error happening in the spawned task in covid_model, so having that available would help as well.


Aside, I’d recommend not using string concatenation to form queries, as that will lead to SQL injection should that code/those values be user accessible. Something like this should to what you want:

DBInterface.execute(db, "insert into sims values (?,?,?,?,?,?,?,?,?)", pars)

There is a main script where I spawn the threads to run the functions mentioned here. As I said, the Debug print messages do print out, so basically something deadlocks between those statements, ill find covid_model and get back to you.

None of this is “user accessible” as its just me running different scenarios, but I will look at the query with parameters anyway for future use. Thanks!

1 Like

There doesn’t appear to be much here that could cause a deadlock… but here’s the covid_model function, it just initializes an ABM model.


function covid_model(;n=10000,density=25/pi,infprob=3e-3,testevery=4,poolsize=20,quarduration=4,costoftest=10,limitdet=10000.0)
    s = sqrt(n/density);
    space2d = ContinuousSpace(2;periodic=true,extend=(s,s))
    model = ABM(COVPerson,space2d,properties=Dict(:poolsize => poolsize,:testevery=>testevery,
                                                  :day=>0,:quarduration => quarduration,
                                                  :ninfected => 0,
                                                  :nquarantined => 0,
                                                  :tottestcost => 0, :costoftest => costoftest,
                                                  :limitdet => limitdet,
                                                  :testagents => NTuple{testevery,Array{COVPerson,1}}(NTuple{testevery,Array{COVPerson,1}}([Array{COVPerson,1}() for i in 1:n]))))
    for i in model.testagents
        sizehint!(i,round(Int,n/testevery+100))
    end
    for i in 1:n
        pos = NTuple{2,Float64}(rand(2)*s)
        vel = sincos(2*pi*rand());
        agent = COVPerson(i,
                          (rand(),rand()),
                          vel,
                          rand() < infprob,
                          0,
                          exp(rand()*(log(1e9)-log(1e4)) + log(1e4)),
                          0.0,
                          false,
                          false,
                          0,
                          rand() < .6,
                          i % testevery)
        add_agent_pos!(agent,
                       model)
        push!(model.testagents[agent.testingDay+1], agent)
        
    end
    index!(model)
    return model
end

Thank you - what about the task creation using Threads.@spawn? Your original code indicated that there were channels involved.

Can you share your full code? I still can’t copy and paste it into my REPL to find out whether it deadlocks on my machine as well.

I can’t give you the details of the model parts… but I can give you the rest. Since it deadlocks before it starts running the model, it shouldn’t matter… but you know how “shouldn’t” works with computers :wink:

using Pkg
Pkg.activate(".")
#Pkg.add(["Agents","AgentsPlots","SQLite","Tables"])
using SQLite,Tables,Printf
include("Covidabm.jl")

println("Running a data collection script for Campus ABM using $(Threads.nthreads()) threads")

function sqliteinserter(dfchannel,dbfile)
    
    db = SQLite.DB(dbfile);
    SQLite.drop!(db,"sims";ifexists=true)
    SQLite.drop!(db,"stats";ifexists=true)
    SQLite.createtable!(db,"sims",Tables.Schema((:simid,:n,:density,:infprob,:testevery,:poolsize,
                                                 :quarduration,:costoftest,:limitdet),
                                                (Int,Int,Float64,Float64,Int,Int,Int,Float64,Float64)));
    
    SQLite.createtable!(db,"stats",Tables.Schema((:simid,:day,:ninfected,:nquarantined,:tottestcost),
                                                 (Int,Int,Int,Int,Float64)));
    
    while true
        println("about to grab next results for DB.")
        tup = take!(dfchannel)
        if tup === nothing
            return
        end
        iter,adf,mdf,pars = tup;
        println("Completed iteration $iter, adding to SQLite database");
        cmd = "insert into sims values(" * Printf.@sprintf("%d,%d,%f,%f,%d,%d,%d,%f,%f);",pars...)
        DBInterface.execute(db,cmd);
        
        DBInterface.execute(db,"drop table if exists tempstats;")
        SQLite.load!(mdf,db,"tempstats",temp=true,ifnotexists=true)
        
        DBInterface.execute(db,"insert into stats select $iter,day,ninfected,nquarantined,tottestcost from tempstats;")
        
    end
end


function runasim(simchannel,dfchannel)
    while true
        tup = take!(simchannel)
        println("Got new sim $tup")
        if tup === nothing
            println("Done with thread $(Threads.threadid())")
            return
        end
        iter,sim = tup;
        testmodel = covid_model(;sim...)
        println("About to run model!")
        adf,mdf = run!(testmodel,step_covperson!,stepcovmodel!,365; adata=[],mdata=[:day,:ninfected,:nquarantined,:tottestcost])
        println("Finished sim $tup")
        put!(dfchannel,(iter,adf,mdf,(simid=iter,sim...)))
    end
    
end


"""

runsims runs a set of simulations, n repetitions with each of the
paramsets, and saves the results to the SQLite database given by
dbfile which is a string

"""
function runsims(dbfile,paramsets,n)

    nt = Threads.nthreads()
    dfchannel = Channel(nt)
    dbthr = Threads.@spawn sqliteinserter(dfchannel,dbfile)
    allthrds = [];

    simchannel = Channel(nt)
    
    for i in 1:nt
        t = Threads.@spawn runasim(simchannel,dfchannel)
        push!(allthrds,t)
    end
    
    iter = 0;
    for i in 1:length(paramsets)
        for j in 1:n
            iter += 1;
            put!(simchannel,(iter,paramsets[i]))
        end
    end
    for i in allthrds
        put!(simchannel,nothing)
    end
    for i in allthrds
        wait(i)
    end
    put!(dfchannel,nothing)
    wait(dbthr)
end


const simconfigs = [
(n=10000,density=25/pi,infprob=8e-3,testevery=14,poolsize=1,quarduration=14,costoftest=120.0,limitdet=1000.0),
(n=10000,density=25/pi,infprob=8e-3,testevery=30,poolsize=1,quarduration=14,costoftest=120.0,limitdet=1000.0),
(n=10000,density=25/pi,infprob=8e-3,testevery=60,poolsize=1,quarduration=14,costoftest=120.0,limitdet=1000.0),
(n=10000,density=25/pi,infprob=8e-3,testevery=3,poolsize=20,quarduration=4,costoftest=10.0,limitdet=20000.0),
(n=10000,density=25/pi,infprob=8e-3,testevery=7,poolsize=20,quarduration=8,costoftest=10.0,limitdet=20000.0),
(n=10000,density=25/pi,infprob=8e-3,testevery=10,poolsize=20,quarduration=11,costoftest=10.0,limitdet=20000.0),
(n=10000,density=25/pi,infprob=8e-3,testevery=14,poolsize=20,quarduration=15,costoftest=10.0,limitdet=20000.0),
(n=10000,density=25/pi,infprob=8e-3,testevery=21,poolsize=20,quarduration=15,costoftest=10.0,limitdet=20000.0)
]

runsims("testresults.db",simconfigs,10)



For Covidabm.jl you can have this:

using Agents,AgentsPlots,Distributions,Random

mutable struct COVPerson <: AbstractAgent
    id::Int
    pos::NTuple{2,Float64}
    vel::NTuple{2,Float64}
    infected::Bool
    infectDay::Int
    virconcmax::Float64
    virconc::Float64
    recov::Bool
    quarantined::Bool
    qdays::Int
#    color::Color
    sympt::Bool # whether the person can be symptomatic
    testingDay::Int
end

### some utility functions etc elided

function stepcovmodel!(model)
   # can't give you this code
   return nothing
end


function step_covperson!(agent::COVPerson,model)
   ## can't give you this code
   return nothing
end


function covid_model(;n=10000,density=25/pi,infprob=3e-3,testevery=4,poolsize=20,quarduration=4,costoftest=10,limitdet=10000.0)
    s = sqrt(n/density);
    space2d = ContinuousSpace(2;periodic=true,extend=(s,s))
    model = ABM(COVPerson,space2d,properties=Dict(:poolsize => poolsize,:testevery=>testevery,
                                                  :day=>0,:quarduration => quarduration,
                                                  :ninfected => 0,
                                                  :nquarantined => 0,
                                                  :tottestcost => 0, :costoftest => costoftest,
                                                  :limitdet => limitdet,
                                                  :testagents => NTuple{testevery,Array{COVPerson,1}}(NTuple{testevery,Array{COVPerson,1}}([Array{COVPerson,1}() for i in 1:n]))))
    for i in model.testagents
        sizehint!(i,round(Int,n/testevery+100))
    end
    for i in 1:n
        pos = NTuple{2,Float64}(rand(2)*s)
        vel = sincos(2*pi*rand());
        agent = COVPerson(i,
                          (rand(),rand()),
                          vel,
                          rand() < infprob,
                          0,
                          exp(rand()*(log(1e9)-log(1e4)) + log(1e4)),
                          0.0,
                          false,
                          false,
                          0,
                          rand() < .6,
                          i % testevery)
        add_agent_pos!(agent,
                       model)
        push!(model.testagents[agent.testingDay+1], agent)
        
    end
    index!(model)
    return model
end

function agentcolor(agent)
    if agent.infected
        return("red")
    else
        return("blue")
    end
end


Other than all tasks sharing the same input channel, I don’t see anything obviously wrong with it. What may be happening is that all tasks are running even in the main thread, where usually output of that kind would be printed. If the main put! task doesn’t get scheduled again (though it should after a while, once all other tasks are blocked by take!), you’d have a deadlock. You could prevent this by only starting the worker tasks once you’ve pushed all work into the channels.

I’d either try giving each task its own Channel or using the ensemble! functionality from Agents.jl instead.

Yeah, and remember this worked fine last year. Also we do know that the put happens and the take happens, because inside the worker thread it’s printing out the line “Got new sim …” with the details of what it took from the channel.

I explicitly want the threads to read from the same channel, because some simulations take longer than others and I don’t want to pre-partition the simulations because then some threads can finish and be doing nothing when instead they could be running simulations.

The weird part is that it just locks up… whereas last year all of this ran fine. I suspect it’s some kind of scheduling bug maybe in 1.7?

Unfortunately I can’t test on 1.6.2 because the registry operations take literally eons on earlier versions.

Does this not deadlock for you?

1.7 introduces task migration between OS threads, so I suspect that has something to do with it? See this PR:

https://github.com/JuliaLang/julia/pull/40715

Right, exactly. it seems like maybe a scheduling/migration related problem, but I don’t know where to even begin debugging that. I would be happy to file an issue if only I knew what to file.

Perhaps @jameson since he was the one who was working on that PR?

When I ctrl-c the hung task… I get this backtrace:

signal (2): Interrupt
in expression starting at /home/dlakelan/Consulting/LakelandAppliedSciLLC/Hadi/covidpooltests/notebook/covidabmdatacollection.jl:110
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:847
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:512
poptask at ./task.jl:814
wait at ./task.jl:822 [inlined]
task_done_hook at ./task.jl:531
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2245 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2427
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1787 [inlined]
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:217
start_task at /buildworker/worker/package_linux64/build/src/task.c:889
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:847
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:512
poptask at ./task.jl:814
wait at ./task.jl:822 [inlined]
task_done_hook at ./task.jl:531
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2245 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2427
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1787 [inlined]
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:217
start_task at /buildworker/worker/package_linux64/build/src/task.c:889
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:847
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:512
poptask at ./task.jl:814
wait at ./task.jl:822 [inlined]
task_done_hook at ./task.jl:531
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2245 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2427
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1787 [inlined]
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:217
start_task at /buildworker/worker/package_linux64/build/src/task.c:889
unknown function (ip: (nil))
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:847
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:512
poptask at ./task.jl:814
wait at ./task.jl:822 [inlined]
wait at ./condition.jl:112
take_buffered at ./channels.jl:389
take! at ./channels.jl:383 [inlined]
sqliteinserter at /home/dlakelan/Consulting/LakelandAppliedSciLLC/Hadi/covidpooltests/notebook/covidabmdatacollection.jl:23
#6 at ./threadingconstructs.jl:178
unknown function (ip: 0x7fef1c0e6f0f)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2245 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2427
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1787 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:878
unknown function (ip: (nil))
epoll_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv__io_poll at /workspace/srcdir/libuv/src/unix/linux-core.c:309
uv_run at /workspace/srcdir/libuv/src/unix/core.c:379
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:476
poptask at ./task.jl:814
wait at ./task.jl:822 [inlined]
task_done_hook at ./task.jl:531
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2245 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2427
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1787 [inlined]
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:217
start_task at /buildworker/worker/package_linux64/build/src/task.c:889
unknown function (ip: (nil))
Allocations: 37623172 (Pool: 37610730; Big: 12442); GC: 31

Could it be that the SQLite writer task and all the other tasks are all on the same OS thread, and NOT migrating (even though I’m in 1.7-beta3)… the SQLite writer task is the active task, it does the take! on its channel… but its channel doesn’t have any results yet… so it blocks, but then somehow the scheduler never decides to run the actual simulation tasks, so they don’t complete… and we’re just left doing wait in the main task for threads that are never scheduled?

If so, how can I determine what’s going on in the scheduler and how to debug this?