How to solve error while running parallel sampling in Julia with Turing and Distributed package?

Hi,
I am trying to run samplers in parallel using the example code from the Turing documentation (Parallel Sampling). When I run it, I get an error. The example is as follows:

using Distributed
addprocs(4)
@everywhere using Turing, Random

# Set the progress to false to avoid too many notifications.
@everywhere turnprogress(false)

# Set a different seed for each process.
for i in procs()
    @fetch @spawnat i Random.seed!(rand(Int64))
end

# Define the model using @everywhere.
@everywhere @model gdemo(x) = begin
    s ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s))
    for i in eachindex(x)
        x[i] ~ Normal(m, sqrt(s))
    end
end

# Sampling setup.
num_chains = 4
sampler = NUTS(0.65)
model = gdemo([1.2, 3.5])

# Run all samples.
chns = reduce(chainscat, pmap(x->sample(model, sampler, 1000), 1:num_chains))

and the error is as follows:

On worker 2:
UndefVarError: model not defined
#21 at D:\JULIA\TURING_LSPECT\test_parallel.jl:28
#108 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\Distributed\src\process_messages.jl:294
run_work_thunk at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\Distributed\src\process_messages.jl:79
macro expansion at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\Distributed\src\process_messages.jl:294 [inlined]
#107 at .\task.jl:333
(::Base.var"#732#734")(::Task) at asyncmap.jl:178
foreach(::Base.var"#732#734", ::Array{Any,1}) at abstractarray.jl:1920
maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::UnitRange{Int64}) at asyncmap.jl:178
wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Distributed.var"#208#211"{WorkerPool}, ::Function, ::UnitRange{Int64}) at asyncmap.jl:154
#async_usemap#717(::Function, ::Nothing, ::typeof(Base.async_usemap), ::Distributed.var"#192#194"{Distributed.var"#192#193#195"{WorkerPool,var"#21#22"}}, ::UnitRange{Int64}) at asyncmap.jl:103
(::Base.var"#kw##async_usemap")(::NamedTuple{(:ntasks, :batch_size),Tuple{Distributed.var"#208#211"{WorkerPool},Nothing}}, ::typeof(Base.async_usemap), ::Function, ::UnitRange{Int64}) at none:0
#asyncmap#716 at asyncmap.jl:81 [inlined]
#asyncmap at none:0 [inlined]
#pmap#207(::Bool, ::Int64, ::Nothing, ::Array{Any,1}, ::Nothing, ::typeof(pmap), ::Function, ::WorkerPool, ::UnitRange{Int64}) at pmap.jl:126
pmap(::Function, ::WorkerPool, ::UnitRange{Int64}) at pmap.jl:101
#pmap#217(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(pmap), ::Function, ::UnitRange{Int64}) at pmap.jl:156
pmap(::Function, ::UnitRange{Int64}) at pmap.jl:156
top-level scope at test_parallel.jl:28

How can I solve this error? My Julia version is 1.3 and my Turing version is v0.8.3.

Thanks in advance!
Manu

Hi!

This works. The UndefVarError happens because model and sampler are defined only on the master process, so the workers that pmap dispatches to cannot find them; wrapping everything in an @everywhere block defines it on every process. I also changed @fetch @spawnat to fetch(@spawnat ...):

using Distributed
addprocs(4)
@everywhere using Turing, Random

# Set a different seed for each process.
for i in procs()
    fetch(@spawnat i Random.seed!(rand(Int64)))
end

@everywhere begin
    # Set the progress to false to avoid too many notifications.
    turnprogress(false)

    # Define the model.
    @model gdemo(x) = begin
        s ~ InverseGamma(2, 3)
        m ~ Normal(0, sqrt(s))
        for i in eachindex(x)
            x[i] ~ Normal(m, sqrt(s))
        end
    end

    # Sampling setup.
    alg = NUTS(0.65)
    model = gdemo([1.2, 3.5])
    chn = sample(model, alg, 100, save_state = false)
end

chns = mapreduce(chainscat, procs()) do i
    fetch(@spawnat i chn)
end
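
Alternatively, once the @everywhere block has defined model and alg on every worker, the pmap pattern from your original script should work too. A minimal sketch, assuming num_chains = 4 as in your script (though see the note below about closure serialization being flaky the first time):

# The closure refers to the globals model and alg, which the @everywhere
# block above has already defined on every worker process.
num_chains = 4
chns = reduce(chainscat, pmap(_ -> sample(model, alg, 1000), 1:num_chains))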

Note that with addprocs(4) you actually have 5 processes running: the master plus four workers, so the mapreduce over procs() above concatenates 5 chains.
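
To see the difference, a quick sketch:

using Distributed
addprocs(4)

procs()    # returns [1, 2, 3, 4, 5]: the master process (id 1) plus the 4 workers
workers()  # returns [2, 3, 4, 5]: the added worker processes only

If you want exactly four chains, reduce over workers() instead of procs().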

@cpfiffer it seems that save_state = true leads to a similar serialization error as in Loading saved Chain object fails · Issue #1091 · TuringLang/Turing.jl · GitHub. pmap with closures, and even @spawnat with local variables that should get sent over, seem to cause issues the first time I run them, but they work the second time. This seems like a Julia issue.


I’ll just open a PR to disable the default state saving. It’s causing an awful lot of trouble.

@mohamed82008: Thanks a lot for your reply. This solution works most of the time, but sometimes, when I run the following section of the above code, I get an error.

for i in procs()
    fetch(@spawnat i Random.seed!(rand(Int64)))
end

The error is as follows.

On worker 2:
DomainError with -4990743955205970348:
`n` must be non-negative.
make_seed at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\Random\src\RNGs.jl:266
seed! at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\Random\src\RNGs.jl:289 [inlined]
#9 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\Distributed\src\macros.jl:87
#105 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\Distributed\src\process_messages.jl:290
run_work_thunk at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\Distributed\src\process_messages.jl:79
run_work_thunk at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\Distributed\src\process_messages.jl:88
#98 at .\task.jl:333
#remotecall_fetch#145 at remotecall.jl:390 [inlined]
remotecall_fetch(::Function, ::Distributed.Worker, ::Distributed.RRID) at remotecall.jl:382
#remotecall_fetch#148 at remotecall.jl:417 [inlined]
remotecall_fetch at remotecall.jl:417 [inlined]
call_on_owner at remotecall.jl:490 [inlined]
fetch(::Future) at remotecall.jl:529
top-level scope at test_parallel.jl:7

Thanks again for spending your valuable time on my issue.

Thanking you,
Manu

Yes, that's a bug in the docs: use abs(rand(Int)), since Random.seed! requires a non-negative seed and rand(Int) can be negative. And while you are at it, try making a PR to fix the docs.
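
For reference, the corrected seeding loop would be:

# Seed each process with a non-negative integer to avoid the DomainError.
for i in procs()
    fetch(@spawnat i Random.seed!(abs(rand(Int))))
end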


@mohamed82008: Now everything works fine… thank you :slightly_smiling_face:
