Dear all,
I am having an issue using a Python package (through PyCall) on the LSF cluster of my university.
I want to perform an MCMC analysis using Turing on the cluster. Turing works fine on my local machine, and at first it also ran on the cluster without problems when I used a very low number of MCMC steps (e.g. 10 steps).
However, when I request more than ~20 steps, one of the workers reports the following error:
┌ Error: Fatal error on process 4
│ exception =
│ PyError ($(Expr(:escape, :(ccall(#= /home/mbonici/.julia/packages/PyCall/BD546/src/pyfncall.jl:43 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'TypeError'>
│ TypeError("can't pickle traceback objects",)
│ Stacktrace:
│ [1] pyerr_check
│ @ ~/.julia/packages/PyCall/BD546/src/exception.jl:62 [inlined]
│ [2] pyerr_check
│ @ ~/.julia/packages/PyCall/BD546/src/exception.jl:66 [inlined]
│ [3] _handle_error(msg::String)
│ @ PyCall ~/.julia/packages/PyCall/BD546/src/exception.jl:83
│ [4] macro expansion
│ @ ~/.julia/packages/PyCall/BD546/src/exception.jl:97 [inlined]
│ [5] #107
│ @ ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:43 [inlined]
│ [6] disable_sigint
│ @ ./c.jl:458 [inlined]
│ [7] __pycall!
│ @ ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:42 [inlined]
│ [8] _pycall!(ret::PyCall.PyObject, o::PyCall.PyObject, args::Tuple{PyCall.PyObject}, nargs::Int64, kw::Ptr{Nothing})
│ @ PyCall ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:29
│ [9] _pycall!
│ @ ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:11 [inlined]
│ [10] #pycall#112
│ @ ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:80 [inlined]
│ [11] pycall
│ @ ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:80 [inlined]
│ [12] serialize(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, pyo::PyCall.PyObject)
│ @ PyCall ~/.julia/packages/PyCall/BD546/src/serialize.jl:14
│ [13] serialize_any(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, x::Any)
│ @ Serialization /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:657
│ [14] serialize(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, x::Any)
│ @ Serialization /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:636
│ [15] serialize(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, ex::CapturedException)
│ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/clusterserialize.jl:192
│ [16] serialize_any(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, x::Any)
│ @ Serialization /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:657
│ [17] serialize(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, x::Any)
│ @ Serialization /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:636
│ [18] serialize_msg(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, o::Distributed.ResultMsg)
│ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:78
│ [19] #invokelatest#2
│ @ ./essentials.jl:708 [inlined]
│ [20] invokelatest
│ @ ./essentials.jl:706 [inlined]
│ [21] send_msg_(w::Distributed.Worker, header::Distributed.MsgHeader, msg::Distributed.ResultMsg, now::Bool)
│ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:174
│ [22] send_msg_now
│ @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:118 [inlined]
│ [23] send_msg_now(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:113
│ [24] deliver_result(sock::Sockets.TCPSocket, msg::Symbol, oid::Distributed.RRID, value::RemoteException)
│ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:95
│ [25] macro expansion
│ @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:286 [inlined]
│ [26] (::Distributed.var"#105#107"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
│ @ Distributed ./task.jl:411
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:99
What I find peculiar is that the error is not thrown at the very beginning, and that it is thrown by a single worker only. When I run the first ~10 steps, I have no problem at all. When I run the chains with more steps, one of the processes gives this error (the others don't throw any error).
I don't know what the problem is.
I have been able to run an example from Turing on the cluster without problems (using both NUTS and MH).
What could be the origin of the problem? I am using a Python wrapper of a C code (https://github.com/lesgourg/class_public). My cluster is an LSF cluster. The error seems to come from PyCall, but I don't know how to handle it.
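From the trace, the `TypeError` seems to be raised while a `CapturedException` holding a `PyObject` is serialized back to the main process. Below is a minimal Python-only sketch of that message. It assumes PyCall hands the object to Python's `pickle`, which is my guess from the message and not something I have checked. `failing_step` is a hypothetical stand-in for my wrapped call.

```python
import pickle
import traceback


def failing_step():
    # Hypothetical stand-in for a call into the wrapped C code that fails.
    raise ValueError("hypothetical failure inside the wrapped code")


try:
    failing_step()
except ValueError as exc:
    tb = exc.__traceback__

    # Pickling the traceback object itself is rejected by Python.
    try:
        pickle.dumps(tb)
        traceback_pickled = True
    except TypeError:
        traceback_pickled = False

    # A plain string carrying the same information pickles without trouble.
    text = "".join(traceback.format_exception(type(exc), exc, tb))
    payload = pickle.dumps(text)

print("traceback object pickled:", traceback_pickled)
print(pickle.loads(payload))
```

If this is what happens on the cluster, the `TypeError` would only be hiding some other exception raised on that worker, so printing the error on the worker itself might show the actual cause.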
I can give more details on my code, of course. I haven't posted any yet because I don't know which part is most relevant (the MCMC part, the inclusion of the Python code, etc.).
Thank you for your time,
Marco