PyCall error on LSF cluster

Dear all,
I am having an issue importing a Python package (with PyCall) on my university's LSF cluster.

I want to perform an MCMC analysis using Turing on the cluster. Turing works fine on my local machine, and at first it also ran on the cluster without problems (with a very low number of MCMC steps, e.g. 10 steps).
However, when I request more than ~20 steps, one of the workers reports the following error:

┌ Error: Fatal error on process 4
│   exception =
│    PyError ($(Expr(:escape, :(ccall(#= /home/mbonici/.julia/packages/PyCall/BD546/src/pyfncall.jl:43 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'TypeError'>
│    TypeError("can't pickle traceback objects",)
│    Stacktrace:
│      [1] pyerr_check
│        @ ~/.julia/packages/PyCall/BD546/src/exception.jl:62 [inlined]
│      [2] pyerr_check
│        @ ~/.julia/packages/PyCall/BD546/src/exception.jl:66 [inlined]
│      [3] _handle_error(msg::String)
│        @ PyCall ~/.julia/packages/PyCall/BD546/src/exception.jl:83
│      [4] macro expansion
│        @ ~/.julia/packages/PyCall/BD546/src/exception.jl:97 [inlined]
│      [5] #107
│        @ ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:43 [inlined]
│      [6] disable_sigint
│        @ ./c.jl:458 [inlined]
│      [7] __pycall!
│        @ ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:42 [inlined]
│      [8] _pycall!(ret::PyCall.PyObject, o::PyCall.PyObject, args::Tuple{PyCall.PyObject}, nargs::Int64, kw::Ptr{Nothing})
│        @ PyCall ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:29
│      [9] _pycall!
│        @ ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:11 [inlined]
│     [10] #pycall#112
│        @ ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:80 [inlined]
│     [11] pycall
│        @ ~/.julia/packages/PyCall/BD546/src/pyfncall.jl:80 [inlined]
│     [12] serialize(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, pyo::PyCall.PyObject)
│        @ PyCall ~/.julia/packages/PyCall/BD546/src/serialize.jl:14
│     [13] serialize_any(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, x::Any)
│        @ Serialization /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:657
│     [14] serialize(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, x::Any)
│        @ Serialization /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:636
│     [15] serialize(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, ex::CapturedException)
│        @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/clusterserialize.jl:192
│     [16] serialize_any(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, x::Any)
│        @ Serialization /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:657
│     [17] serialize(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, x::Any)
│        @ Serialization /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Serialization/src/Serialization.jl:636
│     [18] serialize_msg(s::Distributed.ClusterSerializer{Sockets.TCPSocket}, o::Distributed.ResultMsg)
│        @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:78
│     [19] #invokelatest#2
│        @ ./essentials.jl:708 [inlined]
│     [20] invokelatest
│        @ ./essentials.jl:706 [inlined]
│     [21] send_msg_(w::Distributed.Worker, header::Distributed.MsgHeader, msg::Distributed.ResultMsg, now::Bool)
│        @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:174
│     [22] send_msg_now
│        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:118 [inlined]
│     [23] send_msg_now(s::Sockets.TCPSocket, header::Distributed.MsgHeader, msg::Distributed.ResultMsg)
│        @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/messages.jl:113
│     [24] deliver_result(sock::Sockets.TCPSocket, msg::Symbol, oid::Distributed.RRID, value::RemoteException)
│        @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:95
│     [25] macro expansion
│        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:286 [inlined]
│     [26] (::Distributed.var"#105#107"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
│        @ Distributed ./task.jl:411
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:99

What I find peculiar is that this error is not thrown at the very beginning, and that it is thrown by only a single worker. When I run the first ~10 steps, I have no problem at all. Then, when I run the chains with more steps, one of the processes gives this error (the other ones don’t throw any error).
I don’t know what the problem is.
I have been able to run an example from Turing on the cluster without problems (using both NUTS and MH).

What could be the origin of the problem? I am using a Python wrapper of a C code (CLASS, the Cosmic Linear Anisotropy Solving System; GitHub repository lesgourg/class_public). My cluster is an LSF cluster. The error seems to come from PyCall, but I don’t know how to handle it.

I can give more details on my code, of course. I didn’t post anything since I didn’t know which part would be most relevant (the MCMC part, the inclusion of the Python code, etc.).
Thank you for your time,
Marco

It appears that a Python exception has occurred and is being serialised (presumably to send it to the master node), and PyCall uses pickle to serialise it, but pickle does not support tracebacks.
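The pickle limitation can be reproduced in plain Python, independently of PyCall and Julia. The following is only a minimal sketch; the helper name `capture_exc_info` and the `ValueError("boom")` are made up for illustration and have nothing to do with the code in the question.

```python
import pickle
import sys


def capture_exc_info():
    """Raise and catch an error, returning (type, value, traceback).

    Hypothetical helper, used only to obtain a real traceback object.
    """
    try:
        raise ValueError("boom")
    except ValueError:
        return sys.exc_info()


exc_type, exc_value, exc_tb = capture_exc_info()

# The exception instance itself pickles without trouble.
restored = pickle.loads(pickle.dumps(exc_value))
print(type(restored).__name__, restored.args)

# The traceback object does not: pickle raises a TypeError.
try:
    pickle.dumps(exc_tb)
except TypeError as err:
    print("pickling the traceback failed:", err)
```

The exact wording of the `TypeError` message depends on the Python version, so it may not match the one in the log above character for character.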

One option is to use tblib (available on PyPI) to allow pickling tracebacks.

Another option is to modify PyCall to skip the traceback when serialising a PyError.
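To illustrate the idea of skipping the traceback, here is a rough Python-side sketch. It is not PyCall's code and not a patch for it; `picklable_error` is a hypothetical helper that keeps the exception type, the message, and the traceback rendered as plain text, so that nothing unpicklable is left.

```python
import pickle
import traceback


def picklable_error(exc):
    """Return a picklable summary of exc (hypothetical helper).

    The traceback object is replaced by its formatted text.
    """
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


try:
    {}["missing"]  # deliberately trigger a KeyError
except KeyError as exc:
    payload = pickle.dumps(picklable_error(exc))

# The receiving side gets readable text instead of a live traceback object.
summary = pickle.loads(payload)
print(summary["type"])
print(summary["traceback"])
```

The trade-off is that the receiver can no longer inspect frames or re-raise with the original traceback; it only gets the text, which is usually enough for a log message.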

BTW PythonCall will have the same issue; I’ll open a ticket to fix it.
