Revise.jl with Distributed and pmap -- why does this behavior not work?

# ModuleA.jl
module ModuleA

using Distributed

function foo()
    # a = 5
    r = pmap((x -> myid()), fill(1, 10))
end
export foo

end

First I run this:

using Distributed

addprocs(4)
@everywhere push!(LOAD_PATH, pwd())
@everywhere using ModuleA

foo() # with "a = 5" commented out in ModuleA

As expected, I get something that looks like this:

10-element Array{Int64,1}:
 2
 3
 5
 4
 2
 2
 2
 2
 2
 2

Then I go into ModuleA.jl and uncomment the line a = 5.
That gives me this:

julia> foo()
ERROR: On worker 2:
UndefVarError: ##5#6 not defined
deserialize_datatype at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:1115
handle_deserialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:771
deserialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:731
handle_deserialize at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:778
deserialize_msg at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Serialization/src/Serialization.jl:731
#invokelatest#1 at ./essentials.jl:742 [inlined]
invokelatest at ./essentials.jl:741 [inlined]
message_handler_loop at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:160
process_tcp_streams at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/process_messages.jl:117
#105 at ./task.jl:259
Stacktrace:
 [1] (::getfield(Base, Symbol("##696#698")))(::Task) at ./asyncmap.jl:178
 [2] foreach(::getfield(Base, Symbol("##696#698")), ::Array{Any,1}) at ./abstractarray.jl:1866
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Array{Int64,1}) at ./asyncmap.jl:178
 [4] #async_usemap#681 at ./asyncmap.jl:154 [inlined]
 [5] #async_usemap at ./none:0 [inlined]
 [6] #asyncmap#680 at ./asyncmap.jl:81 [inlined]
 [7] #asyncmap at ./none:0 [inlined]
 [8] #pmap#215(::Bool, ::Int64, ::Nothing, ::Array{Any,1}, ::Nothing, ::Function, ::Function, ::WorkerPool, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/pmap.jl:126
 [9] pmap(::Function, ::WorkerPool, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/pmap.jl:101
 [10] #pmap#225(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/pmap.jl:156
 [11] pmap at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Distributed/src/pmap.jl:156 [inlined]
 [12] foo() at /home/at/pantheon/scratchspace/julia-revise-parallel/ModuleA.jl:7
 [13] top-level scope at none:0

What’s going on here? I have read this part of the Revise.jl documentation, but it doesn’t seem to apply here. I think maybe I am misunderstanding something fundamental about how Revise.jl is meant to work in this setting.

My versions:

(v1.1) pkg> status Revise
    Status `~/.julia/environments/v1.1/Project.toml`
  [295af30f] Revise v2.4.0

Kudos for finding that documentation section. From your results, it looks like it needs to be a bit more expansive; it seems that any usage of an anonymous function by a function you modify is problematic in a distributed context.

You are using a fairly old version of Revise, but I don’t think there have been any changes since then that should affect this issue.

Here’s a brief explanation of why this result makes sense, even though it’s not desirable. Each function has to have a type in Julia. Normally this type is defined in terms of the function name. An anonymous function doesn’t have a name associated with it (by definition), so Julia makes up a name for it using the gensym mechanism. To ensure that names are never repeated, gensym works by incrementing a counter. Interestingly, there’s a separate counter on each worker; the consequence is that the types, in terms of their names, are not synchronized across workers. When you redefine the method, Revise tries to delete any old methods associated with the old gensym name and create new types. The anonymous function is then sent to the workers under a type name they have not defined, which is the UndefVarError you see during deserialization.
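
A minimal single-process sketch of the naming behavior described above. The names `f`, `g`, and `named` are illustrative only, and the generated type names depend on the Julia version and on what has already run in the session, so they are not shown as expected output.

```julia
# Two anonymous functions with identical bodies still get distinct,
# automatically generated types.
f = x -> x + 1
g = x -> x + 1

# A named function's type is derived from its name instead.
named(x) = x + 1

println(typeof(f))                 # a generated name; varies by version and session
println(typeof(g))                 # a different generated name
println(typeof(f) == typeof(g))    # false
println(nameof(named))             # named
```

Since each process generates these names on its own, there is no guarantee that a name produced after a redefinition on the master matches anything a worker has defined.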

The advice in that documentation should still fix the problem; if you can define/use a named function, the error should go away.


Thank you for the clear explanation.

I tried this as a solution, but it did not work:

module ModuleA

using Distributed

function foo()
    a = 5
    returnid = function(x)
        myid()
    end
    r = pmap(returnid, fill(1, 10))
end
export foo

end

Is returnid still an “anonymous” function because it is defined inside foo?
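
For reference, `function (x) ... end` without a name is the long form of Julia's anonymous function syntax, so binding it to a local variable called `returnid` does not give it a named type. A small single-process sketch, where the helper names `make_returnid` and `named_returnid` are made up for illustration:

```julia
# `function (x) ... end` with no name is the long form of `x -> ...`.
function make_returnid()
    returnid = function (x)
        x
    end
    return returnid
end

# A top-level definition with a name, for comparison.
named_returnid(x) = x

println(nameof(named_returnid))      # named_returnid
println(typeof(make_returnid()))     # a generated type name, not based on `returnid`
```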

Then I tried this, which, as you suggested, works as expected.

module ModuleA

using Distributed

function foo()
    a = 5
    r = pmap(returnid, fill(1, 10))
end
export foo

function returnid(x)
    myid()
end

end

I guess the annoying thing for my purposes is that my primary use case is where I have a function that takes many arguments, but I only want to change one of them (the seed for the random number generator) on each pmap call.

What’s the best solution for a case like this? Do I need to pass pmap an array where each element includes not just the varying seed but also the invariant other arguments?

For example, this is closer to what I would like to do:

module ModuleA

using Distributed

function foo()
    y = # something complicated
    f = (x -> my_complicated_function(x, y))
    r = pmap(f, fill(1, 10))
end
export foo

function my_complicated_function(x, y)
    # do complicated stuff that depends on both x and y
end

end

(In practice, my_complicated_function would be defined in a separate module.)
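
One possible way to fix the extra argument without creating a closure inside `foo` is a callable struct defined at top level, or the built-in `Base.Fix2`, since both have explicitly named types. This is only a sketch and has not been tested against Revise in this thread. `WithFixedArg`, `foo_fix2`, the body of `my_complicated_function`, and the value of `y` are placeholders. It runs in a single process as written; with workers, the definitions would need to be loaded everywhere (for example via `@everywhere using ModuleA`).

```julia
using Distributed

# Hypothetical stand-in for the real two-argument function.
my_complicated_function(x, y) = x + y

# A callable struct defined at top level has an explicit type name,
# unlike a closure created inside `foo`.
struct WithFixedArg{T}
    y::T
end
(w::WithFixedArg)(x) = my_complicated_function(x, w.y)

function foo()
    y = 100          # stands in for "something complicated"
    seeds = 1:10     # the one argument that varies per call
    return pmap(WithFixedArg(y), seeds)
end

# Base.Fix2 is a ready-made callable struct that fixes the second argument.
function foo_fix2()
    y = 100
    return pmap(Base.Fix2(my_complicated_function, y), 1:10)
end
```

Both variants send the invariant argument along with the callable, so the input collection only has to hold the seeds.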

I understand your annoyance; it annoys me too, even though I’m a pretty infrequent user of Distributed.

It’s possible it can be improved further; if you’re interested in giving it a whirl, my most recent commit that attempted to make this better is “Ensure anonymous functions synced on workers (Julia 1.3+)” (#402, timholy/Revise.jl@63ff66f). That should give you some hints about where to start debugging.


I don’t think I have the background to contribute to this, unfortunately.
So I will just make do with this (relatively minor) annoyance using the workaround above.
Thanks again for the help!

I’m experiencing exactly the same issue. It’s an unfortunate one to track down because there’s so much going on across remote workers and states that it took a while for me to realize that Revise could be the culprit. Could Revise somehow warn about this being a Revise-based issue when it occurs?

Also, since this post, has there been any change in Revise on the core issue? I have the same need to run a function like @torgo’s

    y = # something complicated
    f = (x -> my_complicated_function(x, y))

I don’t think so; anonymous functions are just plain hard. What if the name assigned in one process has already been assigned to something else in another process?

However, it occurs to me that perhaps Revise could intervene and control the name assigned to the anonymous function. This is essentially “taking over” lowering for anonymous functions. If someone files an MWE with Revise, I can add it as a broken test to act as an incentive to fix this.
