How to serialize a Julia function and define that function in another process?


#1

Are there updates following this thread?
https://groups.google.com/forum/#!topic/julia-users/B39vRfhKFxM

My idea is to obtain the lowered or typed LambdaInfo, serialize and send that to the other process, and construct a :method Expr there and eval it. Is there already a way to serialize the LambdaInfo object? It seems the AST is already compressed.

Secondly – it might be stupid to ask – how do I create a SimpleVector? I can’t find its constructor.

Would it be a better/simpler idea if I simply sent the source string over and evaluated it in the other process?
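For reference, a minimal sketch of that "send the source and eval it" approach, assuming a recent Julia where the parallel primitives live in the Distributed standard library (on 0.6 they were in Base, and `Meta.parse` was just `parse`):

```julia
using Distributed
nprocs() == 1 && addprocs(1)
wid = workers()[1]

src = "double(x) = 2x"   # source text of the function to ship

# Parse locally, then evaluate the resulting Expr on the worker.
# The trailing `nothing` keeps the remote call from trying to send
# the newly defined function object back to a process that lacks it.
remotecall_fetch(Core.eval, wid, Main, Expr(:block, Meta.parse(src), nothing))

# The method is now defined in Main on the worker:
remotecall_fetch(Core.eval, wid, Main, :(double(21)))  # ==> 42
```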


#2

In the current circumstances, I’d bet on macro_form from Sugar.jl. It tries to parse the file a function is defined in and extract its definition exactly as written, or otherwise obtains and cleans up the lowered code.


#3

Thanks. This seems nice.

One more question, though: macro_form requires the types argument to be concrete types. I guess this is needed for precompile? But why is precompile necessary? Is there a way to get around it? Determining the concrete types for the function seems a bit challenging to me.

Secondly, in certain cases, I might need to pattern match and slightly revise the function’s AST before sending it to another process. So it might be inevitable for me to be able to serialize the AST. Any suggestions?


#4

If your types are abstract, there might be more than one method conforming to the signature:

julia> methods(log, (Number, Number))
# 3 methods for generic function "log":
log(::Irrational{:e}, x::Number) in Base at irrationals.jl:218
log(b::T, x::T) where T<:Number in Base.Math at math.jl:159
log(b::Number, x::Number) in Base.Math at math.jl:186

which one do you need? Without further information, it’s impossible to decide on the correct method to return. If you really need to go with abstract types, you may try to create an alternative macro_form that returns ASTs for all conforming methods and then serialize them.

So it might be inevitable for me to be able to serialize the AST.

Not sure I understand you, but the expression you get with macro_form is exactly the abstract syntax tree of that function.

For expression pattern matching consider MacroTools.jl or rewrite utils from Espresso.jl.


#5

Thanks, Andrei.

which one do you need? Without further information, it’s impossible to decide on the correct method to return. If you really need to go with abstract types, you may try to create an alternative macro_form that returns ASTs for all conforming methods and then serialize them.

I see your point. I was actually intentionally looking for an exact match, i.e. log(b::Number, x::Number) in Base.Math at math.jl:186 in your example, and would report an error if an exact match can’t be found.

Not sure I understand you, but the expression you get with macro_form is exactly the abstract syntax tree of that function.

I understood that I can get the AST of that function but I was actually asking what would be the best way to serialize that AST. I don’t think serialize works for this case as it can’t serialize Ptr across processes.


#6

Then you can extract all methods and filter them by type with something like this on Julia 0.6:

function exact_signature(m::Method, f::Function, types::Tuple{Vararg{DataType}})
    return m.sig == Tuple{typeof(f), types...}
end

ms = collect(methods(log, (Number, Number)))
exact_signature(ms[1], log, (Number, Number))  # ==> false
exact_signature(ms[2], log, (Number, Number))  # ==> false
exact_signature(ms[3], log, (Number, Number))  # ==> true

I understood that I can get the AST of that function but I was actually asking what would be the best way to serialize that AST. I don’t think serialize works for this case as it can’t serialize Ptr across processes.

As far as I know, Julia doesn’t use pointers to represent any part of its AST. The main building blocks of ASTs are Expr, Symbol, data (e.g. numbers and arrays) and a few special structures like QuoteNode, all of which are perfectly serializable.

The way you deserialize it, however, is important: if your serialized function refers to anything in the current process that you don’t pass along to the other process, evaluating this function will of course fail. When I was working on Spark.jl (which was much earlier than Sugar.jl appeared on the radar), extracting dependencies was exactly the most complicated part. Eventually, I ended up with an explicit macro @attach that copied the passed-in expression to all workers, so that users don’t get confused about what is and what isn’t available on other processes.
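A minimal round trip illustrating that an Expr survives serialization intact (assuming Julia 1.x, where serialize/deserialize live in the Serialization stdlib; on 0.6 they were in Base):

```julia
using Serialization  # stdlib on Julia 1.x

ex = :(plus2(a::Number) = a + 2)   # AST of a small method definition

io = IOBuffer()
serialize(io, ex)    # Expr, Symbol and literals serialize without trouble
seekstart(io)
ex2 = deserialize(io)

ex2 == ex   # ==> true: the AST survives the round trip
eval(ex2)   # defines plus2 in the current module
plus2(40)   # ==> 42
```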


#7

Thanks a lot!

Eventually, I ended up with an explicit macro @attach that copied the passed-in expression to all workers, so that users don’t get confused about what is and what isn’t available on other processes.

Just to be sure I understand it correctly: users would apply @attach to every function that they need in the worker processes, so you can avoid the complicated dependency extraction. This sounds like something I would do, but I am curious how Scala Spark or PySpark deal with this issue. Does Python or Scala somehow make it easier?

After you deserialized that AST, how would you use it to define the function? I can see two approaches: 1) transform it back into a string like "function f(a::Number) a + 2 end"; or 2) create a :method Expr. What would you do? If I go for 2), I can’t yet figure out how to create the SimpleVector that’s needed for creating :method (https://docs.julialang.org/en/stable/devdocs/ast/#lowered-form)


#8

Yes and no. Both Scala and Python try to serialize everything automatically, so you may write e.g. (in Scala):

def myFunc(x: Int): Int = { ... }
...
rdd.map(myFunc)

As you might expect, myFunc is automatically serialized and sent to other processes. What you normally don’t expect, however, is that together with myFunc Scala has to serialize its enclosing object and all of its dependencies. In about 3 years of experience with both Spark and PySpark, I found that serialization errors cause about 1/3 of all errors during development. In most cases, you end up not trying to deliver to the workers what needs to be there, but instead trying to stop (Py)Spark from serializing what isn’t meant to be serialized.

So the @attach macro is just an attempt to stop this craziness: it makes the code a bit lower-level, but avoids this endless fight between the programmer and the library.

Note, though, that I’ve described the design decisions behind one very specific system. You may have other circumstances and make a different decision.

After you deserialized that AST, how would you use it to define the function?

Julia’s eval() is intended to evaluate expressions:

ex = deserialize(io)
eval(ex)

or, if you want to evaluate the expression in the context of a specific module:

ex = deserialize(io)
eval(target_module, ex)
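On Julia 1.x the two-argument form moved to Core.eval, so the same idea, sketched as a full round trip, would look like:

```julia
using Serialization  # serialize/deserialize were in Base on Julia 0.6

io = IOBuffer()
serialize(io, :(triple(x) = 3x))
seekstart(io)

ex = deserialize(io)
Core.eval(Main, ex)   # replaces eval(target_module, ex) from Julia 0.6

triple(14)  # ==> 42
```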

I can’t yet figure out how to create the SimpleVector

You shouldn’t need to. AFAIK, SimpleVector is a container used at a lower level (e.g. during Julia’s bootstrap) and isn’t really intended for external use. If you really encounter lowered code (as opposed to code found by macro_form in source files), I’d suggest replacing all occurrences of SimpleVector with a normal Vector.


#9

Thank you! I made a mistake earlier and thought the AST contains only the method body, not the header.

I knew that Python and Scala would create a closure that captures all dependent values but I didn’t know that they capture the function dependencies too.

I guess in order for the UDF to call functions from another package, you will need a mechanism for users to tell your Spark workers to explicitly load a package – for example, applying the @attach macro to an import statement, etc.

Regarding trying to stop (Py)Spark from serializing what isn’t meant to be serialized: is this because you’re worried about the runtime cost of serialization, or something else (like namespace collisions)?


#10

We can probably work around this! But I’m currently not super motivated to, since I still use Sugar mainly for Transpiler.jl, which can only transpile functions with concrete types…
Feel free to improve the behaviour for abstract types :slight_smile:


#11

I was worried about development time, because deep serialization may produce very unexpected results. Here’s a simplified example from this question (Python):

def vectorizer(text, vocab=vocab_dict):
    # return text vectorized using vocab
    ...

rdd.map(vectorizer)

At first glance, this code looks absolutely valid: it doesn’t refer to any global variables and doesn’t even call other functions. Surprisingly, it fails with a weird error: __init__() takes exactly 3 arguments (2 given). It turns out that vocab_dict is created only once, on the driver process, and is not copied to the workers.

Another popular mistake in Python/Scala is to pass methods that implicitly refer to self/this, so the whole object has to be serialized, bringing along a bunch of possibly unserializable fields. To work around it, a user has to do weird things like this (from the official docs):

oldfunc = self.func
batchSize = self.ctx.batchSize

def batched_func(split, iterator):
    return batched(oldfunc(split, iterator), batchSize)

func = batched_func

If we don’t copy self.func and self.ctx.batchSize to local variables, PySpark will try to serialize the whole object and fail because some fields are unserializable.

And there are many more examples. What’s even worse, serialization errors often don’t appear during development on a local machine, but only when you deploy to a cluster after several hours of work.

Julia isn’t object-oriented in the sense that Python or Scala are, so many of these issues aren’t relevant. But Julia has its own serialization demons. Consider this example:

inc(x) = x + 1

map(rdd, inc)

Just pick inc() from above, serialize it and call it on the workers, right? Wrong. inc() from above may be just one of many methods, and there’s no way to tell which methods are needed until runtime.

But if there’s only one method for inc(), we can serialize it, right? Wrong again: inc() refers to the overloaded operator +, so you have to check that all possible methods of + are also available on both the driver and the workers.

So in practice, you have to either provide the user with ways to restrict serialization, as (Py)Spark does, or explicitly mark everything that the user wants available on other processes, as Spark.jl does with the @attach macro (and Julia itself does with @everywhere).
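For completeness, the built-in analogue of that explicit marking looks like this with the Distributed stdlib on Julia 1.x (@attach itself is Spark.jl-specific, so this only sketches the idea):

```julia
using Distributed
nprocs() == 1 && addprocs(1)

# @everywhere evaluates the definition on every process, so the method
# is explicitly available wherever it may be called.
@everywhere inc(x) = x + 1

remotecall_fetch(inc, workers()[1], 41)  # ==> 42
```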


#12

Thanks a lot for your detailed explanation!