I was also worried about development time, because deep serialization can produce very unexpected results. Here's a simplified example from this question (Python):
```python
def vectorizer(text, vocab=vocab_dict):
    # return text vectorized using vocab
    ...

rdd.map(vectorizer)
```
At first glance, this code looks absolutely valid: it doesn't refer to any global variables and doesn't even call other functions. Surprisingly, it fails with a weird error: `__init__() takes exactly 3 arguments (2 given)`. It turns out that `vocab_dict` is created only once, on the driver process, and is not copied to the workers.
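The failure mode is easier to see outside Spark. A minimal sketch (with a made-up `vocab_dict`; everything here is illustrative, not the code from the question) showing that a default argument is evaluated once, at definition time, so whatever the worker needs must travel with the serialized function rather than be looked up later:

```python
# Hypothetical vocabulary, standing in for the real vocab_dict.
vocab_dict = {"spark": 0, "julia": 1}

def vectorizer(text, vocab=vocab_dict):
    # Look up each known word's index; unknown words are dropped.
    return [vocab[w] for w in text.split() if w in vocab]

# The default bound to `vocab` was captured when `def` ran; rebinding the
# global afterwards changes nothing the function can see.
vocab_dict = {}
print(vectorizer("spark julia"))  # -> [0, 1], from the original dict
```

On a cluster the situation is worse: the default value lives only in the driver's memory, so a worker deserializing the function without it hits exactly the kind of confusing error shown above.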
Another popular mistake in Python/Scala is to pass methods that implicitly refer to `self`/`this`: the whole object then has to be serialized, dragging along a bunch of possibly unserializable fields. To work around it, a user has to do weird things like this snippet from the official docs:
```python
oldfunc = self.func
batchSize = self.ctx.batchSize

def batched_func(split, iterator):
    return batched(oldfunc(split, iterator), batchSize)

func = batched_func
```
If we don't copy `self.func` and `self.ctx.batchSize` into local variables, PySpark will try to serialize the whole object and fail because some of its fields are unserializable.
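The same trick can be demonstrated with plain `pickle`, no Spark required. A hypothetical `Processor` class (all names below are made up for illustration) carrying an unpicklable lock shows why a bound method fails while a function that captures only the needed values succeeds:

```python
import pickle
import threading
from functools import partial

def add_offset(offset, x):
    # Module-level helper: picklable by qualified name.
    return x + offset

class Processor:
    def __init__(self):
        self.lock = threading.Lock()  # unpicklable field
        self.offset = 10

    def bad_task(self, x):
        # Bound method: pickling it means pickling all of `self`, lock included.
        return x + self.offset

    def good_task(self):
        # Copy the needed field out first, as the snippet above does.
        return partial(add_offset, self.offset)

p = Processor()
try:
    pickle.dumps(p.bad_task)
except TypeError as e:
    print("bad_task:", e)  # cannot pickle the lock inside self

restored = pickle.loads(pickle.dumps(p.good_task()))
print(restored(5))  # -> 15
```

Copying `self.offset` into the `partial` means only a small integer crosses the wire instead of the whole object graph, which is exactly what the PySpark authors were doing by hand.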
And there are many more examples. What's even worse, serialization errors often don't show up during development on a local machine, but only when you deploy to a cluster after several hours of work.
Julia isn't object-oriented in the sense that Python or Scala are, so many of these issues don't apply. But Julia has its own serialization demons. Consider this example:
```julia
inc(x) = x + 1
map(rdd, inc)
```
Just pick `inc()` from above, serialize it, and call it on the workers, right? Wrong. The `inc()` above may be just one of many methods attached to that name, and there's no way to tell which methods are needed until runtime.
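A rough Python analogy (using `functools.singledispatch`; this is not anything Spark.jl actually does, just a sketch of the dispatch problem) makes it concrete: which implementation runs is decided only at call time, so serializing "the function" without all of its registered methods would break on the worker:

```python
from functools import singledispatch

@singledispatch
def inc(x):
    # Fallback implementation, used for numbers here.
    return x + 1

@inc.register
def _(x: str):
    # String-specific method, chosen only when inc() is called with a str.
    return x + "1"

print(inc(41))   # -> 42, numeric fallback picked at runtime
print(inc("4"))  # -> '41', string method picked at runtime
```

Shipping only the fallback to a worker would make `inc("4")` silently behave differently there, and Julia's multiple dispatch multiplies this across every argument position.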
But if there's only one method for `inc()`, we can serialize it, right? Wrong again: `inc()` refers to the overloaded operator `+`, so you have to make sure that all possible methods of `+` are also available on both the driver and the workers.
So in practice, you have to either give the user ways to restrict serialization, as (Py)Spark does, or make them explicitly mark everything that should be available on other processes, as Spark.jl does with the `@attach` macro (and Julia's `ClusterManager` does with `@everywhere`).