Cross-posting from a GitHub issue.
Originally, Spark.jl was developed for Spark 1.x with its RDD interface, e.g. the most useful functions were `map`, `filter`, etc. In Spark 2.0, the default interface changed to `Dataset` (`DataFrame` in PySpark). And now, with Spark 3.0, I meet more and more people who don't even know about the RDD interface.

Another part of this equation is that our implementation of the RDD API is quite unstable due to complicated interprocess communication. Compare the steps for running a `map` function over an RDD (see the sketch after the list):
- Start the Julia program and instantiate the corresponding driver in the JVM.
- Serialize the Julia function and data, and send them to the Spark workers (JVM).
- In each worker, start a new Julia process, connect to it, and pass the serialized function and data.
- Run the function in the worker's Julia process, serialize the result, and send it back to the JVM.
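For concreteness, here is roughly what this looks like from the user's side. The names below (`SparkContext`, `parallelize`, `map`, `collect`) follow the historical Spark.jl RDD interface only approximately and should be read as an illustrative sketch, not exact API:

```julia
# Approximate sketch of the RDD-style workflow; treat the names as
# illustrative of the old Spark.jl RDD interface, not as exact API.
using Spark
Spark.init()                          # bring up the JVM-side driver

sc  = SparkContext(master="local")    # step 1: Julia program + JVM driver
rdd = parallelize(sc, 1:10)

# steps 2-3: this closure must be serialized, shipped to the JVM workers,
# and handed to a freshly started Julia process on each worker
doubled = map(rdd, x -> 2x)

collect(doubled)                      # step 4: result travels back through the JVM to Julia
```

Every hop in that round trip is custom interprocess plumbing that we maintain ourselves.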
We have to implement and support each of these communication steps, as well as handle all possible errors and use cases. Function serialization is a disaster. Different versions of Julia on the driver and the workers are not covered at all. In comparison, the DataFrame API consists of only these steps (see the JavaCall sketch after the list):
- Start the Julia program and instantiate the corresponding driver in the JVM.
- Call JVM functions via JavaCall.
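To make the contrast concrete, here is a minimal JavaCall sketch. `JavaCall.init`, `@jimport`, and `jcall` are the actual JavaCall.jl primitives; the Spark-specific call in the trailing comment is hypothetical and only shows the pattern a DataFrame wrapper follows:

```julia
# Minimal sketch of "call JVM functions via JavaCall": pure method
# forwarding into the JVM, no Julia code is ever shipped to the workers.
using JavaCall
JavaCall.init(["-Xmx2G"])

JMath = @jimport java.lang.Math
jcall(JMath, "abs", jdouble, (jdouble,), -1.5)   # returns 1.5

# A DataFrame method in Spark.jl reduces to the same pattern, e.g.
# (hypothetical) jcall(jdf, "count", jlong, ()) on a wrapped Dataset.
```

No Julia closures or data need to cross over to the workers on this path.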
The only case where we may need to serialize Julia functions is UDFs, which are much more controllable than arbitrary functions.
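If we do go there, a Julia UDF could look roughly like the purely hypothetical sketch below; `register_udf` is a made-up name, not an existing Spark.jl function, but it shows how narrow the serialization surface becomes:

```julia
# Purely hypothetical UDF sketch: only this one explicitly registered
# function would ever need to cross the Julia/JVM boundary.
double_it  = register_udf(spark, "double_it", x -> 2x)   # hypothetical API
doubled_df = select(df, double_it(col("value")))         # hypothetical API
```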
So my question to Spark / Spark.jl users is: do you actually need the RDD API?