I need some help on answering a question I was posed:
Assuming an algorithm is written in R or Python with some attempt to leverage the optimizations available in a Spark environment (parallelization etc.), would you expect to see more, the same, or less benefit from using Julia instead of R or Python in that Spark environment?
My initial instinct would be to say more: Julia can also leverage Spark tooling through packages like Spark.jl or SparkSQL.jl, so the queries that need to be done would leverage Spark while the analysis would benefit from Julia's performance. Of course, it also depends on how I write the algorithm in Julia and on the amounts of data I'd be crunching, I assume. But is there a more nuanced answer? What am I missing? Any additional thoughts? Thanks!
Are you asking about the relative speed gain of Julia on spark vs serial Julia compared to the speed gain in other languages? Or about the relative speed of the final spark implementations across languages?
doesn’t this mean everything is executed through Spark? so Julia won’t speed things up?
it’s like how using SQL.jl doesn’t make your SQL query faster, because the query is executed by the database engine
It’s a bit tricky! Since being asked the question I have been trying to think it through too. To concretely answer your question about which question to answer: I am asking about the relative speed of the final Spark implementations across languages. Honestly, I think the best answer here might be “it depends”, but I was wondering if I may be missing anything else in my not-so-nuanced answer so far. Does that help, Nils?
Ah alright, that’s what I was somewhat intuitively thinking – the performance would come more from utilizing Spark itself than simply from the fact that Julia is embedded on a platform running Spark. But yeah, in a convo with @mkitti, we concluded loosely that, all things being equal (i.e. Spark is leveraged by Julia exactly the same as by R or Python), Julia has historically been faster/more performant when analyzing large datasets. So, with caveats of course, one could reasonably expect to see a bit more of a performance boost when using Julia in such an environment.
The main performance gain I have seen with Julia for distributed computing is the ability to take advantage of improved intranode communication via threads rather than interprocess communication. Julia’s multilevel parallelism offers you a chance to exploit the locality of the parallelization in a way that may be difficult in R or Python. The advantage also depends on how much of the processing happens in native code versus in the scripting language, especially if one employs precompilation and ahead-of-time compilation in Julia.
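To illustrate the threads-versus-processes point, here is a minimal Julia sketch (the function name and the toy workload are my own, purely illustrative): each thread accumulates into its own slot of a preallocated array, so partial results are shared through memory within a single process rather than being serialized between worker processes.

```julia
using Base.Threads

# Sum of squares, parallelized across threads within one process.
# Each chunk writes to its own slot of `partials`, so no locking is
# needed and no data is copied between processes.
function threaded_sum_sq(xs)
    nchunks = nthreads()
    partials = zeros(eltype(xs), nchunks)
    chunksize = cld(length(xs), nchunks)
    @threads for c in 1:nchunks
        lo = (c - 1) * chunksize + 1
        hi = min(c * chunksize, length(xs))
        s = zero(eltype(xs))
        for i in lo:hi
            s += xs[i]^2
        end
        partials[c] = s
    end
    return sum(partials)  # cheap in-memory reduction of the partials
end
```

With `Distributed` workers the same reduction would require serializing the data (or the partial sums) over interprocess channels; with threads the only cross-task communication is the final `sum(partials)`.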
If you use Spark as intended, i.e. to manage large distributed datasets via the SQL / DataFrame API, then the difference in performance across languages will be negligible.
If you use Spark for something different, e.g. distributed computing in your main language, you are most likely doing the wrong thing.
Spark was designed for processing large amounts of data, in recent versions with a heavy emphasis on SQL-like features. If you want to run distributed computations in Julia itself, take a look at Kubernetes.
Ah this is super helpful and helped me answer the question!