Expected Performance of Julia within a Spark Environment?

TheCedarPrince · December 7, 2022, 5:47pm

Hey folks,

I need some help on answering a question I was posed:

Assuming an algorithm is written in R or Python with some attempt to leverage the optimizations available in a Spark environment (parallelization, etc.), would you expect to see more, the same, or less benefit from using Julia instead of R or Python in that Spark environment?

My initial instinct is to say more: Julia can also leverage Spark through packages like Spark.jl or SparkSQL.jl, so the queries would run on Spark while the analysis would benefit from Julia’s performance. Of course, it also depends on how I write the algorithm in Julia and how much data I’d be crunching. But is there a more nuanced answer? What am I missing? Any additional thoughts? Thanks!

~ tcp

nilshg · December 7, 2022, 8:04pm

Are you asking about the relative speed gain of Julia on Spark vs serial Julia, compared to the same speed gain in other languages? Or about the relative speed of the final Spark implementations across languages?

jling · December 7, 2022, 8:36pm

Doesn’t this mean everything is executed through Spark? So Julia won’t speed things up?

It’s like how using SQL.jl doesn’t make your SQL query faster, because the query is executed by the database engine.
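
A minimal sketch of that analogy, assuming Python and its standard-library `sqlite3` module as stand-ins (neither is from this thread, and the table `t` and the helper `total_by_group` are hypothetical names). The host language only hands a query string to the engine; the scan, grouping, and summation run inside the engine's native code, so swapping the host language would not change that part of the cost.

```python
import sqlite3


def total_by_group(conn):
    # Hypothetical helper: the host language only ships this string to the
    # engine. The scan, GROUP BY, and SUM are executed by SQLite itself.
    query = "SELECT grp, SUM(val) FROM t GROUP BY grp ORDER BY grp"
    return conn.execute(query).fetchall()


# Hypothetical in-memory table used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("a", 2), ("b", 10)],
)

result = total_by_group(conn)
print(result)  # [('a', 3), ('b', 10)]
```

The same reasoning would apply to a Spark SQL or DataFrame query submitted from Julia, R, or Python: the client builds the query, and the engine does the work.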

TheCedarPrince · December 7, 2022, 9:12pm

It’s a bit tricky! I’ve been trying to think it through since I was asked, but to answer you concretely, it would be: “I am asking about the relative speed of the final Spark implementations across languages.” Honestly, I think the best answer here might be “it depends”, but I was wondering if I was missing anything in my not-so-nuanced answer so far. Does that help, Nils?

TheCedarPrince · December 7, 2022, 9:17pm

Ah alright, that’s what I was somewhat intuitively thinking: the performance would come more from utilizing Spark itself than from the mere fact that Julia is running on a platform with Spark. But yeah, in a convo with @mkitti, we concluded loosely that all things being equal (i.e. Spark is leveraged by Julia exactly the same way as by, say, R or Python), Julia has historically been faster/more performant when analyzing large datasets. So, with caveats of course, one could reasonably expect to see a bit more of a performance boost when using Julia in such an environment.

mkitti · December 7, 2022, 9:22pm

The main performance gains I have seen with Julia for distributed computing come from the ability to take advantage of improved intranode communication via threads rather than interprocess communication. Julia’s multilevel parallelism offers you a chance to take advantage of the locality of the parallelization in a way that may be difficult for R or Python. The advantages also then depend on how much processing is happening in native code versus in the scripting languages, especially if one employs precompilation and ahead-of-time compilation in Julia.
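
A rough sketch of the threads-versus-interprocess distinction, written in Python only for illustration (the names `data`, `chunk_sum`, and `bounds` are hypothetical, and nothing here is taken from a Julia or Spark API). Threads on one node read the same in-memory object directly, whereas handing the same partition to another process requires serializing it first. Note the assumption baked into this sketch: CPython's global interpreter lock keeps these threads from running CPU-bound work in parallel, so it shows only the sharing-versus-copying difference, not a speedup.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory partition living on a single node.
data = list(range(100_000))


def chunk_sum(bounds):
    lo, hi = bounds
    # Each thread indexes the shared `data` list directly: no copy is made
    # and nothing is serialized.
    total = 0
    for i in range(lo, hi):
        total += data[i]
    return total


bounds = [(0, 50_000), (50_000, 100_000)]
with ThreadPoolExecutor(max_workers=2) as pool:
    thread_total = sum(pool.map(chunk_sum, bounds))

# Sending the same partition to a separate process would first require
# serializing it; this is the size of that payload in bytes.
ipc_payload_bytes = len(pickle.dumps(data))

print(thread_total)
print(ipc_payload_bytes)
```

The serialized payload is the overhead that shared-memory threading avoids within a node; how much that matters in practice depends on the data size and on how often workers need to exchange it.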

dfdx · December 7, 2022, 9:55pm

If you use Spark as intended, i.e. to manage large distributed datasets using the SQL / DataFrame API, then the difference in performance will be negligible.

If you use Spark for something different, e.g. distributed computing in your main language, you are most likely doing the wrong thing.

Spark was designed for processing large amounts of data, and in recent versions with a heavy emphasis on SQL-like features. If you want to run distributed computations in Julia, take a look at Kubernetes.

TheCedarPrince · December 7, 2022, 10:26pm

Ah this is super helpful and helped me answer the question!
