What do you need from a distributed computation framework?

Recently I was thinking about adding UDF support to Spark.jl, the Julia bindings for Apache Spark. UDFs are how you run custom functions on distributed datasets in Python and Scala nowadays. However, adding them to Julia turned out to be quite a huge task, requiring not just a one-time investment but continuous support in all future releases.
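For context, here is a minimal sketch of the kind of UDF support I mean, using PySpark (the column and function names are just illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

# A tiny distributed DataFrame to run the UDF on.
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Wrap an ordinary Python function so Spark can execute it
# row-by-row on the workers.
capitalize = udf(lambda s: s.capitalize(), StringType())

df.withColumn("name_cap", capitalize(df["name"])).show()
```

Supporting the equivalent in Julia means shipping Julia code to the executors and keeping that mechanism working across Spark releases, which is where the ongoing maintenance burden comes from.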

This made me reconsider the task we are trying to solve. I personally use distributed computing mostly to prepare datasets from files on AWS S3 and, more rarely, to run distributed streaming applications. It would also be great to be able to implement some distributed ML algorithms. But these things can be done with many tools, including Spark, Flink, Julia's ClusterManagers, a custom Kubernetes application, etc. A rough PySpark sketch of that dataset-preparation workload follows.
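This is only an illustration of the workload, not code from Spark.jl; the bucket, paths, and column names are placeholders, and it assumes the S3 connector and credentials are already configured:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("s3-prep").getOrCreate()

# Read raw files from S3 (hypothetical bucket/path; requires the
# hadoop-aws connector and AWS credentials to be set up).
raw = spark.read.json("s3a://my-bucket/raw/events/")

# Simple cleanup: drop rows with missing values, keep two columns.
clean = raw.filter(col("value").isNotNull()).select("id", "value")

# Write the prepared dataset back to S3 as Parquet.
clean.write.mode("overwrite").parquet("s3a://my-bucket/prepared/events/")
```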

To better understand the community's needs, I wonder: what do you need from a distributed computation framework?


Sounds too hard. I think it's a numbers game: if the Julia community got big enough, then Spark would make UDF functionality available. It doesn't feel like something that is sustainable for one or two people in the Julia community to maintain unless they really want it!

Perhaps the right way is for a motivated individual to revive JuliaDB.