What do you need from a distributed computation framework?

Recently I was thinking about adding UDF support to Spark.jl - Julia bindings for Apache Spark. UDFs are how you run custom functions on distributed datasets in Python and Scala nowadays. However, adding them to Julia turned out to be quite a huge task, requiring not just a one-time investment, but continuous support in all future releases.

This made me reconsider the task we are trying to solve. I personally use distributed computing mostly to prepare datasets from files on AWS S3 and, more rarely, to run distributed streaming applications. It would also be great to be able to implement some distributed ML algorithms. But these things can be done with many tools, including Spark, Flink, Julia’s ClusterManagers, a custom Kubernetes application, etc.
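For the dataset-preparation use case, Julia's standard library alternative is often enough. Here is a minimal sketch using the stdlib `Distributed` module (one of the options mentioned above); the `process_chunk` function is a hypothetical stand-in for real per-file work such as parsing a file downloaded from S3:

```julia
using Distributed

addprocs(2)  # spawn two local worker processes

# Define the work function on all workers. In a real pipeline this
# would parse/transform one file; here it is a trivial placeholder.
@everywhere function process_chunk(x)
    return x^2
end

# pmap distributes the items across workers and collects the results.
results = pmap(process_chunk, 1:10)
println(sum(results))  # sum of squares 1..10 → 385
```

The same pattern scales to a real cluster by passing a `ClusterManager` to `addprocs`, which is what makes this a lightweight alternative to a full Spark deployment for embarrassingly parallel jobs.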

To better understand the need from the community: what do you need from a distributed computation framework?


Sounds too hard. I think it’s a numbers game: if the Julia community gets big enough, Spark will make UDF support available. It doesn’t feel like something that is sustainable for one or two people in the Julia community to maintain unless they really want it!

Perhaps the right way is for a motivated individual to revive JuliaDB.