What do you need from a distributed computation framework?

Recently I was thinking about adding UDF support to Spark.jl, the Julia bindings for Apache Spark. UDFs (user-defined functions) are how you run custom functions on distributed datasets in Python and Scala nowadays. However, adding them to Julia turned out to be quite a huge task, requiring not just a one-time investment but continuous support in all future releases.

This made me reconsider the task we are trying to solve. I personally use distributed computing mostly to prepare datasets from files on AWS S3 and, rarely, to run distributed streaming applications. It would also be great to be able to implement some distributed ML algorithms. But these things can be done with many tools, including Spark, Flink, Julia’s ClusterManagers, a custom Kubernetes application, etc.

To better understand the community's needs, I'd like to ask: what do you need from a distributed computation framework?
