I think the question is not how to implement it, but who actually needs it. Some time ago I asked here what people need from a distributed computation framework and got silence in response. Spark has evolved into little more than a flexible SQL engine. Distributed machine learning is mostly about multi-GPU training and has its own frameworks. UDFs are rare, and when you do need one it's usually easier to write it in the framework's native language (e.g., Java/Scala for Spark).