I am very excited about the future of Julia’s streaming use cases. With Transducers.jl and OnlineStats.jl, there is already a rich foundation established.
What I am unable to find is a streaming library similar to what Faust does for Python + Kafka. Of course there are other streaming frameworks out there, but Kafka is certainly among the most widely used platforms.
Is anyone already working on implementing a Julia version of Kafka Streams / Flink / Faust?
If not, I would actually be quite motivated to take Faust as an example and port its ideas to Julia.
Any help, pointers, resources, contacts, etc. are highly welcome, so that this can become a stable and widely used standard package in the Julia ecosystem!
If you do start working on it, do you have a vision for managing distributed processing? Are you planning to build on Julia’s built-in capabilities or use cluster orchestration tools like Kubernetes?
Thanks for the ping! Dagger could one day be an option for implementing streaming/batch processing; however, it’s currently missing the ability to dynamically extend the graph at runtime, which is important for continuous data processing (although I have a PR up that’s working on supporting this).
Dagger would also need some mechanism to ensure that streaming tasks which expect to exchange data are all active at the same time (currently, the scheduler only launches as many tasks as there are worker processes), as well as a mechanism to send/receive data asynchronously between concurrently executing tasks.
All of these are non-trivial to implement flexibly and performantly, so I’ll have to give some thought to the implementation. I’ve noted these items on my to-do list and will try to address them in the next 2-3 months.
It’s definitely possible right now, as long as you manage the inter-task channels yourself (manually allocating and passing RemoteChannels to tasks that need to communicate). This isn’t guaranteed to work as-is in the long term, since we may want to skip scheduling some tasks based on load, memory, or some other metric (which the scheduler has the right to do per Dagger’s semantics). But if that changes, I intend to add a way to force the scheduler to schedule multiple tasks at the same time so that they can communicate.
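To make that concrete, here is a minimal sketch of the manual approach, assuming a recent Dagger with `Dagger.@spawn` and assuming the scheduler actually runs both tasks concurrently (which, as noted above, is not guaranteed); the `produce`/`consume` functions are made up for the example:

```julia
using Distributed
addprocs(2)
@everywhere using Dagger

# Hypothetical producer: writes records into a manually allocated
# RemoteChannel, then a `nothing` sentinel to signal completion.
@everywhere function produce(ch)
    for x in 1:10
        put!(ch, x)
    end
    put!(ch, nothing)
end

# Hypothetical consumer: reads from the same channel until it sees
# the sentinel, accumulating a result.
@everywhere function consume(ch)
    acc = 0
    while (x = take!(ch)) !== nothing
        acc += x
    end
    return acc
end

# Allocate the inter-task channel ourselves and pass it to both tasks.
ch = RemoteChannel(() -> Channel{Any}(32))
Dagger.@spawn produce(ch)
c = Dagger.@spawn consume(ch)
fetch(c)  # 55, once both tasks have run
```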
We don’t have a built-in API for these right now. Most likely, we would provide a way to communicate to Dagger that a given task can be executed multiple times on a collection which supports iterate, and Dagger would automatically transform this into something efficient and maximally parallel.
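Until an API like that exists, one way to approximate it today is to spawn one task per element of the iterable yourself; a rough sketch under that assumption, with `process` standing in for the per-record work:

```julia
using Distributed
addprocs(2)
@everywhere using Dagger

# Hypothetical per-record function; replace with real per-element work.
@everywhere process(x) = x^2

collection = 1:100

# One Dagger task per element; a future API might derive this
# transformation automatically from the iterable.
tasks = [Dagger.@spawn process(x) for x in collection]
results = fetch.(tasks)
```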
Stateful execution and checkpointing would be harder, since the ways in which state may transition are problem-dependent, and naive transformation by Dagger might produce incorrect execution ordering.
For now, I would implement these features manually (which Dagger’s current APIs should be sufficient for), and then we can see what issues we run into to inform further development.
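As a rough illustration of the manual route for the stateful/checkpointing part, here is a sketch using only the Serialization stdlib; the function name, file path, and state transition are all made up for the example:

```julia
using Serialization

# Hypothetical stateful consumer: folds records into `state` and
# checkpoints (offset, state) to disk every `every` records, so a
# restart can skip already-processed records and resume the fold.
function consume_with_checkpoints(records; path="state.jls", every=100)
    offset, state = isfile(path) ? deserialize(path) : (0, 0)
    last_i = offset
    for (i, r) in enumerate(records)
        i <= offset && continue             # skip records already folded
        state += r                          # problem-specific state transition
        last_i = i
        if i % every == 0
            serialize(path, (i, state))     # periodic checkpoint
        end
    end
    serialize(path, (last_i, state))        # final checkpoint
    return state
end

consume_with_checkpoints(1:1_000)
```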