Anyone building a kafka streaming platform for Julia similar to Faust for Python?

Hi all,

I am very excited about the future of Julia’s streaming usecases. With Transducers.jl and OnlineStats.jl there is already a rich foundation established.

What I am unable to find is a streaming library similar to what Faust does for Python + Kafka. Of course there are other streaming frameworks out there, but Kafka is certainly among the most widely used platforms.

Is anyone already working on implementing a Julia version of Kafa Streams / Flink / Faust?

If not I would actually be quite motivated to take Faust as an example and port the ideas to Julia :wink:
Any help, pointers, resources, contacts, etc. is highly welcome so that this is going to be a stable and widely used standard package of the julia ecosystem!

3 Likes

In case you start working on it, do you have any vision for managing distributed processing? Are you planning to build something atop Julia’s built-in capabilities or use cluster orchestration tools like Kubernetes?

Very good question, actually I haven’t thought where exactly the computation is running.
My idea so far is as simple as reimplementing Faust in Julia.

Faust can be spawn on several workers and recently there also seem to be a kubernetes example https://faust.readthedocs.io/en/latest/history/changelog-1.2.html?highlight=kubernetes
I haven’t used Faust yet, but Faust seems to be the go-to implementation for Python. I just expect porting from python to be much easier than porting from Flink or Kafka Streams :wink:

If possible, of course, Julia’s built-in capabilities for cluster orchestration should be supported as well. I would expect that it is actually one of the simpler points to start with.

Can you build off of dagger.jl, so you don’t have to re-implement scheduler, logic etc? It’s actively developed by @jpsamaroo But not sure how much batch processing assumptions are built in.

Here’s a distributed streaming lib built off of dask, dagger’s python analogue. https://github.com/python-streamz/streamz

3 Likes

Thanks for the ping! Dagger could one day be an option for implementing streaming/batch processing, however it’s currently missing the ability to dynamically extend the graph at runtime, which is important for continuous data processing (although I have a PR up that’s working on supporting this).

Dagger would also need some mechanism to ensure that streaming tasks that expect to be able to send/receive data between each other are all active at the same time (currently, the scheduler only launches as many tasks as there are worker processes), as well as a mechanism to send/receive data asynchronously between concurrently-executing tasks.

All of these are non-trivial to implement flexibly and in a performant manner, so I’ll have to give some thought to the implementation. I’ve noted these items down in my Todo list, and will try to address them in the next 2-3 months.

2 Likes

@jpsamaroo I would like to revive the dream of using Julia for streaming purposes.

What do you think is the state of Dagger.jl for streaming processing?
I see it has a distributed DataFrame as of now. Awesome!

In addition, what is the current development / roadmap plans with:

  • supporting unbounded datasets
  • supporting stateful streaming
  • supporting checkpointing, including checkpointing the streaming state
1 Like

Does Dagger have a mechanism for back pressure? eg The History of Credit-based Flow Control (Part 1) | by OneFlow | Jan, 2022 | Medium

Hi @schlichtanders !

It’s definitely possible right now, as long as you manage the inter-task channels yourself (manually allocating and passing RemoteChannels to tasks that need to communicate). This isn’t guaranteed to work as-is in the long-term, since we may want to skip scheduling some tasks based on load, memory, or some other metric (which the scheduler has the right to do per Dagger’s semantics). But when that changes, I intend to add a way to force the scheduler to schedule multiple tasks all at the same time so that they can communicate.

We don’t have a built-in API for these right now. Most likely, we would provide a way to communicate to Dagger that a given task can be executed multiple times on a collection which supports iterate, and Dagger would automatically transform this into something efficient and maximally parallel.

Stateful execution and checkpointing would be harder, since the way in which states may transition are problem-dependent, and naive transformation by Dagger might produce incorrect execution ordering.

For now, I would implement these features manually (which Dagger’s current APIs should be sufficient for), and then we can see what issues we run into to inform further development.

1 Like

Dagger does not have a mechanism for back pressure, because we don’t have an explicit API for actors/stateful services. This would probably be best implemented in a separate package.

1 Like