Anyone building a kafka streaming platform for Julia similar to Faust for Python?

schlichtanders · May 15, 2020, 9:31am

Hi all,

I am very excited about the future of Julia’s streaming usecases. With Transducers.jl and OnlineStats.jl there is already a rich foundation established.

What I am unable to find is a streaming library similar to what Faust does for Python + Kafka. Of course there are other streaming frameworks out there, but Kafka is certainly among the most widely used platforms.

Is anyone already working on implementing a Julia version of Kafa Streams / Flink / Faust?

If not I would actually be quite motivated to take Faust as an example and port the ideas to Julia
Any help, pointers, resources, contacts, etc. is highly welcome so that this is going to be a stable and widely used standard package of the julia ecosystem!

dfdx · May 15, 2020, 1:39pm

In case you start working on it, do you have any vision for managing distributed processing? Are you planning to build something atop Julia’s built-in capabilities or use cluster orchestration tools like Kubernetes?

schlichtanders · May 15, 2020, 2:41pm

Very good question, actually I haven’t thought where exactly the computation is running.
My idea so far is as simple as reimplementing Faust in Julia.

Faust can be spawn on several workers and recently there also seem to be a kubernetes example Change history for Faust 1.2 — Faust 1.9.0 documentation
I haven’t used Faust yet, but Faust seems to be the go-to implementation for Python. I just expect porting from python to be much easier than porting from Flink or Kafka Streams

If possible, of course, Julia’s built-in capabilities for cluster orchestration should be supported as well. I would expect that it is actually one of the simpler points to start with.

Ratingulate · May 15, 2020, 2:57pm

Can you build off of dagger.jl, so you don’t have to re-implement scheduler, logic etc? It’s actively developed by @jpsamaroo But not sure how much batch processing assumptions are built in.

Here’s a distributed streaming lib built off of dask, dagger’s python analogue. GitHub - python-streamz/streamz: Real-time stream processing for python

jpsamaroo · May 16, 2020, 12:11pm

Thanks for the ping! Dagger could one day be an option for implementing streaming/batch processing, however it’s currently missing the ability to dynamically extend the graph at runtime, which is important for continuous data processing (although I have a PR up that’s working on supporting this).

Dagger would also need some mechanism to ensure that streaming tasks that expect to be able to send/receive data between each other are all active at the same time (currently, the scheduler only launches as many tasks as there are worker processes), as well as a mechanism to send/receive data asynchronously between concurrently-executing tasks.

All of these are non-trivial to implement flexibly and in a performant manner, so I’ll have to give some thought to the implementation. I’ve noted these items down in my Todo list, and will try to address them in the next 2-3 months.

schlichtanders · March 14, 2022, 4:37pm

@jpsamaroo I would like to revive the dream of using Julia for streaming purposes.

What do you think is the state of Dagger.jl for streaming processing?
I see it has a distributed DataFrame as of now. Awesome!

In addition, what is the current development / roadmap plans with:

supporting unbounded datasets
supporting stateful streaming
supporting checkpointing, including checkpointing the streaming state

jar1 · March 14, 2022, 8:45pm

Does Dagger have a mechanism for back pressure? eg The History of Credit-based Flow Control (Part 1) | by OneFlow | Medium

jpsamaroo · March 25, 2022, 4:55pm

Hi @schlichtanders !

It’s definitely possible right now, as long as you manage the inter-task channels yourself (manually allocating and passing RemoteChannels to tasks that need to communicate). This isn’t guaranteed to work as-is in the long-term, since we may want to skip scheduling some tasks based on load, memory, or some other metric (which the scheduler has the right to do per Dagger’s semantics). But when that changes, I intend to add a way to force the scheduler to schedule multiple tasks all at the same time so that they can communicate.

We don’t have a built-in API for these right now. Most likely, we would provide a way to communicate to Dagger that a given task can be executed multiple times on a collection which supports iterate, and Dagger would automatically transform this into something efficient and maximally parallel.

Stateful execution and checkpointing would be harder, since the way in which states may transition are problem-dependent, and naive transformation by Dagger might produce incorrect execution ordering.

For now, I would implement these features manually (which Dagger’s current APIs should be sufficient for), and then we can see what issues we run into to inform further development.

jpsamaroo · March 25, 2022, 5:00pm

Dagger does not have a mechanism for back pressure, because we don’t have an explicit API for actors/stateful services. This would probably be best implemented in a separate package.

Topic		Replies	Views
Julia in streaming - best practises? General Usage question	3	1098	March 23, 2017
Data stream library comparable to Python streamz or tributary Data	0	790	July 2, 2019
Help find best approach to streaming data on raspberry pi General Usage	16	841	October 31, 2023
Is Julia a good choice for Data Engineering? General Usage question	11	5426	January 29, 2022
What libraries are available to work with streams? General Usage streaming	1	435	October 20, 2023

Anyone building a kafka streaming platform for Julia similar to Faust for Python?

Related topics