My company often uses Luigi to manage complex pipelines. We are doing more and more programming in Julia here, so I’ve been kicking around the idea of a native Julia solution for defining and managing a pipeline.
Would that be useful to anyone? If so, any features that are particularly important to you? Any output targets that you’d like to see? Anything that bugs you about Luigi that you definitely don’t want?
Reading about acyclic graphs after learning about Dagger.jl is what actually got me thinking about this! I’d definitely appreciate any advice/thoughts on how Dagger handles task graphing.
Dagger supports declaring task graphs in pure Julia, allowing you to express complex sets of dependencies and ensure proper task ordering. Tasks may also spawn new tasks, optionally waiting on them to complete or fetching their results, which makes it easy to spawn entire subgraphs on demand. Dagger also supports checkpointing task results to disk (or any arbitrary storage location); if the process crashes or fails, Dagger can later restore those results instead of executing the task again. Additionally, Dagger has a limited fault-tolerance capability that restarts tasks when a worker dies, so that the task DAG can still complete when workers fail.
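To make the task-graph part concrete, here is a minimal sketch of a three-step pipeline using `Dagger.@spawn` and `fetch`. The step functions `load`, `clean`, and `summarize` are hypothetical names made up for illustration and are not part of Dagger. It assumes Dagger is installed and that passing one spawned task as an argument to another is what creates the dependency edge.

```julia
using Dagger

# Hypothetical pipeline steps, defined here only for illustration.
load(n) = collect(1:n)           # pretend this reads raw data
clean(xs) = filter(isodd, xs)    # pretend this drops bad records
summarize(xs) = sum(xs)          # pretend this computes a report value

# Each @spawn call returns immediately with a handle to the task.
# Passing a handle as an argument declares a dependency, so Dagger
# runs `clean` only after `load` has finished, and so on.
raw     = Dagger.@spawn load(10)
cleaned = Dagger.@spawn clean(raw)

# Two tasks depend on `cleaned`; they are free to run in parallel.
total  = Dagger.@spawn summarize(cleaned)
n_kept = Dagger.@spawn length(cleaned)

# fetch blocks until the task (and everything upstream) is done.
result = (fetch(total), fetch(n_kept))
```

The graph is never written out explicitly; it falls out of which task handles are passed to which calls. Checkpointing and fault tolerance are configured separately and are not shown here.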
Right now, Dagger’s fault tolerance and checkpointing infrastructure is still very much a WIP, and most certainly will not compare to what Luigi can recover from. This is something I’d like to improve, but I’d need help from the community to determine where our robustness falls short so that it can be fixed/improved.
Dagger also doesn’t have any native support for running jobs on Hadoop or via other programs/languages, but you can always shell out (via run) to another program, so it is possible to roll your own integration.
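As a rough idea of what rolling your own integration could look like, the sketch below wraps an external command in an ordinary Julia function using only Base. The function name `external_step` is hypothetical, and `echo` stands in for whatever program you actually need to call (this assumes a Unix-like system where `echo` is on the PATH).

```julia
# Hypothetical wrapper around an external program; `echo` is a stand-in
# for a real tool such as a Hadoop client or a script in another language.
function external_step(msg::AbstractString)
    cmd = `echo $msg`            # backticks build a Cmd; $msg is safely quoted
    out = read(cmd, String)      # runs the command and captures stdout;
                                 # throws if the process exits with an error
    return strip(out)            # drop the trailing newline
end

out = external_step("hello from outside Julia")

# When the output is not needed, `run` executes the command and
# throws on a non-zero exit status.
run(`echo done`)
```

Since `external_step` is just a regular function, it should in principle be usable as a task like any other Julia function, with a failed command surfacing as an exception in that task.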
Sure thing! If you feel like you’re willing to give Dagger a shot at replacing Luigi for your use case, then feel free to reach out with any problems you run into.