I’m trying to figure out if there is a name for a design pattern I keep coming back to.
Intuitively, when designing a pipeline of tasks, I start with “Given an input, first do this, then that, and then output the result.”
This works well, except that upstream tasks can't easily depend on what's happening downstream.
For example, I have a pipeline that reads parquet files, flatmaps them into a stream of row groups, packs those into larger row groups up to a certain row count, partitions those into sets, and writes each set out to a new parquet file. But if I want to make sure an input file doesn't get split across two output files, the row-group packing step needs to know whether the current output file is getting full.
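Roughly, the push-style version is a chain of generators like this (a simplified sketch using pyarrow; the function names and thresholds are made up for illustration, not my real code):

```python
from typing import Iterable, Iterator

import pyarrow as pa
import pyarrow.parquet as pq


def row_groups(paths: Iterable[str]) -> Iterator[pa.Table]:
    """Flatmap input files into a stream of row groups."""
    for path in paths:
        pf = pq.ParquetFile(path)
        for i in range(pf.num_row_groups):
            yield pf.read_row_group(i)


def pack(groups: Iterator[pa.Table], max_rows: int) -> Iterator[pa.Table]:
    """Combine small row groups into larger ones of up to max_rows rows."""
    buffer, buffered = [], 0
    for g in groups:
        buffer.append(g)
        buffered += g.num_rows
        if buffered >= max_rows:
            yield pa.concat_tables(buffer)
            buffer, buffered = [], 0
    if buffer:
        yield pa.concat_tables(buffer)


def write_files(packed: Iterator[pa.Table], rows_per_file: int, prefix: str) -> None:
    """Partition packed row groups into output files of roughly rows_per_file rows."""
    batch, count, file_index = [], 0, 0
    for g in packed:
        # The problem: by the time a packed group arrives here, pack() has
        # already chosen its boundaries without knowing how full this output
        # file is, so one input file can end up straddling two outputs.
        batch.append(g)
        count += g.num_rows
        if count >= rows_per_file:
            pq.write_table(pa.concat_tables(batch), f"{prefix}-{file_index}.parquet")
            batch, count, file_index = [], 0, file_index + 1
    if batch:
        pq.write_table(pa.concat_tables(batch), f"{prefix}-{file_index}.parquet")
```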
In comes the design pattern:
Data pipeline orchestration tools often require you to define the requirements that must be satisfied to run a task. To kick off a pipeline, you request that the last step be completed, and then the orchestrator walks the DAG to fill in upstream requirements.
In this case, if I start with the output file, it can require a stream of packed row groups, and each of those can require a stream of input files to pack.
By making the final task the parent of the DAG, upstream tasks inherit its requirements, so the downstream constraint no longer has to be communicated backwards.
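To make that concrete, here is a toy version of the idea (the class names, fields, and the resolver are all invented for illustration; real orchestrators are obviously more involved):

```python
from dataclasses import dataclass


@dataclass
class PackGroups:
    """Pack whole input files into row groups, up to rows_available rows."""
    input_files: list[str]
    rows_available: int  # budget handed down by whoever requires this task

    def requires(self):
        return []  # leaf: just reads the listed files


@dataclass
class OutputFile:
    """One output parquet file; it knows how many rows it can still take."""
    path: str
    input_files: list[str]
    max_rows: int

    def requires(self):
        # The downstream constraint (how full this file may get) is part of
        # the requirement itself, so the packer is told its budget up front
        # and can keep each input file whole within one output file.
        return [PackGroups(self.input_files, rows_available=self.max_rows)]


def resolve(task, plan=None):
    """Walk requirements depth-first, starting from the result we want."""
    plan = [] if plan is None else plan
    for dep in task.requires():
        resolve(dep, plan)
    plan.append(task)
    return plan


# Request the end result; the resolver fills in the upstream steps.
plan = resolve(OutputFile("out-0.parquet", ["a.parquet", "b.parquet"], max_rows=1_000_000))
```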
Anyway, is there a name for building a DAG of requirements starting from the result that you want?