Design Pattern “Start from the end”?

I’m trying to figure out if there is a name for a design pattern I keep coming back to.

Intuitively, when designing a pipeline of tasks, I start with “Given an input, first do this, then that, and then output the result.”

This works well, except that upstream tasks can’t easily make their behavior depend on what happens downstream.

For example, I have a pipeline that reads in parquet files, flatmaps them into a stream of row groups, packs those into larger row groups up to a certain row count, partitions those into sets, and writes each set out to a new parquet file. But if I want to make sure an input file does not get split across two output files, the row-group packing step needs to know whether the output file is getting full. A sketch of the forward version is below.
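Here is a minimal sketch of that forward (“start from the input”) version, assuming pyarrow; the helper names and the 1,000,000-row threshold are illustrative, not the actual pipeline:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def row_groups(paths):
    """Flatmap parquet files into a stream of row-group tables."""
    for path in paths:
        f = pq.ParquetFile(path)
        for i in range(f.num_row_groups):
            yield f.read_row_group(i)

def pack(tables, max_rows=1_000_000):
    """Pack small row groups into larger ones of up to max_rows rows.
    The problem: this stage cannot see how full the downstream output
    file is, so it cannot keep an input file within one output file."""
    batch, rows = [], 0
    for t in tables:
        if batch and rows + t.num_rows > max_rows:
            yield pa.concat_tables(batch)
            batch, rows = [], 0
        batch.append(t)
        rows += t.num_rows
    if batch:
        yield pa.concat_tables(batch)
```

The comment in `pack` marks exactly where the communication problem lives: by the time a row group is packed, the packer has no way to see how full the output file is.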

In comes the design pattern:

Data pipeline orchestration tools often require you to define the requirements that must be satisfied to run a task. To kick off a pipeline, you request that the last step be completed, and then the orchestrator walks the DAG to fill in upstream requirements.
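A toy sketch of that resolution step, with a hypothetical `Task` protocol loosely modeled on orchestrators like Luigi (not any real library’s API):

```python
class Task:
    def requires(self):
        return []
    def run(self):
        pass

class ReadFiles(Task):
    def run(self):
        print("read parquet files")

class PackRowGroups(Task):
    def requires(self):
        return [ReadFiles()]
    def run(self):
        print("pack row groups")

class WriteOutput(Task):
    def requires(self):
        return [PackRowGroups()]
    def run(self):
        print("write output file")

def execute(task, done=None):
    """Satisfy a task's requirements depth-first, then run the task."""
    done = set() if done is None else done
    if type(task) in done:  # dedupe tasks by type in this toy example
        return
    for req in task.requires():
        execute(req, done)
    task.run()
    done.add(type(task))

# Kick off the pipeline by requesting only the *last* step:
execute(WriteOutput())  # -> read parquet files, pack row groups, write output file
```

Requesting `WriteOutput` is enough; the walker discovers and runs `ReadFiles` and `PackRowGroups` first.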

In this case, if I start with the output file, it can require a stream of packed row groups. Each of those can, in turn, require a stream of input files to pack.

By making the final task the parent of the DAG, upstream tasks inherit its requirements, so the downstream-to-upstream communication problem goes away (see the sketch below).
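To make that concrete, here is a minimal pull-based sketch under the same assumptions as the earlier one (pyarrow, illustrative names and thresholds). The output file is the root requirement: it pulls packed row groups and passes its remaining capacity downstream, so an input file is never split across two outputs:

```python
import pyarrow as pa
import pyarrow.parquet as pq

class Pushback:
    """Iterator with one-item pushback, so a file that doesn't fit the
    current output file can be returned to the stream for the next one."""
    def __init__(self, it):
        self.it, self.buf = iter(it), []
    def __iter__(self):
        return self
    def __next__(self):
        return self.buf.pop() if self.buf else next(self.it)
    def push(self, item):
        self.buf.append(item)

def next_pack(files, room, fresh):
    """One packed row group of at most `room` rows, built from whole input
    files only. Returns None to ask the caller to rotate output files
    (or when the input stream is exhausted)."""
    batch, rows = [], 0
    for t in files:
        if rows + t.num_rows > room:
            if not batch and fresh:
                batch = [t]        # too big even for a fresh output: keep whole anyway
            else:
                files.push(t)      # doesn't fit here; retry in the next pack/file
            break
        batch.append(t)
        rows += t.num_rows
    return pa.concat_tables(batch) if batch else None

def write_all(paths, rows_per_file=10_000_000, group_rows=1_000_000):
    # Read each input file whole (kept simple here) so it is never split.
    files = Pushback(pq.read_table(p) for p in paths)
    n = 0
    while True:
        writer, written = None, 0
        # The output file is the root requirement: it pulls packed row
        # groups until full; each pack in turn pulls whole input files.
        while written < rows_per_file:
            pack = next_pack(files, min(group_rows, rows_per_file - written),
                             fresh=(written == 0))
            if pack is None:
                break
            if writer is None:
                writer = pq.ParquetWriter(f"out_{n}.parquet", pack.schema)
            writer.write_table(pack)
            written += pack.num_rows
        if writer is None:   # nothing left to pull: stream exhausted
            return
        writer.close()
        n += 1
```

The key difference from the forward version is the `room` argument flowing down the requirement chain: the packer no longer needs a side channel to learn how full the output file is.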

Anyway, is there a name for building a DAG of requirements starting from the result that you want?

Dataflow programming / datastream programming refers to the general concept of building the DAG first, and then using the fact that you have all the information available to execute the program efficiently, in parallel, etc. See Dataflow programming - Wikipedia.

It doesn’t explicitly mention starting from the end, but I think the fact that you know all the requirements at the beginning is an important part of the paradigm.
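For illustration, a tiny sketch of the “build the DAG first, then execute” idea using Python’s standard-library graphlib; the node names are just placeholders:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Declare the whole DAG up front (node -> set of its prerequisites)...
dag = {
    "write_output": {"pack_row_groups"},
    "pack_row_groups": {"read_files"},
    "read_files": set(),
}

# ...then let a scheduler pick the execution order. A real dataflow
# engine could also run independent nodes in parallel at this point.
for node in TopologicalSorter(dag).static_order():
    print("running", node)  # -> read_files, pack_row_groups, write_output
```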
