Planning on building an ETL Framework

I am planning to build a package for ETL processes as my Bachelor's thesis.
My plan is to have the package be a simple way to build data pipelines following the ETL pattern: extract data from different sources, transform it, and then load it back into storage or pass it on for use elsewhere.

I am still trying to figure out what I should include in the package. So far I am thinking about the following (a rough sketch of what the API could look like follows the list):

  • connectivity to multiple data sources
  • data transformation with custom pipelines
    • might add support for data cleaning
    • might add aggregations, filters, joins, splits
  • multithreading / parallel computing
  • logging and monitoring
  • multiple sources and targets per pipeline
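
To make the scope concrete, here is a minimal sketch of what the user-facing API could look like. Every name in it (Pipeline, run_pipeline, the in-memory source and the println target) is invented for illustration; this is not an existing package, just one way the pieces could fit together.

```julia
using DataFrames

# Hypothetical core type: one extract step, an ordered list of transforms, one load step.
struct Pipeline
    extract::Function                # () -> DataFrame
    transforms::Vector{Function}     # DataFrame -> DataFrame
    load::Function                   # DataFrame -> Nothing
end

function run_pipeline(p::Pipeline)
    df = p.extract()                 # extract from the source
    for t in p.transforms            # apply transformation steps in order
        df = t(df)
    end
    p.load(df)                       # load into the target
    return nothing
end

# Usage with in-memory data standing in for real connectors.
etl = Pipeline(
    () -> DataFrame(id = 1:4, value = [1.0, 2.0, 3.0, 4.0]),
    [df -> filter(:value => (v -> v > 1.5), df),
     df -> transform(df, :value => ByRow(x -> 2x) => :doubled)],
    df -> println("would write $(nrow(df)) rows to the target")
)

run_pipeline(etl)
```

For the multi-source/multi-target and parallelism points, the extract step could return several tables and independent pipelines could be scheduled with Threads.@spawn, but that is beyond this first sketch.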

It would be helpful to get some insights from you guys.
Please tell me if I am missing anything important.


I tried something similar a while ago when I needed an automation solution for work. I started by building very basic custom pipelines that read data from databases and Parquet files (DuckDB.jl), cleaned and aggregated the data (DataFrames.jl), and saved the results back to Parquet files.
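
In case it is useful as a reference point, here is a minimal sketch of that kind of round trip. The file and column names are placeholders; the DuckDB.jl and DataFrames.jl calls themselves follow the documented usage.

```julia
using DuckDB
using DataFrames

# In-memory DuckDB instance; the Parquet paths and column names below are placeholders.
con = DBInterface.connect(DuckDB.DB, ":memory:")

# Extract: DuckDB can read Parquet files directly from SQL.
raw = DataFrame(DBInterface.execute(con, "SELECT * FROM read_parquet('input.parquet')"))

# Transform: cleaning and aggregation in DataFrames.jl.
clean = dropmissing(raw)
agg = combine(groupby(clean, :category), :amount => sum => :total)

# Load: register the Julia DataFrame as a view and let DuckDB write the Parquet file.
DuckDB.register_data_frame(con, agg, "agg")
DBInterface.execute(con, "COPY (SELECT * FROM agg) TO 'output.parquet' (FORMAT PARQUET)")
```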

I can’t find the threads now, but I remember there were some earlier attempts to build a framework like this. However, I think they ended up abandoned.

While I found Julia quite suitable for the task, in my project I found myself basically reinventing dlt and dbt, so I just ended up using those existing solutions instead.


The advantage of Julia would be using the REPL to write custom Julia code and then registering it as a UDF that runs inside DuckDB. This would leverage C-level function performance to run models or optimizations within the stream processing.
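
I have not used the Julia-side UDF registration myself, so rather than guess at that API, here is a sketch of the adjacent pattern that works today: write an arbitrary Julia function at the REPL, apply it to a batch pulled from DuckDB, and register the result back as a view so the rest of the SQL pipeline can keep querying it. The function, table, and column names are invented.

```julia
using DuckDB
using DataFrames

con = DBInterface.connect(DuckDB.DB, ":memory:")

# Any Julia function written at the REPL; a stand-in for a model or optimization step.
score(x) = 1 / (1 + exp(-x))

# Pull a batch out of DuckDB (range(5) is just dummy data) and apply the Julia code...
batch = DataFrame(DBInterface.execute(con, "SELECT range AS x FROM range(5)"))
batch.scored = score.(batch.x)

# ...then register the result so downstream SQL can keep using it.
DuckDB.register_data_frame(con, batch, "scored_batch")
DBInterface.execute(con, "SELECT avg(scored) FROM scored_batch")
```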