I am planning to build a package for ETL processes as my Bachelor thesis.
My plan is for the package to be a simple way to build data pipelines following the ETL pattern: extract data from different sources, transform it, and then load it back into storage or hand it off elsewhere.
I am still trying to figure out what I should include in the package. So far I am thinking about the following (a rough sketch follows the list):
- connectivity to multiple data sources
- data transformation with custom pipelines
- might add support for data cleaning
- might add aggregations, filters, joins, splits
- multi-threading / parallel computing
- logging and monitoring
- multiple sources and targets
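
To make this more concrete, here is a minimal sketch of the core abstraction I have in mind. Everything here is hypothetical, nothing is implemented yet, and the real version would work on tables rather than toy vectors:

```julia
# A pipeline is just an extract function, a chain of transforms, and a load function.
struct Pipeline
    extract::Function              # () -> data
    transforms::Vector{Function}   # data -> data
    load::Function                 # data -> nothing
end

function run_pipeline(p::Pipeline)
    data = p.extract()
    for t in p.transforms
        data = t(data)             # each step takes data and returns transformed data
    end
    p.load(data)
end

# Toy usage with a vector standing in for a real source/target:
p = Pipeline(
    () -> [1, 2, 3, missing, 5],
    [xs -> collect(skipmissing(xs)), xs -> xs .* 2],
    xs -> println("would write ", xs, " to storage"),
)
run_pipeline(p)
```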
It would be helpful to get some insights from you guys.
Please tell me if I am missing anything important.
I tried something similar a while ago when I needed an automation solution for work. I started building very basic custom pipelines: reading data from databases and Parquet files (DuckDB.jl), cleaning and aggregating the data (DataFrames.jl), and saving the results back to Parquet files.
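
For reference, a minimal sketch of that kind of workflow (the file and column names are made up for the example):

```julia
using DuckDB, DataFrames

# In-memory DuckDB connection (a file path would give a persistent database)
con = DBInterface.connect(DuckDB.DB, ":memory:")

# Extract: read a Parquet file through DuckDB into a DataFrame
df = DataFrame(DBInterface.execute(con,
    "SELECT * FROM read_parquet('sales.parquet')"))

# Transform: clean and aggregate with DataFrames.jl
clean = dropmissing(df, [:region, :amount])
agg   = combine(groupby(clean, :region), :amount => sum => :total_amount)

# Load: register the result with DuckDB and write it back to Parquet
DuckDB.register_data_frame(con, agg, "agg")
DBInterface.execute(con,
    "COPY (SELECT * FROM agg) TO 'sales_by_region.parquet' (FORMAT PARQUET)")

DBInterface.close!(con)
```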
I can’t find the threads now, but I remember there were some early attempts to build a framework like this. However, I think they ended up abandoned.
While I found Julia quite suitable for the task, in my project I found myself basically reinventing dlt and dbt, so I just ended up using those existing solutions instead.
The advantage of Julia would be to use the REPL to write custom code in Julia and then register it as a UDF that runs inside DuckDB. This would leverage C-level function performance to run models or optimizations within the stream processing.
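
If I remember correctly, DuckDB.jl exposes scalar UDF registration roughly along these lines; treat the exact macro and function names as my assumption and check the current DuckDB.jl docs before relying on them:

```julia
using DuckDB, DataFrames

con = DBInterface.connect(DuckDB.DB, ":memory:")

# Plain Julia function developed interactively in the REPL
score(x) = 2x + 1

# Register it as a scalar UDF so it can be called from SQL.
# NOTE: assumed API; the exact macro/registration names and argument
# types may differ between DuckDB.jl versions.
fun = DuckDB.@create_scalar_function score(x::Int64)::Int64
DuckDB.register_scalar_function(con, fun)

DBInterface.execute(con, "CREATE TABLE t AS SELECT range AS x FROM range(5)")
df = DataFrame(DBInterface.execute(con, "SELECT x, score(x) AS s FROM t"))
```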