Best Practices for Harmonizing Datasets

TheCedarPrince · April 1, 2021, 2:52pm

Hi everyone!

I have a question about data manipulation that is more on the conceptual side:

Say you have multiple datasets that you are trying to harmonize. Each dataset requires its own amount of processing to get it into a form that can be merged with the others. How should I best structure my project for harmonizing them?

Right now I am using DrWatson.jl to organize everything (i.e. data, scripts, packages, constants), which works really well at a high level. However, my processing pipeline so far consists of a separate script for each dataset, detailing how to parse that dataset into a harmonized template. It is a bit clunky to have to call individual scripts, so I am wondering whether the best way to structure my pipeline is to modularize these scripts into something more like function-based units.
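
To make "function-based units" concrete, here is a minimal sketch of what I have in mind. Everything in it is hypothetical: the dataset names, the raw column names (`participant`, `visit_date`, `score`, `ID`, `timestamp`, `measurement`), and the three-column template `(id, date, value)` are made up for illustration, and rows are plain NamedTuples so that nothing beyond Base Julia is needed.

```julia
# Each dataset gets one small function that maps its raw rows onto the
# (hypothetical) harmonized template with columns :id, :date, :value.
harmonize_survey_a(rows) =
    [(id = r.participant, date = r.visit_date, value = r.score) for r in rows]

# Assumption for illustration: survey B stores its measurement scaled by 100.
harmonize_survey_b(rows) =
    [(id = r.ID, date = r.timestamp, value = r.measurement / 100) for r in rows]

# The pipeline takes `harmonizer => raw_rows` pairs and concatenates the
# harmonized rows in the order given.
function harmonize_all(sources)
    merged = NamedTuple[]
    for (harmonize, rows) in sources
        append!(merged, harmonize(rows))
    end
    return merged
end

# Usage with two tiny made-up datasets.
raw_a = [(participant = 1, visit_date = "2021-04-01", score = 2.0)]
raw_b = [(ID = 2, timestamp = "2021-04-02", measurement = 250)]

merged = harmonize_all([harmonize_survey_a => raw_a,
                        harmonize_survey_b => raw_b])
```

With this shape, adding a dataset would mean writing one more function and adding one more pair, instead of calling one more script.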

Does anyone have any thoughts on this? Thank you!

~ tcp

rikh · April 1, 2021, 3:26pm

I think that working in packages is a better idea than working in scripts. The problem with scripts is, as you noticed, that they are hard to combine. This is what packages are meant to solve. You also get unit testing, documentation, and many other package-related things for free.

For example, I personally have these kinds of data transformations in my Codex package (GitHub: rikhuijzer/Codex.jl, "Helper functions"). (Please don’t look at the code, it’s pretty bad, but it works.) Currently, I have about six other packages depending on that one, and this setup works reliably.
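
As a rough illustration of the general idea (this is not taken from Codex.jl), here is a minimal sketch of a transformation living in a module together with a unit test. The module name `Harmonize` and the function `rename_columns` are hypothetical. In a real package the module would sit in `src/Harmonize.jl` and the test in `test/runtests.jl`; they are shown in one file here only so the sketch can be run as-is.

```julia
module Harmonize  # hypothetical package name

export rename_columns

"""
    rename_columns(row::NamedTuple, mapping::AbstractDict{Symbol,Symbol})

Return a copy of `row` with its keys renamed according to `mapping`.
Keys that do not appear in `mapping` are kept unchanged.
"""
function rename_columns(row::NamedTuple, mapping::AbstractDict{Symbol,Symbol})
    new_keys = Tuple(get(mapping, k, k) for k in keys(row))
    return NamedTuple{new_keys}(values(row))
end

end # module

using .Harmonize
using Test  # standard library

@testset "rename_columns" begin
    row = (participant = 1, score = 2.0)
    @test rename_columns(row, Dict(:participant => :id)) == (id = 1, score = 2.0)
end
```

Other packages or scripts can then load the shared function with `using` rather than copying it around.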

TheCedarPrince · April 1, 2021, 3:56pm

Hey @rikh - I really like your idea of Codex (haha, don’t worry - I didn’t look at your code!). I figured the best solution was to turn more of the code into functions, but I wasn’t sure if there were other approaches I should look at. Thanks for the thoughts, Rik! (Sidebar: hope your work is going well! I was looking at Books.jl again the other day. Great stuff!)

rikh · April 1, 2021, 4:04pm

True, good question. There might be. Let’s hope that others still dare to reply to this topic after the strong stance in my comment. I’ll try to nuance it a bit more.

Thanks, Jacob!

TheCedarPrince · April 1, 2021, 4:10pm

Oh, I completely agree with your stance. I was wondering whether there is something else, or a paradigm, that fits within the notion of putting things into packages. Similar to how there is DrWatson.jl for organizing the components of a project, I was wondering if there is a paradigm or meta-package, if you will, that I am unaware of for helping with the actual organization of the code.

Hunh. Actually, that might be a good idea for a package.

rikh · April 1, 2021, 4:26pm

I don’t know whether I understand correctly, but maybe you mean something like PkgTemplates.jl (GitHub: JuliaCI/PkgTemplates.jl, "Create new Julia packages, the easy way")?
