Best Practices for Harmonizing Datasets

Hi everyone!

I have a question about data manipulation that is more on the conceptual side:

Say you have multiple datasets that you are trying to harmonize. Each dataset requires its own amount of processing to get it into a form that can be merged with the others. How should I best structure my project for harmonizing?

Right now I am using DrWatson.jl to organize everything (i.e. data, scripts, packages, constants), which works really well at a high level. However, my processing pipeline so far consists of a separate script for each dataset, detailing how to parse that dataset into a harmonized template. Having to call the individual scripts is a bit clunky, so I am wondering whether the best way to structure my pipeline is to modularize these scripts into function-based units.

Does anyone have any thoughts on this? Thank you!

~ tcp :deciduous_tree:

I think that working in packages is a better idea than working in scripts. The problem with scripts is, as you noticed, that they are hard to combine. This is what packages are meant to solve. Also, you get unit testing, documentation and many other package-related things for free.

For example, I personally have these kinds of data transformations in my Codex package (GitHub - rikhuijzer/Codex.jl: Helper functions). (Please don’t look at the code, it’s pretty bad, but it works.) Currently, I have about 6 other packages depending on that one and this setup works reliably.
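To make the "functions instead of scripts" idea a bit more concrete, here is a minimal sketch of how such a package module could be laid out. Everything in it is an assumption made up for illustration: the module name `Harmonize`, the `Row` template, the two datasets `a` and `b`, and their column names are hypothetical and are not taken from Codex.jl or DrWatson.jl. It uses only Base Julia.

```julia
# Hypothetical layout: one parsing function per source dataset,
# each returning rows in the same harmonized shape.
module Harmonize

export harmonize_all

# The harmonized template (made up): every parser returns rows with these fields.
const Row = NamedTuple{(:id, :age_years, :source), Tuple{String, Float64, String}}

# Dataset "a" (made up): ages are already in years.
function parse_a(raw)
    return Row[(id = string(r.id), age_years = Float64(r.age), source = "a") for r in raw]
end

# Dataset "b" (made up): ages are in months and the id column is called `subject`.
function parse_b(raw)
    return Row[(id = string(r.subject), age_years = r.age_months / 12, source = "b") for r in raw]
end

# Registry mapping a dataset name to its parser.
# Adding a new dataset means writing one function and adding one entry here.
const PARSERS = Dict{Symbol, Function}(:a => parse_a, :b => parse_b)

# Run each parser on its raw input and concatenate the harmonized rows.
# Names are sorted so the output order does not depend on Dict ordering.
function harmonize_all(raw_by_name)
    rows = Row[]
    for name in sort!(collect(keys(raw_by_name)))
        append!(rows, PARSERS[name](raw_by_name[name]))
    end
    return rows
end

end # module

# Usage with tiny in-memory stand-ins for the raw datasets.
using .Harmonize

raw = Dict(
    :a => [(id = 1, age = 30)],
    :b => [(subject = "x", age_months = 24)],
)
rows = harmonize_all(raw)
```

The point of this shape is that each dataset's quirks stay inside its own small function, which can be unit tested on its own, while a single entry point replaces calling the individual scripts one by one.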


Hey @rikh - I really like your idea of Codex (haha, don’t worry - I didn’t look at your code! :wink:). I figured the best solution was to turn more of the code into functions, but I wasn’t sure if there were other approaches I should look at. Thanks for the thoughts Rik! (sidebar: hope your work is going well! I was looking at Books.jl again the other day. Great stuff!)


True, good question. There might be. Let’s hope that others still dare to reply to this topic after the strong stance in my comment. I’ll try to nuance it a bit more.

Thanks, Jacob! :slightly_smiling_face:


Oh, I completely agree with your stance. I was wondering whether there is something else, or a paradigm, that fits within the notion of putting things into packages. Similar to how there is DrWatson.jl for organizing the components of a project, I was wondering if there is a paradigm or meta-package, if you will, that I am unaware of for helping with the actual organization of the code.

Hunh. Actually, that might be a good idea for a package. :thinking:

I don’t know whether I understand correctly, but maybe you mean something like GitHub - JuliaCI/PkgTemplates.jl: Create new Julia packages, the easy way?
