The way I do this currently: I have one dataframe with the main quantitative data (samples as columns, features as rows), and a separate dataframe with all of the per-sample metadata (samples as rows, metadata fields as columns). I often have to construct that metadata table piecemeal, duplicating a lot of the subject-specific metadata by mapping samples -> subjects and then subjects -> subject metadata.
This sounds a lot like what databases are designed for. You have multiple tables with foreign keys.
- Subjects table: subject_id, subject_name, gender, diagnosis
- Samples table: sample_id, subject_id, collection_date, collection_location, sample_data
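If it helps to see the shape of this, here is a minimal sketch of the same two-table layout as in-memory DataFrames (toy data and values are made up; the column names follow the schema above):

```julia
# Toy versions of the two tables as DataFrames (hypothetical data).
using DataFrames, Dates

subjects = DataFrame(
    subject_id   = [1, 2],
    subject_name = ["S01", "S02"],
    gender       = ["F", "M"],
    diagnosis    = ["hypertension", "healthy"])

samples = DataFrame(
    sample_id           = [10, 11, 12],
    subject_id          = [1, 1, 2],   # foreign key into `subjects`
    collection_date     = [Date(2017, 12, 13), Date(2018, 1, 5), Date(2017, 12, 13)],
    collection_location = ["clinic A", "clinic A", "clinic B"],
    sample_data         = [0.42, 0.37, 0.51])
```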
Then to calculate things you join them together:

SELECT Subjects.subject_id, subject_name, gender, sample_data
FROM Subjects
LEFT JOIN Samples ON Subjects.subject_id = Samples.subject_id
WHERE Samples.collection_date = '2017-12-13'
  AND Subjects.diagnosis = 'hypertension';
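For comparison, roughly the same query can be expressed with DataFrames.jl on the toy `subjects` and `samples` frames from the sketch above (just one way to write it, not your exact code):

```julia
# leftjoin + subset, mirroring the SQL LEFT JOIN / WHERE above.
using DataFrames, Dates

joined = leftjoin(subjects, samples; on = :subject_id)

wanted = subset(joined,
    :collection_date => ByRow(d -> !ismissing(d) && d == Date(2017, 12, 13)),
    :diagnosis       => ByRow(==("hypertension")))

select!(wanted, :subject_id, :subject_name, :gender, :sample_data)
```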
My approach would be to normalize everything into relational tables and use a “real” database, or else define custom structs for the problem and just write Julia code. The more I use RDBMSes, the more I feel that whenever I have more than two dataframes in my code, I should switch to Postgres.
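As a rough sketch of what the Postgres route could look like from Julia, assuming the LibPQ.jl and DataFrames.jl packages and a hypothetical database named `biodata` (the query is the same one as above):

```julia
# Run the join in Postgres and pull the result back as a DataFrame.
using LibPQ, DataFrames

conn = LibPQ.Connection("dbname=biodata")

query = """
    SELECT Subjects.subject_id, subject_name, gender, sample_data
    FROM Subjects
    LEFT JOIN Samples ON Subjects.subject_id = Samples.subject_id
    WHERE Samples.collection_date = '2017-12-13'
      AND Subjects.diagnosis = 'hypertension'
"""

result = DataFrame(execute(conn, query))
close(conn)
```

That way the database handles the joins and the deduplication of subject-level metadata, and the dataframe you work with in Julia is just the query result.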