I’m routinely dealing with data + metadata, and I wanted to get some feedback on ideas for how to operationalize this. I’m a biologist, not a programmer, so it’s likely I’m overlooking some obvious solution. I’ll explain the issue and what I typically do, and some rough ideas for what I’d like to be able to do. Here’s a typical situation:
I have some clinical samples that contain a bunch of quantitative features (in my case, generally microbial abundances). The most relevant metadata for each sample is the subject (person) from whom the sample came. Each sample only has one subject, but subjects may have multiple samples. I also have clinical metadata for subjects and samples. Patient-specific metadata includes things like gender and disease diagnosis, while sample-specific metadata includes things like date of collection. Some of this metadata may also be quantitative (eg blood cell counts).
In addition, there are other data products that are associated with the data tables. For example, I may do dimensionality reductions on the relative abundance table, generating things like distance matrices and MDS tables.
The way I do this currently is that I have one dataframe that contains the main quantitative data (samples are columns, features are rows). And then I have a separate dataframe that contains all metadata for each sample, where samples are rows and columns are different metadata. I often have to construct this table piecemeal, duplicating a lot of the subject-specific metadata by mapping samples -> subjects and then subjects -> subject metadata.
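To make the piecemeal construction concrete, here is a minimal sketch in plain Julia of the samples -> subjects -> subject metadata mapping. All IDs, field names, and values are made up for illustration, and the real tables are dataframes rather than named tuples.

```julia
using Dates

# Hypothetical subject-level metadata, keyed by subject ID.
subject_meta = Dict(
    :subject1 => (gender = "F", diagnosis = "A"),
    :subject2 => (gender = "M", diagnosis = "B"),
)

# Hypothetical sample-level metadata; each sample belongs to exactly one subject.
sample_meta = [
    (sample = :sample1, subject = :subject1, date = Date(2018, 1, 5)),
    (sample = :sample2, subject = :subject1, date = Date(2018, 2, 9)),
    (sample = :sample3, subject = :subject2, date = Date(2018, 1, 12)),
]

# One row per sample, with the subject-level fields copied onto every row.
# This copying is the duplication described above.
sample_table = [merge(s, subject_meta[s.subject]) for s in sample_meta]
```

Both `:sample1` and `:sample2` end up carrying their own copy of the gender and diagnosis of `:subject1`.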
Ultimately, the goal is to be able to generate plots and other downstream analyses and easily group samples together based on various metadata. My current solution is a hacked-together set of dictionaries and mapping functions that’s kind of a mess. I started thinking of a way to do this with types and multiple dispatch, but I’m not very creative and my first attempts were basically just embedding dictionaries inside structs, which doesn’t seem like much of an improvement.
What I’d really like to be able to do is have something like:
    struct MetaDatum
        id::Symbol
        kind::Symbol
        value::Any
    end
`id` is a unique ID for the thing the datum is associated with (eg a sample or a subject), and `kind` refers to the type of metadata, eg `:Date`. Then, I’d like a `MetaData` type that I can fill with `MetaDatum`s that’s indexed by `kind`. I’m not quite sure how to do that indexing step, so that’s maybe the first question.
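One possible shape for that indexing step, as a rough sketch rather than a settled design: store the values in a dictionary keyed by `kind` and overload `Base.getindex`. The `bykind` field, the `push!` method, and the example IDs and values are all assumptions made for illustration.

```julia
using Dates

struct MetaDatum
    id::Symbol
    kind::Symbol
    value::Any
end

# Hypothetical container: kind => (id => value).
struct MetaData
    bykind::Dict{Symbol, Dict{Symbol, Any}}
end
MetaData() = MetaData(Dict{Symbol, Dict{Symbol, Any}}())

# File a datum under its kind, creating the inner dictionary on first use.
function Base.push!(md::MetaData, d::MetaDatum)
    inner = get!(md.bykind, d.kind, Dict{Symbol, Any}())
    inner[d.id] = d.value
    return md
end

# md[kind] gives every id => value pair of that kind;
# md[kind, id] gives a single value.
Base.getindex(md::MetaData, kind::Symbol) = md.bykind[kind]
Base.getindex(md::MetaData, kind::Symbol, id::Symbol) = md.bykind[kind][id]

md = MetaData()
push!(md, MetaDatum(:sample1, :Date, Date(2018, 1, 5)))
push!(md, MetaDatum(:sample2, :Date, Date(2018, 2, 9)))
push!(md, MetaDatum(:subject1, :Gender, "F"))

dates = md[:Date]               # Dict with entries for :sample1 and :sample2
gender = md[:Gender, :subject1]
```

This is still a dictionary inside a struct, but the struct now gives `getindex` something to dispatch on, so the lookup rules live in one place.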
But what I’d really like to be able to do is include some linkage information, so that I can for example pull the `:Gender` values for every `sample`, even though the samples themselves have no `:Gender` info. Instead, I’d like to know that every `sample` has a `subject` associated, and the `subject` ID should have `:Gender` metadata. And potentially to go backwards as well, eg if I try to collect `:Date` from subjects, I’d get back an array of `:Date`s that associate with that subject’s `sample`s. The idea would be that the indexing functions would be smart enough to figure this out.
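A rough sketch of how such linkage-aware lookups might work, assuming a single level of sample -> subject links. The names `LinkedMetaData`, `add!`, `link!`, `lookup`, and `collect_from_samples` are hypothetical, as are the example IDs and values.

```julia
using Dates

struct MetaDatum
    id::Symbol
    kind::Symbol
    value::Any
end

# Hypothetical container with linkage information.
struct LinkedMetaData
    store::Dict{Symbol, Dict{Symbol, Any}}   # kind => (id => value)
    parent::Dict{Symbol, Symbol}             # sample id => subject id
end
LinkedMetaData() = LinkedMetaData(Dict{Symbol, Dict{Symbol, Any}}(), Dict{Symbol, Symbol}())

function add!(md::LinkedMetaData, d::MetaDatum)
    get!(md.store, d.kind, Dict{Symbol, Any}())[d.id] = d.value
    return md
end

function link!(md::LinkedMetaData, sample::Symbol, subject::Symbol)
    md.parent[sample] = subject
    return md
end

# Forward lookup: use the value stored for `id` if there is one,
# otherwise fall back to the subject that `id` is linked to.
function lookup(md::LinkedMetaData, kind::Symbol, id::Symbol)
    vals = get(md.store, kind, Dict{Symbol, Any}())
    haskey(vals, id) && return vals[id]
    haskey(md.parent, id) && return lookup(md, kind, md.parent[id])
    return missing
end

# Backward lookup: values of `kind` from every sample linked to `subject`,
# ordered by sample ID so the result is deterministic.
function collect_from_samples(md::LinkedMetaData, kind::Symbol, subject::Symbol)
    vals = get(md.store, kind, Dict{Symbol, Any}())
    samples = sort!([s for (s, p) in md.parent if p == subject])
    return [vals[s] for s in samples if haskey(vals, s)]
end

md = LinkedMetaData()
add!(md, MetaDatum(:subject1, :Gender, "F"))
add!(md, MetaDatum(:sample1, :Date, Date(2018, 1, 5)))
add!(md, MetaDatum(:sample2, :Date, Date(2018, 2, 9)))
link!(md, :sample1, :subject1)
link!(md, :sample2, :subject1)

sample_gender = lookup(md, :Gender, :sample1)                 # resolved through :subject1
subject_dates = collect_from_samples(md, :Date, :subject1)    # dates of both linked samples
```

With this layout the subject-level metadata is stored once and resolved at lookup time instead of being copied onto every sample row.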
I thought about this while working on @sbromberger’s MetaGraphs, and there might be a way to incorporate them, but this seems like a slightly different thing (and yes, I know I’ve still got an issue in that repo with my name on it! Sorry!).