I’m routinely dealing with data + metadata, and I wanted to get some feedback on ideas for how to operationalize this. I’m a biologist, not a programmer, so it’s likely I’m overlooking some obvious solution. I’ll explain the issue and what I typically do, and some rough ideas for what I’d like to be able to do. Here’s a typical situation:
I have some clinical samples that contain a bunch of quantitative features (in my case, generally microbial abundances). The most relevant metadata for each sample is the subject (person) from whom the sample came. Each sample only has one subject, but subjects may have multiple samples. I also have clinical metadata for subjects and samples. Subject-specific metadata includes things like gender and disease diagnosis, while sample-specific metadata includes things like date of collection. Some of this metadata may also be quantitative (eg blood cell counts).
In addition, there are other data products that are associated with the data tables. For example, I may do dimensionality reductions on the relative abundance table, generating things like distance matrices and MDS tables.
The way I do this currently is that I have one dataframe that contains the main quantitative data (samples are columns, features are rows). And then I have a separate dataframe that contains all metadata for each sample, where samples are rows and columns are different metadata. I often have to construct this table piecemeal, duplicating a lot of the subject-specific metadata by mapping samples → subjects and then subjects → subject metadata.
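To make the piecemeal construction concrete, here is a minimal sketch of the two-step mapping (samples → subjects → subject metadata) using only Base Julia. All IDs, field names, and values are made up for illustration; they are not my real data.

```julia
# Hypothetical example data: every ID and value here is invented.
sample_subject = Dict(:sample1 => :subject1,
                      :sample2 => :subject1,
                      :sample3 => :subject2)

subject_meta = Dict(:subject1 => (gender = "F", diagnosis = "CD"),
                    :subject2 => (gender = "M", diagnosis = "healthy"))

sample_meta = Dict(:sample1 => (date = "2018-01-15",),
                   :sample2 => (date = "2018-03-02",),
                   :sample3 => (date = "2018-01-20",))

# One row per sample: sample-level fields merged with the subject-level
# fields, which get duplicated for every sample from the same subject.
rows = [merge((sample = s, subject = sample_subject[s]),
              sample_meta[s],
              subject_meta[sample_subject[s]])
        for s in sort!(collect(keys(sample_subject)))]

rows[2].gender  # "F", copied from :subject1
```

The result is a plain vector of named tuples with samples as rows and metadata as columns, which is the shape of the table described above.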
Ultimately, the goal is to be able to generate plots and other downstream analyses and easily group samples together based on various metadata. My current solution is a hacked-together set of dictionaries and mapping functions that’s kind of a mess. I started thinking of a way to do this with types and multiple dispatch, but I’m not very creative and my first attempts were basically just embedding dictionaries inside structs, which doesn’t seem like much of an improvement.
What I’d really like to be able to do is have something like:
    struct MetaDatum
        id::Symbol
        kind::Symbol
        value::Any
    end
Where id is a unique ID for the thing the metadata belongs to (eg a subject or a sample) and kind refers to the type of metadata, eg :Gender or :Date. Then, I’d like a MetaData type that I can fill with MetaDatums and that’s indexed by id and kind. I’m not quite sure how to do that indexing step, so that’s maybe the first question.
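For the indexing step, one possible approach (a sketch, not necessarily the best design) is to key a dictionary on the (id, kind) pair and add a getindex method. The MetaData layout, the push! method, and the example IDs are all assumptions made for illustration.

```julia
# Sketch only: this MetaData container is hypothetical, not an existing package.
struct MetaDatum
    id::Symbol
    kind::Symbol
    value::Any
end

# Values are stored in a Dict keyed by the (id, kind) pair.
struct MetaData
    data::Dict{Tuple{Symbol,Symbol},Any}
end

MetaData() = MetaData(Dict{Tuple{Symbol,Symbol},Any}())

# Add a MetaDatum; an existing entry with the same (id, kind) is overwritten.
function Base.push!(md::MetaData, d::MetaDatum)
    md.data[(d.id, d.kind)] = d.value
    return md
end

# md[id, kind] returns the stored value and throws a KeyError if it is absent.
Base.getindex(md::MetaData, id::Symbol, kind::Symbol) = md.data[(id, kind)]

# Check for an entry without throwing.
Base.haskey(md::MetaData, id::Symbol, kind::Symbol) = haskey(md.data, (id, kind))

md = MetaData()
push!(md, MetaDatum(:subject1, :Gender, "F"))
push!(md, MetaDatum(:sample1, :Date, "2018-01-15"))

md[:subject1, :Gender]  # "F"
```

This is still a dictionary inside a struct, but the struct gives multiple dispatch something to hang the indexing behavior on.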
But what I’d really like to be able to do is include some linkage information, so that I can for example pull the :Gender values for every sample, even though there are no sample MetaDatums with :Gender info. Instead, I’d like it to know that every sample has a subject associated with it, and that the subject ID should have :Gender metadata. And potentially to go backwards as well - eg if I try to collect :Date from subjects, I’d get back an array of the :Date values that are associated with that subject’s samples. The idea would be that the indexing functions would be smart enough to figure this out.
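Here is a rough sketch of what that linkage-aware lookup might look like, assuming a single level of linkage (each sample points to exactly one subject). The names LinkedMetaData, addmeta!, link!, children, and lookup are hypothetical, as are the example IDs and dates.

```julia
using Dates

# Hypothetical sketch: none of these names come from an existing package.
struct LinkedMetaData
    data::Dict{Tuple{Symbol,Symbol},Any}   # (id, kind) => value
    parent::Dict{Symbol,Symbol}            # sample id => subject id
end

LinkedMetaData() = LinkedMetaData(Dict{Tuple{Symbol,Symbol},Any}(),
                                  Dict{Symbol,Symbol}())

# Record a value for an id.
addmeta!(md::LinkedMetaData, id::Symbol, kind::Symbol, value) =
    (md.data[(id, kind)] = value; md)

# Record that a sample belongs to a subject.
link!(md::LinkedMetaData, sample::Symbol, subject::Symbol) =
    (md.parent[sample] = subject; md)

# All samples linked to a subject, sorted so the order is stable.
children(md::LinkedMetaData, subject::Symbol) =
    sort!([s for (s, p) in md.parent if p == subject])

function lookup(md::LinkedMetaData, id::Symbol, kind::Symbol)
    # 1. Direct hit on the id itself.
    haskey(md.data, (id, kind)) && return md.data[(id, kind)]
    # 2. Walk up: a sample falls back to its subject's value.
    if haskey(md.parent, id)
        subject = md.parent[id]
        haskey(md.data, (subject, kind)) && return md.data[(subject, kind)]
        throw(KeyError((id, kind)))
    end
    # 3. Walk down: a subject collects the values of its samples.
    vals = [md.data[(s, kind)] for s in children(md, id)
            if haskey(md.data, (s, kind))]
    isempty(vals) && throw(KeyError((id, kind)))
    return vals
end

md = LinkedMetaData()
addmeta!(md, :subject1, :Gender, "F")
link!(md, :sample1, :subject1)
link!(md, :sample2, :subject1)
addmeta!(md, :sample1, :Date, Date(2018, 1, 15))
addmeta!(md, :sample2, :Date, Date(2018, 3, 2))

lookup(md, :sample1, :Gender)  # "F", taken from :subject1
lookup(md, :subject1, :Date)   # the dates of :sample1 and :sample2
```

One design question this leaves open is whether going backwards should always return an array, even when a subject has only one sample; the sketch always does.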
I thought about this while working on @anon94023334’s MetaGraphs, and there might be a way to incorporate them, but this seems like a slightly different thing (and yes, I know I’ve still got an issue in that repo with my name on it! Sorry!).