A type for metadata?

I’m routinely dealing with data + metadata, and I wanted to get some feedback on ideas for how to operationalize this. I’m a biologist, not a programmer, so it’s likely I’m overlooking some obvious solution. I’ll explain the issue, what I typically do, and some rough ideas for what I’d like to be able to do. Here’s a typical situation:

I have some clinical samples that each contain a bunch of quantitative features (in my case, generally microbial abundances). The most relevant metadata for each sample is the subject (person) from whom the sample came. Each sample has only one subject, but subjects may have multiple samples. I also have clinical metadata for subjects and samples. Subject-specific metadata includes things like gender and disease diagnosis, while sample-specific metadata includes things like date of collection. Some of this metadata may also be quantitative (e.g. blood cell counts).

In addition, there are other data products that are associated with the data tables. For example, I may do dimensionality reductions on the relative abundance table, generating things like distance matrices and MDS tables.

The way I do this currently is that I have one dataframe that contains the main quantitative data (samples are columns, features are rows). And then I have a separate dataframe that contains all metadata for each sample, where samples are rows and columns are different metadata. I often have to construct this table piecemeal, duplicating a lot of the subject-specific metadata by mapping samples → subjects and then subjects → subject metadata.

Ultimately, the goal is to be able to generate plots and other downstream analyses and easily group samples together based on various metadata. My current solution is a hacked-together set of dictionaries and mapping functions that’s kind of a mess. I started thinking of a way to do this with types and multiple dispatch, but I’m not very creative, and my first attempts were basically just embedding dictionaries inside structs, which doesn’t seem like much of an improvement.

What I’d really like to be able to do is have something like:

struct MetaDatum
    id::Symbol
    kind::Symbol
    value::Any
end

Where id is a unique ID that associates the datum with something (e.g. a subject or sample) and kind refers to the type of metadata, e.g. :Gender or :Date. Then, I’d like a MetaData type that I can fill with MetaDatums, indexed by id and kind. I’m not quite sure how to do that indexing step, so that’s maybe the first question.

But what I’d really like to be able to do is include some linkage information, so that I can for example pull the :Gender values for every sample, even though there are no sample MetaDatums with :Gender info. Instead, I’d like to know that every sample has a subject associated, and the subject ID should have :Gender metadata. And potentially to go backwards as well: e.g. if I try to collect :Date from subjects, I’d get back an array of :Dates that associate with that subject’s samples. The idea would be that the indexing functions would be smart enough to figure this out.
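A rough sketch of what I’m imagining, with the MetaDatum struct repeated so the snippet is self-contained. All names are hypothetical, and the parent mapping is just one naive way to encode the sample → subject linkage:

```julia
struct MetaDatum
    id::Symbol      # e.g. :subject1 or :sample1
    kind::Symbol    # e.g. :Gender or :Date
    value::Any
end

struct MetaData
    data::Dict{Tuple{Symbol,Symbol},Any}  # (id, kind) => value
    parent::Dict{Symbol,Symbol}           # sample id => subject id
end

MetaData() = MetaData(Dict{Tuple{Symbol,Symbol},Any}(), Dict{Symbol,Symbol}())

function Base.push!(md::MetaData, d::MetaDatum)
    md.data[(d.id, d.kind)] = d.value
    return md
end

# Indexing that falls back to the linked subject when a sample
# has no entry of the requested kind.
function Base.getindex(md::MetaData, id::Symbol, kind::Symbol)
    haskey(md.data, (id, kind)) && return md.data[(id, kind)]
    haskey(md.parent, id) && return md[md.parent[id], kind]
    throw(KeyError((id, kind)))
end

md = MetaData()
push!(md, MetaDatum(:subject1, :Gender, :female))
push!(md, MetaDatum(:sample1, :Date, "2017-12-13"))
md.parent[:sample1] = :subject1

md[:sample1, :Gender]  # :female, found via the sample → subject link
```

Going backwards (subject → dates of all its samples) would need the inverse mapping as well, which is where it starts to feel like a graph problem.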

I thought about this while working on @anon94023334’s MetaGraphs, and there might be a way to incorporate them, but this seems like a slightly different thing (and yes, I know I’ve still got an issue in that repo with my name on it! Sorry!).

If it exists for each sample, I would hazard that it is data, not metadata.

Such redundancy is part of the normal workflow for tidy data. Is your workflow similar to split-apply-combine?


This makes some sense, though I’m referring to data about the sample, rather than data measured in the sample, e.g. the date of collection. I think of this as metadata.

Not really… At least, I don’t think so. Generally speaking I’m doing most data operations with the sample x feature dataframe. Then, when plotting, I’m pulling stuff from the metadata to generate colors and groupings etc. In principle, I could construct yet another table that has various coordinates and metadata, and then use split-apply-combine there, but that’s sort of what I’m trying to avoid.

I forgot to mention also that I’d like to handle meta-metadata (eg color mappings to particular diagnoses).

I know there are ways to do all of this, and it’s something that a lot of data scientists deal with, but it seems like there should be a better interface…

One general comment is that you shouldn’t necessarily assume that you must mangle your data into a tabular format. Julia makes it very easy to just build a bunch of structs and then arrange them hierarchically in whatever you deem to be the appropriate schema. This is how I used to do things in C++ when I was doing high energy physics. At least in the sector of private industry in which I now work as a data scientist, the attitude seems to be that non-tabular formats are some kind of unholy abomination, and I really don’t understand why that is. Probably because of the presence of lots of business people who can write SQL queries, but won’t write any other type of code (which always seemed really strange to me since their SQL queries tend to be waaay more complicated than any Python code they might write in their place).

I’m currently working on a project that involves the use of JuMP for some “large scale” (i.e. lots of variables) quadratic optimization, and I had been trying hard to find ways of ripping data directly out of dataframes and using them to create arrays which appear as constants in my optimization problem. This had a tendency to get really ugly because of all the relational database operations involved. Now I have a Julia code abstraction of the type of problem I’m working on, and I’m much happier for it. It’s only a few hundred lines of code, and for the most part the functions that generate the structures I need make much more sense than they ever did when I was using only dataframes. I still have to go through the agony of mangling the tabular formats that I’m provided with into the appropriate form to put into my structs, but I had to do the data manipulation regardless. Once it gets into my structs everything is elegant and easy to follow.

So, you might consider looking at your tabular format as more of an IO device and write a little bit of code for your actual problem, especially if it can be generalized.


:100: This is what I’m trying to get away from. Everything is read in as tables: all of the software I use upstream generates tsv/csv outputs, and the clinicians only use Excel (I sent one a csv and he didn’t know how to open it). But I’m hoping I can have some sort of structure where I can just do retrieve(my_metadata, type; ids=[s1, s2, s3]) and have it find all of the associated values intelligently.
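Something like this rough sketch, say, where retrieve and its keywords are hypothetical, and the metadata is assumed to be stored as (id, kind) => value pairs plus a sample → subject link table for fallback lookups:

```julia
# Hypothetical retrieve: look up `kind` for each id, falling back to
# the linked (e.g. subject) id when the sample has no direct entry.
function retrieve(md::Dict, kind::Symbol; ids, links = Dict())
    [haskey(md, (id, kind)) ? md[(id, kind)] : md[(links[id], kind)]
     for id in ids]
end

md = Dict((:subj1, :Gender) => :female, (:s1, :Date) => "2017-12-13")
links = Dict(:s1 => :subj1, :s2 => :subj1)

retrieve(md, :Gender; ids = [:s1, :s2], links = links)  # [:female, :female]
```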

Definitely - this is what I’m trying to reason about. As I mentioned, my first-pass solution is a bunch of nested dictionaries, but this gets out of hand fast. I suppose this question is about trying to reason about my problem in a way that can be translated to code - I’m still not quite sure how to do that. I definitely think it’s likely to be generalizable though.

In that case my advice at this point would basically just be “do what you think is right” and go from there. You might just have to live with the fact that you’ll have to write some code to get in and out of tabular formats, and that that may be the most painful part of your code. There is no completely general way to structure data as a programming abstraction; you’ll have to decide on what code makes sense for your particular problem. But Julia is a highly “productive” language, so just going ahead and writing the code using your own structs and functions, guided by your own good judgment, is likely to be a worthwhile endeavor. There’s also nothing stopping you from using all the tools at your disposal: DataFrames are wonderfully lightweight if you need to keep them around, you of course have AbstractDict and AbstractArray, SparseArrays and StaticArrays, and it sounds like LightGraphs might be relevant for your case. You might have a struct containing a graph node and a dataframe or anything else; go nuts!


I’ve just come across another possible use case for something like this. I have a df with a lot of columns. I’d like to apply some transformation to each column, but which particular transformation to use depends on the “type” of the column (not in the Julia type-system sense). If there were some metadata available along with the DataFrame, I could use logic on that metadata to decide what to do with each column.

Specifically, I have a dataset with many series. Some of the series are in levels, some are already in growth rates. I’d like to get the entire thing to growth rates, but how to compute the growth rates also may differ by column. Besides building a second DataFrame or Dict with the proper flag, I’m not sure that there is a nice way to do this. But a metadata type that was linked to the underlying data would take care of it.


I think I would have tried to wrap the values in a Rates type or a Levels type, and write methods for both. You can have identity for the method for Rates since it’s already a rate, and something real for the Levels one.
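For example (the type and function names here are hypothetical):

```julia
# Tag each series with a wrapper type and let dispatch pick the
# transformation.
struct Rates
    values::Vector{Float64}
end

struct Levels
    values::Vector{Float64}
end

growthrates(x::Rates) = x.values                              # already rates
growthrates(x::Levels) = diff(x.values) ./ x.values[1:end-1]  # percent change

growthrates(Levels([100.0, 110.0]))  # [0.1]
```

(Note the identity method keeps the series at full length while the Levels method drops the first observation; how to align those is its own decision.)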


This is the approach I would use, but simply map each column name to a closure instead of a flag.
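E.g. something like this, where the column names and the pct_change helper are made-up examples, and a Dict of vectors stands in for the DataFrame:

```julia
pct_change(v) = diff(v) ./ v[1:end-1]

# One closure per column name.
transforms = Dict(
    :gdp     => pct_change,  # in levels: convert to growth rates
    :cpi_yoy => identity,    # already a growth rate: leave alone
)

cols = Dict(:gdp => [100.0, 102.0], :cpi_yoy => [0.02, 0.03])

# Apply the right transform to each column.
out = Dict(col => transforms[col](v) for (col, v) in cols)

out[:gdp]  # [0.02]
```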

I find this much more transparent than a metadata framework. IMO the problem with the latter is that it becomes a kitchen sink for unstructured data people did not want to think too much about. This happened to R to a certain extent.

It seems reasonable to associate some metadata with columns in a DataFrame. Having to track it in a separate Dict could backfire because you would have to maintain consistency between two separate data structures.

Consider the case of reading data from a relational database. Each column has metadata like data type, length, precision, allow nulls, etc. Often these are good information that is very handy when processing query results.

Something like this would be fairly easy to implement in DataFrames?

putmeta!(df, :column1, @NT(kind = :level))
putmeta!(df, :column2, @NT(kind = :rate))
getmeta(df, :column1)  # returns @NT(kind = :level)
colswith(df, t -> t.kind == :level)  # returns an iterator over :level columns
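A minimal stand-in for that proposed API, assuming the column metadata lives in a plain Dict next to the df (with the sync caveat mentioned upthread); putmeta!, getmeta and colswith are the hypothetical names from the post:

```julia
const ColMeta = Dict{Symbol,NamedTuple}

putmeta!(meta::ColMeta, col::Symbol, nt::NamedTuple) = (meta[col] = nt; meta)
getmeta(meta::ColMeta, col::Symbol) = meta[col]
colswith(meta::ColMeta, pred) = [col for (col, nt) in meta if pred(nt)]

meta = ColMeta()
putmeta!(meta, :column1, (kind = :level,))
putmeta!(meta, :column2, (kind = :rate,))

getmeta(meta, :column1)              # (kind = :level,)
colswith(meta, t -> t.kind == :level)  # [:column1]
```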

Thinking more about it (especially as I have a similar use case: many data points per subject and some survey questions about each subject), it seems to me that a custom AbstractArray with a getproperty overload gets you a long way. For example, with something like MetadataArrays you could create a MetadataVector the length of your DataFrame containing all your subjects, repeated, and a smaller DataFrame with their info:

v = MetadataArray(["Luke", "Luke", "Jane", "Jane"], Dict("Luke" => (age = 22, gender = :male),  "Jane" => (age = 74, gender = :female)))

and define getproperty on this class of MetadataArrays such that v.age[i] would give you the age of the person at row i.
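For instance, here is a hand-rolled sketch of the mechanics (not the actual MetadataArrays API; SubjectVector is a hypothetical name):

```julia
struct SubjectVector{T,D} <: AbstractVector{T}
    values::Vector{T}  # one subject per row
    info::D            # per-subject metadata, e.g. a Dict of NamedTuples
end

# Minimal AbstractVector interface; use getfield internally since
# getproperty is overloaded below.
Base.size(v::SubjectVector) = size(getfield(v, :values))
Base.getindex(v::SubjectVector, i::Int) = getfield(v, :values)[i]

# v.age[i] gives the age of the subject at row i.
function Base.getproperty(v::SubjectVector, f::Symbol)
    info = getfield(v, :info)
    return [getproperty(info[s], f) for s in getfield(v, :values)]
end

v = SubjectVector(["Luke", "Luke", "Jane", "Jane"],
                  Dict("Luke" => (age = 22, gender = :male),
                       "Jane" => (age = 74, gender = :female)))

v.age  # [22, 22, 74, 74]
```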

I have a similar use case with recordings (of, say, heart rate) during a session where a subject is doing many trials of some task. The Array of recordings is session specific but quite big and I shouldn’t insert it at every row, instead I think I should keep in the metadata a “session to recordings array” dictionary.

I’ll play a bit more with MetadataArrays and see if I can get these two use cases covered, and will update here when I have more details on the design.


The way I do this currently is that I have one dataframe that contains the main quantitative data (samples are columns, features are rows). And then I have a separate dataframe that contains all metadata for each sample, where samples are rows and columns are different metadata. I often have to construct this table piecemeal, duplicating a lot of the subject-specific metadata by mapping samples -> subjects and then subjects -> subject metadata.

This sounds a lot like what databases are designed for. You have multiple tables with foreign keys.

  1. Subjects table: subject_id, subject_name, gender, diagnosis
  2. Samples table: sample_id, subject_id, collection_date, collection_location, sample_data

Then to calculate things you need to join these together

select Subjects.subject_id, subject_name, gender, sample_data from Subjects
LEFT JOIN Samples on Subjects.subject_id = Samples.subject_id
where Samples.collection_date = '2017-12-13' and Subjects.diagnosis = 'hypertension'
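For comparison, a dependency-free Julia sketch of what that query computes (an inner join here, for simplicity), using vectors of NamedTuples as tables; with DataFrames.jl this would essentially be leftjoin(samples, subjects; on = :subject_id) followed by a filter:

```julia
subjects = [(subject_id = 1, subject_name = "A", gender = :f,
             diagnosis = "hypertension")]
samples  = [(sample_id = 10, subject_id = 1, collection_date = "2017-12-13"),
            (sample_id = 11, subject_id = 1, collection_date = "2018-01-05")]

# Index subjects by their key, then merge each sample row with its subject.
bysubj = Dict(s.subject_id => s for s in subjects)
joined = [merge(bysubj[sm.subject_id], sm)
          for sm in samples if haskey(bysubj, sm.subject_id)]

# The WHERE clause.
hits = [r for r in joined if r.collection_date == "2017-12-13" &&
                             r.diagnosis == "hypertension"]
```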

My approach would be to normalize everything into relational tables and use a “real” database, or make custom structs for my problem and just write Julia code. The more I use RDBMSes the more I feel like whenever I have more than two dataframes in my code, I should switch to Postgres.