How to add metadata info to a DataFrame?

Fair enough! :wink:

Moreover, in both cases we deal with metadata, but the implementation details can be quite different.
Hence I propose to clearly state whether a given comment/suggestion/proposal aims to deal with column metadata, global metadata, or both.

Mine deals with both, but is admittedly limited to DataFrame objects.

I’m concerned that having the metadata attached to the columns themselves is not generic: it only makes sense for a column-major store (and most databases store rows, except for a few such as Vertica).

For data coming from a database, you’d have metadata about the database, the tables in the database, and the columns in each table (which you’d get back as a vector, with the type information, labels etc. about each column in the rows).

If you do want to select some subset of the columns, it’s very easy in Julia to index that vector of metadata with the range or a vector of indexes to get that subset, just as you’d do with the columns themselves.
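Just to illustrate (a minimal sketch, with made-up names), the column metadata could simply live in a vector parallel to the columns, and the same indices select both:

# hypothetical parallel vectors: one with the columns, one with their metadata
cols    = [rand(5), rand(Bool, 5), string.(1:5)]
colmeta = [Dict("label" => "height"), Dict("label" => "smoker"), Dict("label" => "id")]

idx     = [1, 3]        # pick a subset of columns...
subcols = cols[idx]     # ...the data
submeta = colmeta[idx]  # ...and the corresponding metadata, with the same indexing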

Maybe what’s needed is an “AbstractTable” (or is there already one?) with an API for metadata,
not limited to DataFrames, handling both table- and column-level metadata.
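Just to fix ideas, a minimal sketch of what such a type and API could look like (all names here are hypothetical):

# a hypothetical wrapper carrying both table-level and column-level metadata
struct MetaTable{T}
    data::T                                  # any table-like object
    tablemeta::Dict{String,Any}              # table-level metadata
    colmeta::Dict{Symbol,Dict{String,Any}}   # per-column metadata
end

tablemeta(t::MetaTable) = t.tablemeta
colmeta(t::MetaTable, col::Symbol) = get(t.colmeta, col, Dict{String,Any}())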

Having a delegation mechanism like:

mutable struct MetadataContainer{T}
  meta::Dict
  data::T
end

# add metadata to DataFrame
@delegate DataFrame MetadataContainer{DataFrame} ".data"

# add metadata to arrays
@delegate AbstractArray MetadataContainer{AbstractArray} ".data"

# add metadata to Int numbers
@delegate Int MetadataContainer{Int} ".data"

would be the best choice. After all, it’s just a matter of adding appropriate entries in the dispatch table, i.e.:

showcols(p1::MetadataContainer{DataFrame}) = showcols(p1.data)

This can be done through TypedDelegation, although all methods must be listed one by one.

A macro which automatically does this job, based on the methods returned by methodswith, would be really useful. Does something similar already exist? Or will core Julia ever support delegation?

Sorry but I don’t follow. Storage is orthogonal to the concept of “columns” (values with a homogeneous type along one coordinate). Column-specific metadata would make sense even if one stored rows as tuples.

Yes, that would be nice, but currently no such thing exists AFAIK.

Note, however, that it wouldn’t magically solve all problems: if you want getindex to preserve meta-data, you need to implement custom methods which delegate to the wrapped vector/data frame, and rewrap the result in a MetadataContainer. But at least by default all methods would work (and discard meta-data).
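For instance, reusing the MetadataContainer sketched above, preserving metadata through indexing could look roughly like this (the whole metadata Dict is simply carried over unchanged):

# rewrap the result of indexing so the metadata is not lost
Base.getindex(mc::MetadataContainer, inds...) =
    MetadataContainer(mc.meta, getindex(mc.data, inds...))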

Let me be concrete here. I work with human biological data, specifically microbiomes. A typical dataset is a series (on the order of hundreds) of samples, each of which has thousands to tens of thousands of measurements (the relative abundances of various microbial species). The way the data is currently structured is as a sparse matrix with samples as columns and microbial features as rows. E.g.:

using DataFrames
taxa = DataFrame(species=["species_$x" for x in 1:10],
                    sample1=rand(10),
                    sample2=rand(10),
                    sample3=rand(10))

Many of these samples also have a ton of relevant clinical metadata associated with them, such as patient age, disease diagnosis, medication history etc. Usually, this comes to me in a separate table where rows are samples and columns are the type of information:

metadata = DataFrame(sample=["sample1", "sample2", "sample3"],
                    age_in_years=[30, 25, 57],
                    diagnosis=["Healthy Control", "RA", missing],
                    antibiotics=[true, false, missing])

The metadata is often incomplete: I only have certain information for certain samples, and some samples don’t have any metadata at all. Some of my analyses don’t require any metadata, or only require some of it, so I often have a complex series of dicts and comprehensions to subset data in various ways. This throws the order of things out of whack (I’ve been bitten by not keeping track of the order of arrays), so I often use dicts of dicts etc., but I have to deal with missings all the time and it’s an utter mess. I’d say coping with this stuff is between 70 and 85% of my analysis time.

So to be clear and restate things: I need column-level metadata. The implementation details (whether it’s stored with the DataFrame or attached to the vectors themselves) matter less to me, but I would need the metadata to follow columns in subsetting and views (I’ve recently started mixing in Query, but I mostly do it with indexing and view()).

Another wrinkle to consider: I rather like the idea of tying this stuff to a generic table implementation, since in SpatialEcology (which my package is now based on) we use CommMatrixes, which are wrappers around sparse matrices with special functions associated with them. Having a way to use metadata generically across many types of data representations (including DataTables, SparseMatrices… basically everything supported by IterableTables) would be really lovely (though I’m sure it’s also a lot of work; it would save me a heap of time, but I don’t have the time to implement it well).

Yeah, this would be amazing. @mkborregaard wrote a macro in SpatialEcology that seems similar to the type delegation you linked to, but also requires explicitly pointing out the methods you want forwarded.

I suspect it’s slightly (or much) more complicated because methods that operate on multiple arguments also need to be included.

I am not sure this is metadata; this should fit nicely in dataframes that describe various levels of the experiment, and should be amenable to formatting to “tidy data” using join. Am I missing something?

You brought this up in the thread that I started. Before I answer, I’ll admit that it’s entirely possible I’m thinking about this completely wrong, and have the wrong mental model of what you’re proposing. I’m relatively new to this sort of thing, so I’m definitely open to being educated. That said:

Wouldn’t join rely on the main data having samples as rows and features as columns to match the metadata? I do this sometimes, but it often results in DataFrames with tens of thousands of columns (and only a few hundred rows). One of the reasons I’ve found this problematic is that I often need to do calculations on features based on each sample.

One common example: in my sample data, each microbe has a count, and I want to convert that to relative abundance (the count over the sum of counts of all species). This is a within-sample property: if samples are rows in a table along with (what I’m calling) sample metadata, I have to first select the columns that are my microbial species (this is complicated now but might be helped by having column labels), then convert the DataFrame to a matrix (since I need to do calculations on rows, and things like sum are not defined for rows of DataFrames), then do the calculation (and I gather that taking e.g. the sum of a row vector is less efficient than for a column vector).
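(For comparison: with samples as columns, as in the taxa example above, the within-sample calculation is straightforward. A rough sketch, using current DataFrames indexing syntax:)

# counts → relative abundances, one sample (column) at a time
for s in [:sample1, :sample2, :sample3]
    taxa[!, s] = taxa[!, s] ./ sum(taxa[!, s])
end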

Of course, I can just hold on to the original table where samples remain as columns, but then we’re back to the same problem. Another solution would be to just do the calculations on each sample while they’re in columns before combining them into the sample-as-row table, but I often need different calculations for different subsets of the samples depending on the patient data.

The issue isn’t column-specific metadata; I was talking about the approach discussed here of attaching the metadata to the column itself. If the columns are not separate, what would you attach the column’s metadata to?

Looks like a possible storage format for your data would be to have one row for each species × sample combination, with one column for the species name, one for the sample ID, one for the abundance value, and metadata as additional columns. This kind of data organization is relatively convenient to work with: you can easily compute sums by group using groupby or select subsets of rows. Actually I think it’s the format dplyr and tidyr recommend.
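With the taxa table from above, that reshaping and a grouped sum could look roughly like this (exact stack/combine signatures may vary with the DataFrames version):

# one row per species × sample combination ("long" format)
long = stack(taxa, Not(:species), variable_name = :sample, value_name = :abundance)

# e.g. within-sample totals, as needed for the relative-abundance calculation above
totals = combine(groupby(long, :sample), :abundance => sum => :total)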

I’ve been lurking this thread a bit, and as a biologist who has used Julia to deal with (albeit much smaller) data sets, I can sympathize with @kevbonham and @gcalderone. Nevertheless, one useful way of thinking about metadata is that it is data that has not yet been parsed and added to the “real data”, and may contain many many missing values. So while it is convenient (for me) to momentarily keep some data untouched because it is not clear how exactly it should be parsed at that specific stage of the analysis AND because it may be just a ton of missings, there are often many ways to add it to the (in this case) DataFrame such that it retains all of its properties (e.g. a chunk of text as a String).

This seems safe, but if you think about it, lots of operations like select silently copy the dataframe, right? Having metadata be persistent (as long as the variable stays in the dataframe, its Dict entry stays) would be ideal for me.

Interesting. This is definitely doable, it just seems like a huge amount of data duplication: e.g. I’ll typically have hundreds of groups of thousands or tens of thousands of rows that are identical in all but one column. This format might be tidy but hardly seems efficient.

The visual way I think of my data is as perpendicular planes where the shared edge is the samples. This is like filling in the cube and then flattening it…

I am fascinated by the delegation approach for two reasons:

  • the implementation is conceptually very easy (although practically quite difficult);
  • Julia fosters composition of data structures, in place of inheritance.

Hence, I created a composite structure as follows:

using DataFrames
const MetadataDict = Dict{Symbol,Any}  # assumed here; not defined in the post

mutable struct DataFrame_Metadata <: AbstractDataFrame
    meta::MetadataDict
    data::DataFrame
end

and asked myself what I should do to use the new structure in place of DataFrame while maintaining exactly the same syntax. In other words, I want a DataFrame_Metadata object to behave exactly as a DataFrame object.

In an OOP language this is straightforward, but since I am now in love with Julia I want to solve this problem in the Julian way.

The steps to be performed are:

  1. redirect all access to a DataFrame object to a field of the DataFrame_Metadata structure by re-defining all the methods accepting a DataFrame object;
  2. tweak these methods to propagate the metadata through DataFrame copies/slices/views;
  3. add methods to access the metadata.

As I said, step 1 is conceptually very easy, but a quick look with methodswith shows that I need to re-define 226 methods!!! Too much for my poor fingers, hence I wrote a program which uses the output from methodswith(DataFrame) to generate all the relevant method definitions.
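The generation step can be sketched roughly as follows (a simplification that ignores module qualification, keyword-only methods, and methods where the DataFrame is not the first argument):

using DataFrames, InteractiveUtils

# print a catch-all forwarding definition for every function that has a method
# taking a DataFrame; the actual generated code needs to be much more careful
for fname in unique(m.name for m in methodswith(DataFrame))
    println("$(fname)(x::DataFrame_Metadata, args...; kw...) = $(fname)(x.data, args...; kw...)")
end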

If you’re curious, this is the output: https://drive.google.com/file/d/1RW4VpkbYsjbiIzuETJ_0Q0lHio-7cuC7/view?usp=sharing

If you want to test it you can simply download it, include it, and use a DataFrame_Metadata object in exactly the same way you would use a DataFrame one. A few simple tests show that it behaves correctly.

So far I have only implemented step 1. Step 2 would be much more demanding since there is no simple way to automate it, hence I will need to look at all 226 methods. Finally, step 3 is very easy.

My conclusions for this experiment:

  • step 1 can be automated, hence I believe it could be a nice feature to implement in post-v1.0 versions of Julia;
  • with a single composition level I had to add 226 methods, and the number will quickly explode as soon as new levels are added. For instance, I could define new structures encapsulating the DataFrame_Metadata one, specifically designed for astronomy or biology;

Given the above, I am no longer sure that the Julian way (i.e. composition over inheritance) is appropriate to solve this problem, and maybe we have hit a serious limit here. I am likely wrong, but I would appreciate it if someone more expert than me could discuss how to solve this problem.

Thanks!

As I mentioned above, having a MetaDataFrame wrapper for a DataFrame with metadata means that if someone defines a new method for a DataFrame, either the person who writes the new method or the maintainer of MetaDataFrame has to add that method to MetaDataFrame. Without a Julian class inheritance system (and I have no idea what the prospects for one are), the ultimate result is that people wanting metadata will only be able to use a subset of the features that other users can.

On the other hand, “tweak these methods to propagate the metadata through DataFrame copies/slices/views” doesn’t bother me that much. I am not sure how automated I imagine adding metadata to be, as I would probably want to add the notes manually.

Yes, “tidy data” is often very redundant. Do you have so much data that this is a concern for you in practice?

Ideally, abstract types should have a well-defined interface, i.e. a collection of methods. Adding a method changes the interface, and should be an event rare enough to keep up with (it definitely warrants a bump in the minor version, so it can be caught). Otherwise, users who write methods build on the existing interface.

A thin wrapper for DataFrames metadata is certainly better than a new vector type. However, I would still like metadata in DataFrames, because I really do think it is a fundamentally useful feature and once implemented it will be widely used. I can’t really imagine working with a dataset and not wanting to label variables for ease of use. Otherwise I just get lost.