How to add metadata info to a DataFrame?

You brought this up in the thread that I started. Before I answer, I’ll admit that it’s entirely possible I’m thinking about this completely wrong, and have the wrong mental model of what you’re proposing. I’m relatively new to this sort of thing, so I’m definitely open to being educated. That said:

Wouldn’t join rely on the main data having samples as rows and features as columns to match the metadata? I do this sometimes, but it often results in DataFrames that are tens of thousands of columns (with only a few hundred rows). One of the reasons I’ve found this problematic is that I often need to do calculations on features based on each sample.

One common example: in my sample data, each microbe has a count, and I want to convert that to relative abundance (the count divided by the sum of counts of all species). This is a within-sample property. If samples are rows in a table along with (what I’m calling) sample metadata, I have to first select the columns that are my microbial species (this is complicated now but might be helped by having column labels), then convert the DataFrame to a matrix (since I need to do calculations on rows, and things like sum are not defined for rows of DataFrames), then do the calculation (and I gather that taking e.g. the sum across a row is less efficient than down a column).
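To make that concrete, with samples as rows what I end up doing looks something like this (just a sketch; the metadata column names here are made up):

# Hypothetical wide table: one row per sample, a few metadata columns plus one column per species.
species_cols = setdiff(names(wide), [:sample, :age, :diagnosis])  # drop the metadata columns
m = Matrix(wide[:, species_cols])     # rows = samples, columns = species
relab = m ./ sum(m, dims=2)           # within-sample (row-wise) relative abundance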

Of course, I can just hold on to the original table where samples remain as columns, but then we’re back to the same problem. Another solution would be to just do the calculations on each sample while they’re in columns before combining them into the sample-as-row table, but I often need different calculations for different subsets of the samples depending on the patient data.

The issue isn’t column-specific metadata; I was talking about the approach discussed here, of attaching the metadata to the column itself. If the columns are not separate, what would you attach the column’s metadata to?

Looks like a possible storage format for your data would be to have one row for each species × sample combination, with one column for the species name, one for the sample ID, and metadata as additional columns. This kind of data organization is relatively convenient to work with: you can easily compute sums by species using groupby, or select subsets of rows. Actually, I think it’s the format dplyr and tidyr recommend.
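Something like this, for instance (names and values purely for illustration):

# Long ("tidy") layout: one row per species × sample pair, metadata as extra columns.
using DataFrames
long = DataFrame(species   = ["A", "A", "B", "B"],
                 sample    = ["s1", "s2", "s1", "s2"],
                 abundance = [10, 4, 2, 8])

# Per-species totals via grouping, without any row-wise matrix gymnastics:
by(long, :species) do d
    DataFrame(total = sum(d[:abundance]))
end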

I’ve been lurking this thread a bit, and as a biologist who has used Julia to deal with (albeit much smaller) data sets, I can sympathize with @kevbonham and @gcalderone. Nevertheless, one useful way of thinking about metadata is that it is data that has not yet been parsed and added to the “real data”, and may contain many, many missing values. So while it is convenient (for me) to momentarily keep some data untouched, because it is not clear how exactly it should be parsed at that specific stage of the analysis AND because it may be just a ton of missings, there are often many ways to add it to the (in this case) DataFrame such that it retains all of its properties (e.g. a chunk of text as a String).

This seems safe, but if you think about it, lots of operations like select silently copy the dataframe, right? Having metadata be persistent (as long as the variable stays with the dataframe, its Dict entry stays) would be ideal for me.

Interesting. This is definitely doable, it just seems like a huge amount of data duplication: e.g. I’ll typically have hundreds of groups of thousands or tens of thousands of rows that are identical in all but one column. This format might be tidy, but it hardly seems efficient.

The visual way I think of my data is as perpendicular planes whose shared edge is the samples. This is like filling in the cube and then flattening it…

I am fascinated by the delegation approach for two reasons:

  • the implementation is conceptually very easy (although practically quite difficult);
  • Julia fosters composition of data structures, in place of inheritance.

Hence, I created a composite structure as follows:

# MetadataDict isn't defined in this snippet; a plain Dict alias is assumed here.
using DataFrames
const MetadataDict = Dict{Symbol, Any}

mutable struct DataFrame_Metadata <: AbstractDataFrame
    meta::MetadataDict
    data::DataFrame
end

and asked myself what I should do to use the new structure in place of DataFrame while maintaining exactly the same syntax. In other words, I want a DataFrame_Metadata object to behave exactly like a DataFrame object.

In an OOP language this is straightforward, but since I am now in love with Julia I want to solve this problem in the Julian way.

The steps to be performed are:

  1. redirect all access to a DataFrame object to a field of the DataFrame_Metadata structure, by re-defining all the methods accepting a DataFrame object (a minimal hand-written sketch follows this list);
  2. tweak these methods to propagate the metadata through DataFrame copies/slices/views;
  3. add methods to access the metadata.
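For concreteness, here is a minimal hand-written sketch of what steps 1 and 3 look like for a couple of methods (the accessor names are made up, and the real list is of course far longer):

# Step 1 (manual version): forward a few representative methods to the wrapped DataFrame.
DataFrames.nrow(dfm::DataFrame_Metadata) = nrow(dfm.data)
DataFrames.ncol(dfm::DataFrame_Metadata) = ncol(dfm.data)
Base.getindex(dfm::DataFrame_Metadata, col) = dfm.data[col]

# Step 3: trivial accessors for the metadata itself (hypothetical names).
metadata(dfm::DataFrame_Metadata) = dfm.meta
metadata!(dfm::DataFrame_Metadata, key::Symbol, value) = (dfm.meta[key] = value; dfm)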

As I said, step 1 is conceptually very easy, but a quick look with methodswith shows that I need to re-define 226 methods!!! Too much for my poor fingers, hence I wrote a program which uses the output from methodswith(DataFrame) to generate all the relevant method definitions.
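Roughly, the generated code boils down to loops like the following (just a sketch, assuming the first DataFrame argument can simply be swapped for the wrapped field; reconstructing the exact signatures from methodswith is the tedious part):

using InteractiveUtils   # provides methodswith

# Illustrative subset; the real list of names comes from methodswith(DataFrame).
for f in (:nrow, :ncol)
    @eval DataFrames.$f(dfm::DataFrame_Metadata, args...; kwargs...) =
        DataFrames.$f(dfm.data, args...; kwargs...)
end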

If you’re curious, this is the output: https://drive.google.com/file/d/1RW4VpkbYsjbiIzuETJ_0Q0lHio-7cuC7/view?usp=sharing

If you want to test it you can simply download it, include it, and use a DataFrame_Metadata object in exactly the same way you would use a DataFrame one. A few simple tests show that it behaves correctly.

So far I have just implemented step 1. Step 2 would be much more demanding since there is no simple way to automate it, hence I will need to look at all 226 methods. Finally, step 3 is very easy.
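For illustration, step 2 means doing something like this by hand for every method that produces a new DataFrame (a sketch, not exhaustive):

# Step 2: whenever the wrapped call creates a new DataFrame, re-wrap the result together
# with (a copy of) the metadata, so that copies and slices keep carrying it.
Base.copy(dfm::DataFrame_Metadata) =
    DataFrame_Metadata(copy(dfm.meta), copy(dfm.data))
Base.getindex(dfm::DataFrame_Metadata, rows, cols) =
    DataFrame_Metadata(copy(dfm.meta), dfm.data[rows, cols])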

My conclusions for this experiment:

  • step 1 can be automated, hence I believe it could be a nice feature to implement in post-v1.0 versions of Julia;
  • with a single composition level I had to add 226 methods, and the number will quickly explode as soon as new levels are added. For instance, I could define new structures encapsulating the DataFrame_Metadata one, specifically designed for astronomy or biology;

Given the above, I am no longer sure that the Julian way (i.e. composition over inheritance) is appropriate for solving this problem, and I wonder whether we have hit a serious limit. I am likely wrong, but I would appreciate it if someone more expert than me could discuss how to solve this problem.

Thanks!

As I mentioned above, having a MetaDataFrame wrapper for a DataFrame with metadata means that if someone defines a new method for a DataFrame, either the person who writes the new method or the maintainer of MetaDataFrame has to add that method to MetaDataFrame. Without a Julian class inheritance system (and I have no idea what the prospects for one are), the ultimate result of this system is that people wanting metadata will only be able to use a subset of the features that other users will.

On the other hand, “tweak these methods to propagate the metadata through DataFrames copies/slices/views” doesn’t bother me that much. I am not sure how automated I imagine adding metadata to be, as I would probably want to add the notes manually.

Yes, “tidy data” is often very redundant. Do you have so much data that it is a concern for you in practice?

Ideally, abstract types should have a well-defined interface, i.e. a collection of methods. Adding a method changes the interface, and should be an event rare enough to keep up with (it definitely warrants a bump in the minor version, so it can be caught). Otherwise, users who write methods build on the existing interface.

A thin wrapper for DataFrames metadata is certainly better than a new vector type. However, I would still like metadata in DataFrames because I really do think it is a fundamentally useful feature and, once implemented, will be widely used. I can’t really imagine working with a dataset and not wanting to label variables for ease of use. Otherwise I just get lost.

Same here, which is why I use descriptive variable names, both for dataframe columns and variables in general.

OK, let’s summarize a little bit:

  • Adding metadata to DataFrame (or similar structures) is very useful;
  • There is no point (for now) in trying to attach semantic meaning to metadata. The best we can do is to appropriately propagate metadata through copies/slices/views, and discard metadata when more complicated transformations are involved;
  • Two approaches can be envisioned: adding metadata to the DataFrame (or similar structures) as a whole, or attaching it to individual Arrays. Both have pros and cons, but we will likely need both;
  • For now, a reasonable way to store metadata is a Dict{Symbol, Any}, regardless of the approach followed (see the small example after this list);
  • Concerning the implementation with DataFrames:
    • there is a PR to encapsulate metadata within a DataFrame structure, both at the global and at the column level. The flaw in this implementation is that the metadata, stored in the colmeta field of the DataFrame structure, does not add any functionality to the package itself. It would be much better to leave the DataFrames package as it currently is and wrap it in a container along with metadata;
    • I tried the wrapping approach here, but it turns out there is a lot of boilerplate code to be written;
    • the difficulty in extending the DataFrame object may ultimately lie in the way the package has been implemented. E.g., the function Base.getindex(df::DataFrame, col_ind::ColumnIndex) in the package should actually accept an AbstractDataFrame as input, not a DataFrame. Moreover, the SubDataFrame struct inherits from AbstractDataFrame, but the SubDataFrame and DataFrame structures do not share the same fields.
    • I am not sure whether these are issues or intended design decisions for the DataFrames package, but they don’t allow the DataFrame code to be easily re-used (see here for a discussion on code reuse by means of composition).
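To make the Dict{Symbol, Any} point concrete, the idea is nothing more than this (keys and values are purely illustrative):

# Table-level metadata:
table_meta = Dict{Symbol, Any}(:source => "some repository", :downloaded => "2018-05-01")
# Column-level metadata, keyed by column name:
column_meta = Dict{Symbol, Any}(:abundance => "relative abundance (unitless)",
                                :iamc_id   => "sample identifier")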

Good summary. I don’t think wrapping DataFrame objects is a really good solution. It doesn’t have the advantage of storing meta-data with column vectors (namely, that the meta-data remains available when vectors are passed around separately), but it suffers from its own complications (which are related to delegating methods). It sounds much simpler to store the meta-data directly in the DataFrame object: that doesn’t make the code significantly more complex and it doesn’t hurt performance for people who don’t use meta-data.

Overall I’m inclined to add meta-data support to DataFrame because it’s simpler and we have a PR for that. @quinnj said he’s also in favor of that approach.


Apologies for the delay. I finally had a chance to benchmark this. On a moderately small (but typical) dataset, with an abundance table (microbial species as rows, samples as columns) of 541 samples and 578 species, plus a metadata table (samples as rows, metadata features as columns) of 316 samples and 28 types of metadata, the “tidy” version is a dataframe with 312,223 rows and 31 columns. Generating the “tidy” table and joining it with the metadata (with an outer join) took less than 5 sec (I didn’t actually benchmark it, but it’s trivial in the scheme of the setup).
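For reference, the setup was roughly along these lines (a sketch only: I’m calling the metadata table meta here, and the exact stack/rename/join calls depend on the DataFrames version):

# Melt the species-by-sample abundance table into long form, then attach the
# per-sample metadata with an outer join.
tidy = stack(tax, names(tax)[2:end])      # one row per species × sample, columns :variable/:value
rename!(tidy, :variable => :iamc_id)      # sample identifier
rename!(tidy, :value => :abundance)
tidy = join(tidy, meta, on = :iamc_id, kind = :outer)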

To do my “relative abundance” calculation I mentioned before, I wrote two functions:

# for the abundance table matrix, species x samples
function relab_tax!(df::DataFrame)
    for n in names(df[2:end])
        s_sum = sum(df[n])
        for sp in eachindex(df[n])
            df[sp, n] /= s_sum
        end
    end
end

# for the "tidy" version, each row is a unique species x sample pair
function relab_tidy!(df::DataFrame)
    by(df, :iamc_id) do relab
        s_sum = sum(relab[:abundance])
        relab[:abundance] ./= s_sum
    end
end

Here are the results of the benchmark:

julia> @benchmark relab_tax!(df) setup=(df=copy($tax))
BenchmarkTools.Trial:
  memory estimate:  31.44 MiB
  allocs estimate:  1743404
  --------------
  minimum time:     106.835 ms (11.74% GC)
  median time:      119.444 ms (12.22% GC)
  mean time:        121.497 ms (15.89% GC)
  maximum time:     136.442 ms (19.55% GC)
  --------------
  samples:          41
  evals/sample:     1

julia> @benchmark relab_tidy!(df) setup=(df=copy($tidy))
BenchmarkTools.Trial:
  memory estimate:  96.30 MiB
  allocs estimate:  4209389
  --------------
  minimum time:     178.097 ms (41.25% GC)
  median time:      276.739 ms (59.45% GC)
  mean time:        337.577 ms (67.36% GC)
  maximum time:     624.544 ms (80.42% GC)
  --------------
  samples:          15
  evals/sample:     1

So here, the tidy version is generally ~2-3x slower than the other version, but it’s actually not so much worse, considering I’m doing this in an EDA context. I was a little bit worried about some of my other datasets, which are much bigger, but realized that this is sort of cheating, since a lot of the abundances are zeros. Filtering those out, I actually get quite an improvement.
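The filtering step itself is just a row subset, along the lines of:

# Drop the zero-abundance rows, since a lot of the abundances are zeros.
tidy_sparse = tidy[tidy[:abundance] .> 0, :]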

julia> @benchmark relab_tidy!(df) setup=(df=copy($tidy_sparse))
BenchmarkTools.Trial:
  memory estimate:  16.62 MiB
  allocs estimate:  686601
  --------------
  minimum time:     23.862 ms (0.00% GC)
  median time:      44.797 ms (42.77% GC)
  mean time:        40.609 ms (37.19% GC)
  maximum time:     56.512 ms (40.34% GC)
  --------------
  samples:          123
  evals/sample:     1

I’m guessing this could be optimized even further, but considering how straightforward this is, I think @Tamas_Papp wins this argument… :stuck_out_tongue:


It was just a guess; I deviate from the tidy data layout quite frequently myself for large (> 10 GB) datasets. Developing <: AbstractVector types which support very redundant layouts that come from typical “tidy” operations could be worth experimenting with.

Thanks for the careful benchmarks, these are very useful for future readers of this topic. :+1:

Also note that by is currently much slower than it could be.

I’ve added an attempt at metadata here. I would be interested in people’s feedback and, since people will always be worried about performance, in any help making the code as performant as possible, especially for those who do not intend to use metadata in their work.

I definitely like that you seem to have kept all of the new metadata stuff completely out of the existing column stuff and the underlying AbstractVectors.

That said, wouldn’t it make more sense to have another AbstractDataFrame for this, like DataFrameWithMetadata (or a less horrible name)? I’m a bit wary of tacking new features onto DataFrame which are only relevant for a small minority of use cases; to me, the extreme simplicity of DataFrames is a huge part of the appeal, and I might even argue one of the reasons the package will still have for existing once JuliaDB matures. Another reservation I have about this concept is that it’s a bit hard to foresee the particular details of metadata use cases, so I’m not sure whether any of these attempts are both general enough to be applicable and specific enough to be useful. I wonder if energy is better spent making user-defined AbstractDataFrames easier to implement.


There is a lot of discussion about this already in this thread and others. I hope I have been able to convince even a few people that support for metadata living in dataframes would have a wide userbase. Everyone who uses Stata uses metadata, and everyone who works with survey data, or with large datasets from the World Bank or genomics, needs metadata to stay organized. It’s also a useful tool for documenting large amounts of data cleaning. I hope that my implementation adds as small a footprint as possible for the physics-based and other communities that use the package.