From my database background, it seems natural to me to have both metadata associated with the table as a whole, and with each column.
Why is there a debate over one or the other? They are not at all mutually exclusive.
I can see metadata that makes sense for individual values (e.g. units of measurement), some for a whole column ("all counties in the US") and some for a whole table. For quite a lot of stuff it is not super clear to me at which level that info belongs…
File formats seem to support varying types of metadata. Some support table level metadata, some column level metadata, and it is not clear to me whether any support value level metadata.
I think my gut reaction to this would be to support something that allows us to support the stuff that can be stored in the various file formats, and then stop there… Except for column selection, it seems to me that any column level metadata really can't be preserved in a meaningful way through query operations. Even something like a filter operation should probably not preserve the metadata of a column (e.g. if the column metadata says "complete list of US counties").
Apart from the question about the logical data model here, there is then of course the question of implementation. It seems to me that for units of measure one would probably want to encode that in the value type. Beyond that, I'm not really sure…
I think for Query.jl and TableTraits.jl maybe the following would most make sense for now:
- Query.jl operators just ignore any metadata, i.e. you lose your metadata when you pipe it through any operator, except if the metadata is embedded in the column type (say via a number type that encodes units). It just seems almost impossible to figure out what the right semantics would be otherwise.
- For TableTraits.jl I could imagine adding an optional interface for metadata. That would mainly enable two scenarios: loading and saving from disc could support metadata (if a file format supports it), and conversion between different table types could preserve metadata. Oh, and I guess a plotting library could in theory use the metadata, as long as it uses a table type or something from disc directly…
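To make the second bullet concrete, here is a purely hypothetical sketch of what such an optional interface could look like. None of these names (table_metadata, column_metadata, collect_metadata, ToySource) exist in TableTraits.jl; they are invented for illustration only.

```julia
# Hypothetical sketch of an optional metadata interface; NOT an actual
# TableTraits.jl API -- all names here are made up for illustration.

# Fallbacks: sources that don't opt in simply report no metadata.
table_metadata(source) = nothing
column_metadata(source, colname::Symbol) = nothing

# A toy source (think: a reader for a file format with metadata) opts in
# by defining the two functions for its own type.
struct ToySource
    tablemeta::Dict{String,Any}
    colmeta::Dict{Symbol,Dict{String,Any}}
end
table_metadata(s::ToySource) = s.tablemeta
column_metadata(s::ToySource, c::Symbol) = get(s.colmeta, c, nothing)

# A sink (a writer, a table constructor, a plotting package) can then
# gather whatever the source provides, without caring about its type.
function collect_metadata(source, colnames)
    out = Dict{Any,Any}(:table => table_metadata(source))
    for c in colnames
        out[c] = column_metadata(source, c)
    end
    return out
end
```

A source that defines neither method just hits the `nothing` fallbacks, so a sink can query any source unconditionally and forward only what it understands.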
@nalimilan: If it is just labels / explanations, I don't see any problems. But in that case, why not just create a label field?
@pdeffebach: Dropping metadata when in doubt (eg after a transformation) is a reasonable approach.
The question is not one or the other, it's how to implement column-specific meta-data.
On the contrary, column meta-data should always be preserved when selecting a subset of rows (this is what e.g. Stata does). Without this, meta-data wouldn't be very useful, as it wouldn't be available in many actual data sets people work with, which are often subsets of a larger original dataset. I don't think the counter-example "complete list of US counties" applies: it isn't column-specific meta-data, it describes the whole data set (i.e. you have one row for each county). Typical meta-data describes the contents of a column ("US county"), and we can make this a rule if we want.
I think it would make sense to be able to preserve column meta-data when selecting columns without transforming them. Of course in terms of implementation, that's not trivial. But in terms of semantics I think it's clear what the behavior should be. Otherwise we're going to get complaints from people who wonder why their meta-data was lost when subsetting using Query while it is preserved when doing the same operation with getindex.
Because labels are not the only kind of meta-data you may want to store. As I noted, survey questions or notes are also frequently useful, and for internal use one may want to define special keys. So having a more general system sounds useful.
One thing I still don't understand is why this needs to be implemented by DataFrames, not just another <: AbstractVector type that wraps another <: AbstractVector + metadata, and implements a metadata(::AbstractVector) = nothing (or similar) fallback. It is my understanding that

- this would work with DataFrames out of the box, like all <: AbstractVector types do,
- subsetting either columns or rows (with Base.getindex etc implemented) would just work, too,
- transformations and maps would automatically drop the metadata.

Maybe Base.similar should propagate the metadata though for this vector type, unless element types differ.
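The wrapper sketched in words above might look roughly like this. It is a toy version for discussion, not MetadataArrays.jl itself; the metadata function, the Dict{Symbol,Any} payload, and the choice to rewrap on row subsetting but drop metadata in similar are just one possible set of semantics.

```julia
# Toy metadata-carrying vector: wraps any AbstractVector plus a Dict.
struct MetaVector{T, A<:AbstractVector{T}} <: AbstractVector{T}
    data::A
    meta::Dict{Symbol,Any}
end

# Fallback: plain vectors have no metadata.
metadata(::AbstractVector) = nothing
metadata(v::MetaVector) = v.meta

# Minimal AbstractArray interface, delegating to the wrapped vector.
Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]

# Row subsetting rewraps the result, so metadata survives row selection.
Base.getindex(v::MetaVector, I::AbstractVector{Int}) =
    MetaVector(v.data[I], v.meta)

# `similar` returns a plain vector, i.e. metadata is dropped; transforms
# like `map` go through `similar`, so they drop it automatically too.
Base.similar(v::MetaVector, ::Type{T}, dims::Dims) where {T} =
    similar(v.data, T, dims)
```

With these definitions, a vector built as MetaVector([1.0, 2.0], Dict{Symbol,Any}(:label => "GDP per capita")) keeps its label through row subsetting, while similar and map return plain vectors without metadata.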
I think that the need for metadata becomes compelling when you read data from a source (a website, a file, a database, etc.), pass them to another environment (e.g. a data analysis package, plotting package, etc.), and want to ensure that all the relevant information is delivered, not just the numbers.
Therefore, an array-based approach (such as MetadataArrays) seems to me at best limited, and my proposal was indeed to attach metadata to "complete" data structures such as a DataFrame. Clearly other data structures can benefit as well from metadata facilities, but DataFrame is the first that pops into my mind.
I agree with @davidanthoff:
For instance, supporting the FITS file format amounts exactly to what I'm proposing here and can be readily implemented with the metadata support implemented in my PR.
Finally, besides the obvious unit/plot label, it seems to me that no important use case for column metadata has been illustrated here. Only @kevbonham provided a use case:
But I didn't understand what the rows of such a table contain…
Just to clarify: I was talking about an implementation of adding metadata to DataFrames by putting it in the columns as an <: AbstractVector wrapper.
I see this as a more modular and generic approach, which would work for vectors contained in and outside of DataFrames.
@Tamas_Papp: I think it is maybe a good time to decide on the semantics of this <: AbstractVector type. I've implemented more or less exactly what you describe in MetadataArrays.jl, with two differences:

- It errors instead of returning nothing as a fallback. I'd tend to agree with you that returning nothing makes more sense.
- Base.similar always keeps metadata in my implementation. What do you think is best? Always keep it, never keep it (and put nothing instead?), or only keep it if element type is preserved, and put nothing otherwise?
I have given a few potential reasons above. Have you missed them? I'm not saying they are totally decisive, but at least they show things are more complex than that.
Obviously the array-based approach would only be useful to store column meta-data. Something else would still be needed for global meta-data. But that doesn't mean array-based meta-data isn't useful. Let's not mix these two design decisions, which are completely orthogonal.
I've posted several examples of use cases and implementations in R, and Stata also supports column labels, so I think it's clear it's considered useful by a lot of people. Anyway there's no need to oppose column meta-data to global meta-data, so if the former isn't useful for you, you can just concentrate on the latter.
I guess it depends on what metadata returns, but if it returns a Dict as in MetaDataArrays currently, then it should either throw an error or return an empty dict as a fallback. But really that's a secondary issue that would better be discussed in the GitHub project; we already have too many questions in this topic.
I'd say similar should drop meta-data. "Similar" isn't "identical": it's used to create a vector of the same type and shape, but it can be filled with anything. That the input is "GDP per capita" doesn't mean the output will also be "GDP per capita" (or you'd just call copy). As a data point, similar(::CategoricalArray) does not preserve levels, because you are likely to put completely different data in the resulting array.
- It should return what can be construed as valid metadata in the API. Eg if metadata is a Dict{Any,Any}, it should return Dict{Any,Any}().
- I use Base.similar for constructing empty containers when an algorithm is inconvenient to express otherwise, so I would keep the metadata.
If you are referring to this, I did not miss it, I just had the impression that you were arguing that while the implementation poses some challenges, it has the advantage of allowing metadata outside dataframes.
It is likely that I did not express myself clearly, but I was trying to ask about metadata tied to the whole dataframe, not to columns.
Fair enough!
Moreover, in both cases we deal with metadata, but the implementation details can be quite different.
Hence I propose to clearly state whether a given comment/suggestion/proposal aims to deal with column or global metadata or both.
Mine deals with both, but is admittedly limited to DataFrames objects.
I'm concerned that having the metadata attached to the columns themselves is not generic: it only makes sense for a column-major store (and most databases store rows, except for a few such as Vertica).
For data coming from a database, you'd have metadata about the database, the tables in the database, and the columns in each table (which you'd get back as a vector, with the type information, labels etc. about each column in the rows).
If you do want to select some subset of the columns, it's very easy in Julia to index that vector of metadata with the range or a vector of indexes to get that subset, just as you'd do with the columns themselves.
Maybe what's needed is an "AbstractTable" (or is there already one?), which has an API for metadata, not limited to DataFrames, handling both the table- and column-level metadata.
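In the absence of an existing AbstractTable, here is one purely hypothetical shape such an API could take (every name below is invented for illustration): a wrapper holding one dict of table-level metadata plus one dict per column, where column selection just indexes into the latter, as described above.

```julia
# Hypothetical table-metadata API sketch; no such interface exists yet.
abstract type AbstractMetaTable end

struct MetaTable{T} <: AbstractMetaTable
    data::T                                  # any table-like object
    tablemeta::Dict{String,Any}              # table-level metadata
    colmeta::Dict{Symbol,Dict{String,Any}}   # per-column metadata
end

tablemetadata(t::MetaTable) = t.tablemeta
colmetadata(t::MetaTable, c::Symbol) = get(t.colmeta, c, Dict{String,Any}())

# Selecting a subset of columns keeps only the matching column metadata,
# echoing the vector-of-metadata indexing described above.
selectmeta(t::MetaTable, cols::Vector{Symbol}) =
    Dict(c => colmetadata(t, c) for c in cols)
```

Whether the wrapped data is row- or column-oriented is invisible at this level, which is the point of keeping the metadata API separate from the storage layout.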
Having a delegation mechanism like:
mutable struct MetadataContainer{T}
meta::Dict
data::T
end
# add metadata to DataFrame
@delegate DataFrame MetadataContainer{DataFrame} ".data"
# add metadata to arrays
@delegate AbstractArray MetadataContainer{AbstractArray} ".data"
# add metadata to Int numbers
@delegate Int MetadataContainer{Int} ".data"
would be the best choice. After all, it's just a matter of adding appropriate entries in the dispatch table, i.e.:
showcols(p1::MetadataContainer{DataFrame}) = showcols(p1.data)
This can be done through TypedDelegation, although all methods must be listed one by one.
A macro which automatically does this job, based on the methods returned by methodswith, would be really useful. Does something similar already exist? Or will core Julia ever support delegation?
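Short of a fully automatic macro, the forwarding can at least be generated in a loop with @eval over an explicit list of functions. The MetadataContainer definition from the post above is restated so the snippet runs on its own; a methodswith-based macro would additionally have to cope with keyword arguments and with methods that take several wrapped arguments, which is where it gets hard.

```julia
# Restates the wrapper from the earlier post so this runs standalone.
mutable struct MetadataContainer{T}
    meta::Dict{String,Any}
    data::T
end

# Generate simple forwarding methods with @eval: each listed Base
# function is redirected to the wrapped object, with extra positional
# arguments passed through untouched.
for f in (:length, :size, :getindex, :lastindex, :iterate)
    @eval Base.$f(m::MetadataContainer, args...) = Base.$f(m.data, args...)
end
```

This is essentially what TypedDelegation automates per-function; the open question in the thread is generating the list itself.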
Sorry but I don't follow. Storage is orthogonal to the concept of "columns" (values with homogeneous types along one coordinate). Column-specific metadata would make sense even if one stored rows as tuples.
Yes, that would be nice, but currently no such thing exists AFAIK.
Note, however, that it wouldn't magically solve all problems: if you want getindex to preserve meta-data, you need to implement custom methods which delegate to the wrapped vector/data frame, and rewrap the result in a MetadataContainer. But at least by default all methods would work (and discard meta-data).
Let me be concrete here. I work with human biological data, specifically microbiomes. A typical dataset is a series (on the order of hundreds) of samples, each of which has thousands to tens of thousands of measurements (the relative abundances of various microbial species). The way the data is currently structured is as a sparse matrix with samples as columns and rows as microbial features. Eg:
using DataFrames
taxa = DataFrame(species=["species_$x" for x in 1:10],
sample1=rand(10),
sample2=rand(10),
sample3=rand(10))
Many of these samples also have a ton of relevant clinical metadata associated with them, such as patient age, disease diagnosis, medication history etc. Usually, this comes to me in a separate table where rows are samples and columns are the type of information:
metadata = DataFrame(sample=["sample1", "sample2", "sample3"],
age_in_years=[30, 25, 57],
diagnosis=["Healthy Control", "RA", missing],
antibiotics=[true, false, missing])
The metadata is often incomplete: I only have certain information for certain samples, and some samples don't have any metadata. Some of my analyses don't require any metadata, or only require some of it, so I often have a complex series of dicts and comprehensions that I have to do in order to subset data in various ways. This throws the order of things out of whack (I've been bitten by not keeping track of the order of arrays), so I often use dicts of dicts etc., but I have to deal with missing values all the time and it's an utter mess. I'd say trying to cope with this stuff is between 70 and 85% of my analysis time.
So to be clear and restate things: I need column-level metadata. The implementation details (whether it's stored with the DataFrame or attached to the vectors themselves) matter less to me, but I would need the metadata to follow columns in subsetting and views (I've recently started mixing in Query, but I mostly do it with indexing and view()).
Another wrinkle to consider: I rather like the idea of tying this stuff to a generic table implementation, since SpatialEcology (which my package is now based on) uses CommMatrix objects, wrappers around sparse matrices that have special functions associated with them. Having a way to use metadata generically across many types of data representations (including DataTables, SparseMatrices… basically everything supported by IterableTables) would be really lovely (though I'm also sure it's a lot of work; it would save me a heap of time, but I don't have the time to implement it well).
Yeah, this would be amazing. @mkborregaard wrote a macro in SpatialEcology that seems similar to the type delegation you linked to, but also requires explicitly pointing out the methods you want forwarded.
I suspect it's slightly (or much) more complicated, because methods that operate on multiple arguments also need to be included.
I am not sure this is metadata; this should fit nicely in dataframes that describe various levels of the experiment, and should be amenable to formatting to "tidy data" using join. Am I missing something?
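For what it's worth, the join-based reshaping suggested here can be sketched with the two tables from the earlier post. This assumes the current DataFrames.jl reshaping and join functions (stack, leftjoin, Not); the column names match that example.

```julia
using DataFrames

# Wide abundance table: one row per species, one column per sample.
taxa = DataFrame(species=["species_$x" for x in 1:10],
                 sample1=rand(10), sample2=rand(10), sample3=rand(10))

# Per-sample clinical table, possibly incomplete.
metadata = DataFrame(sample=["sample1", "sample2", "sample3"],
                     age_in_years=[30, 25, 57],
                     diagnosis=["Healthy Control", "RA", missing],
                     antibiotics=[true, false, missing])

# Melt to long ("tidy") form: one row per (species, sample) pair ...
long = stack(taxa, Not(:species);
             variable_name=:sample, value_name=:abundance)
long.sample = String.(long.sample)

# ... then attach the clinical columns by joining on the sample id.
tidy = leftjoin(long, metadata, on=:sample)
```

Subsetting tidy by any clinical variable then keeps abundances and clinical data aligned in one table, with missing propagating naturally for samples that lack it.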