How to add metadata info to a DataFrame?

From my database background, it seems natural to me to have both metadata associated with the table as a whole, and with each column.
Why is there a debate over one or the other? They are not at all mutually exclusive.

I can see metadata that makes sense for individual values (e.g. units of measurement), some for a whole column ("all counties in the US"), and some for a whole table. For quite a lot of stuff it is not super clear to me at which level that info belongs…

File formats seem to support varying types of metadata. Some support table level metadata, some column level metadata, and it is not clear to me whether any support value level metadata.

I think my gut reaction to this would be to support something that allows us to support the stuff that can be stored in the various file formats, and then stop there… Except for column selection, it seems to me that any column-level metadata really can't be preserved in a meaningful way through query operations. Even something like a filter operation should probably not preserve the metadata of a column (e.g. the column metadata says "complete list of US counties").

Apart from the question about the logical data model here, there is then of course the question of implementation. It seems to me that for units of measure one would probably want to encode that in the value type. Beyond that, I'm not really sure…

I think for Query.jl and TableTraits.jl maybe the following would most make sense for now:

  • Query.jl operators just ignore any metadata, i.e. you lose your metadata when you pipe it through any operator. Except if the metadata is embedded in the column type (say via a number type that encodes units). It just seems almost impossible to figure out what the right semantics would be otherwise.
  • For TableTraits.jl I could imagine adding an optional interface for metadata. That would mainly enable two scenarios: loading and saving from disk could support metadata (if a file format supports it), and conversion between different table types could preserve metadata. Oh, and I guess a plotting library could in theory use the metadata, as long as it uses a table type or something from disk directly…
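Such an optional interface might look roughly like the following sketch. To be clear, this is purely hypothetical: `supports_metadata`, `table_metadata`, and `column_metadata` are invented names for illustration, and nothing like this exists in TableTraits.jl today.

```julia
# Hypothetical sketch of an optional metadata interface; none of these
# functions exist in TableTraits.jl, the names are invented for illustration.

supports_metadata(::Any) = false                        # default: no metadata
table_metadata(::Any) = Dict{String,Any}()              # table-level metadata
column_metadata(::Any, ::Symbol) = Dict{String,Any}()   # per-column metadata

# A table type opts in simply by adding methods:
struct MyTable
    columns::Dict{Symbol,Vector}
    meta::Dict{String,Any}
    colmeta::Dict{Symbol,Dict{String,Any}}
end

supports_metadata(::MyTable) = true
table_metadata(t::MyTable) = t.meta
column_metadata(t::MyTable, col::Symbol) = get(t.colmeta, col, Dict{String,Any}())
```

A file loader could then check `supports_metadata` before trying to read or write metadata, and a converter between table types could copy whatever these functions return.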

@nalimilan: If it is just labels / explanations, I don't see any problems. But in that case, why not just create a label field?

@pdeffebach: Dropping metadata when in doubt (e.g. after a transformation) is a reasonable approach.


The question is not one or the other, it's how to implement column-specific meta-data.

On the contrary, column meta-data should always be preserved when selecting a subset of rows (this is what e.g. Stata does). Without this, meta-data wouldn't be very useful, as it wouldn't be available in many actual data sets people work with, which are often subsets of a larger original dataset. I don't think the counter-example "Complete list of US counties" applies: it isn't column-specific meta-data, it describes the whole data set (i.e. you have one row for each county). Typical meta-data describes the contents of a column ("US county"), and we can make this a rule if we want.

I think it would make sense to be able to preserve column meta-data when selecting columns without transforming them. Of course in terms of implementation, that's not trivial. But in terms of semantics I think it's clear what the behavior should be. Otherwise we're going to get complaints from people who wonder why their meta-data was lost when subsetting using Query while it is preserved when doing the same operation with getindex.

Because labels are not the only kind of meta-data you may want to store. As I noted, survey questions or notes are also frequently useful, and for internal use one may want to define special keys. So having a more general system sounds useful.


One thing I still don't understand is why this needs to be implemented by DataFrames, rather than by another <: AbstractVector type that wraps another <: AbstractVector + metadata, and implements a

metadata(::AbstractVector) = nothing

(or similar) fallback. It is my understanding that

  1. this would work with DataFrames out of the box, like all <: AbstractVectors do,
  2. subsetting either columns or rows (with Base.getindex etc implemented) would just work, too,
  3. transformations and maps would automatically drop the metadata.

Maybe Base.similar should propagate the metadata for this vector type, though, unless element types differ.
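The wrapper idea above can be sketched in a few lines. This is a minimal illustration, not the actual MetadataArrays.jl API; `MetadataVector` and the `metadata` accessor are invented names.

```julia
# Minimal sketch of the wrapper idea: an AbstractVector that carries a
# metadata Dict and forwards indexing to its parent vector.

struct MetadataVector{T,V<:AbstractVector{T}} <: AbstractVector{T}
    parent::V
    metadata::Dict{String,Any}
end

Base.size(v::MetadataVector) = size(v.parent)
Base.getindex(v::MetadataVector, i::Int) = v.parent[i]
Base.setindex!(v::MetadataVector, x, i::Int) = (v.parent[i] = x)

# Generic fallback: plain vectors have no metadata.
metadata(::AbstractVector) = nothing
metadata(v::MetadataVector) = v.metadata
```

With only this much code, `v[1:2]` already works through the generic `AbstractArray` fallbacks, and since the default `similar` produces a plain `Array`, the result is an ordinary `Vector` with the metadata dropped. To preserve metadata under row subsetting, one would additionally overload non-scalar `getindex` (or `similar`) to rewrap the result.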


I think that the need for metadata becomes compelling when reading data from a source (a website, a file, a database, etc.), passing it to another environment (e.g. a data analysis package, a plotting package, etc.), and ensuring that all the relevant information is delivered, not just the numbers.

Therefore, an array-based approach (such as MetadataArrays) seems to me at best limited, and my proposal was indeed to attach metadata to "complete" data structures such as a DataFrame. Clearly other data structures can benefit as well from metadata facilities, but DataFrame is the first that pops into my mind.

I agree with @davidanthoff:

For instance, supporting the FITS file format amounts exactly to what I'm proposing here and can be readily implemented with the metadata support implemented in my PR.

Finally, besides the obvious unit/plot label, it seems to me that no important use case for column metadata has been illustrated here. Only @kevbonham provided a use case:

But I didn't understand what the rows of such a table contain…


Just to clarify: I was talking about an implementation of adding metadata to DataFrames by putting it in the columns as an <: AbstractVector wrapper.

I see this as a more modular and generic approach, one that would work for vectors contained in and outside of DataFrames.

@Tamas_Papp: I think it is maybe a good time to decide on the semantics of this <: AbstractVector type. I've implemented more or less exactly what you describe in MetadataArrays.jl, with two differences:

  1. It errors instead of returning nothing as a fallback. I'd tend to agree with you that returning nothing makes more sense.
  2. Base.similar always keeps metadata in my implementation. What do you think is best? Always keep it, never keep it (and put nothing instead?), or only keep it if the element type is preserved, and put nothing otherwise?

I have given a few potential reasons above. Have you missed them? I'm not saying they are totally decisive, but at least they show things are more complex than that.

Obviously the array-based approach would only be useful to store column meta-data. Something else would still be needed for global meta-data. But that doesn't mean array-based meta-data aren't useful. Let's not mix these two design decisions, which are completely orthogonal.

I've posted several examples of use cases and implementations in R, and Stata also supports column labels, so I think it's clear it's considered useful by a lot of people. Anyway, there's no need to oppose column meta-data to global meta-data, so if the former aren't useful for you, you can just concentrate on the latter.

I guess it depends on what metadata returns, but if it returns a Dict as in MetadataArrays currently, then it should either throw an error or return an empty dict as a fallback. But really that's a secondary issue that would be better discussed in the GitHub project; we already have too many questions in this topic.

I'd say similar should drop meta-data. "similar" isn't "identical": it's used to create a vector of the same type and shape, but it can be filled with anything. That the input is "GDP per capita" doesn't mean the output will also be "GDP per capita" (or you'd just call copy). As a data point, similar(::CategoricalArray) does not preserve levels, because you are likely to put completely different data in the resulting array.

  1. It should return what can be construed as valid metadata in the API. E.g. if metadata is a Dict{Any,Any}, it should return Dict{Any,Any}().

  2. I use Base.similar for constructing empty containers when an algorithm is inconvenient to express otherwise, so I would keep the metadata.

If you are referring to this, I did not miss it; I just had the impression that you were arguing that, while the implementation poses some challenges, it has the advantage of allowing metadata outside dataframes.

It is likely that I did not express myself clearly, but I was trying to ask about metadata tied to the whole dataframe, not to columns.

Fair enough! :wink:

Moreover, in both cases we deal with metadata, but the implementation details can be quite different.
Hence I propose to clearly state whether a given comment/suggestion/proposal aims to deal with column or global metadata or both.

Mine deals with both, but is admittedly limited to DataFrames objects.

I'm concerned that having the metadata attached to the columns themselves is not generic; it only makes sense for a column-major store (and almost all databases store rows, except for columnar ones like Vertica).

For data coming from a database, you'd have metadata about the database, the tables in the database, and the columns in each table (which you'd get back as a vector, with the type information, labels, etc. about each column in the rows).

If you do want to select some subset of the columns, it's very easy in Julia to index that vector of metadata with a range or a vector of indexes to get that subset, just as you'd do with the columns themselves.
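For instance (a toy sketch with invented variable names):

```julia
# Per-column metadata held in a separate vector, parallel to the columns.
colnames = [:id, :age, :income]
colmeta  = [Dict("label" => "Identifier"),
            Dict("label" => "Age in years"),
            Dict("label" => "Income, USD")]

# Selecting a subset of columns: index the metadata vector the same way.
keep     = [1, 3]
subnames = colnames[keep]
submeta  = colmeta[keep]
```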

Maybe what's needed is an "AbstractTable" (or is there already one?), which has an API for metadata, not limited to DataFrames, handling both table- and column-level metadata.

Having a delegation mechanism like:

mutable struct MetadataContainer{T}
  meta::Dict
  data::T
end

# add metadata to DataFrame
@delegate DataFrame MetadataContainer{DataFrame} ".data"

# add metadata to arrays
@delegate AbstractArray MetadataContainer{AbstractArray} ".data"

# add metadata to Int numbers
@delegate Integer MetadataContainer{Int} ".data"

would be the best choice. After all, it's just a matter of adding appropriate entries in the dispatch table, i.e.:

showcols(p1::MetadataContainer{DataFrame}) = showcols(p1.data)

This can be done through TypedDelegation, although all methods must be listed one by one.

A macro which automatically does this job, based on the methods returned by methodswith, would be really useful. Does something similar already exist? Or will core Julia ever support delegation?

Sorry, but I don't follow. Storage is orthogonal to the concept of "columns" (values with homogeneous types along one coordinate). Column-specific metadata would make sense even if one stored rows as tuples.

Yes, that would be nice, but currently no such thing exists AFAIK.

Note, however, that it wouldn't magically solve all problems: if you want getindex to preserve meta-data, you need to implement custom methods which delegate to the wrapped vector/data frame and rewrap the result in a MetadataContainer. But at least by default all methods would work (and discard meta-data).
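As a sketch of that rewrapping point, reusing the `MetadataContainer` struct proposed earlier, the custom `getindex` is exactly the part the default delegation would not give you:

```julia
# A wrapper that carries a metadata Dict alongside the wrapped object.
mutable struct MetadataContainer{T}
    meta::Dict
    data::T
end

metadata(c::MetadataContainer) = c.meta

# Custom getindex: delegate to the wrapped object, then rewrap the result
# so that subsetting preserves the metadata.
function Base.getindex(c::MetadataContainer, inds...)
    MetadataContainer(c.meta, c.data[inds...])
end
```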

Let me be concrete here. I work with human biological data, specifically microbiomes. A typical dataset is a series (on the order of hundreds) of samples, each of which has thousands to tens of thousands of measurements (the relative abundances of various microbial species). The way the data is currently structured is as a sparse matrix with samples as columns and microbial features as rows. E.g.:

using DataFrames
taxa = DataFrame(species=["species_$x" for x in 1:10],
                    sample1=rand(10),
                    sample2=rand(10),
                    sample3=rand(10))

Many of these samples also have a ton of relevant clinical metadata associated with them, such as patient age, disease diagnosis, medication history, etc. Usually, this comes to me in a separate table where rows are samples and columns are the types of information:

metadata = DataFrame(sample=["sample1", "sample2", "sample3"],
                    age_in_years=[30, 25, 57],
                    diagnosis=["Healthy Control", "RA", missing],
                    antibiotics=[true, false, missing])

The metadata is often incomplete: I only have certain information for certain samples, and some samples don't have any metadata. Some of my analyses don't require any metadata, or only require some of it, so I often have to build a complex series of dicts and comprehensions in order to subset data in various ways. This throws the order of things out of whack (I've been bitten by not keeping track of the order of arrays), so I often use dicts of dicts etc., but I have to deal with missings all the time and it's an utter mess. I'd say trying to cope with this stuff is between 70 and 85% of my analysis time.

So to be clear and restate things: I need column-level metadata. The implementation details (whether it's stored with the DataFrame or attached to the vectors themselves) matter less to me, but I would need the metadata to follow columns in subsetting and views (I've recently started mixing in Query, but I mostly do it with indexing and view()).

Another wrinkle to consider: I rather like the idea of tying this stuff to a generic table implementation, since in SpatialEcology (which my package is now based on) we use CommMatrixes, which are wrappers around sparse matrices with special functions associated with them. Having a way to use metadata generically across many types of data representations (including DataTables, SparseMatrices… basically everything supported by IterableTables) would be really lovely (though I'm sure it's also a lot of work that would save me a heap of time, but that I don't have the time to implement well).

Yeah, this would be amazing. @mkborregaard wrote a macro in SpatialEcology that seems similar to the type delegation you linked to, but also requires explicitly pointing out the methods you want forwarded.

I suspect it's slightly (or much) more complicated, because methods that operate on multiple arguments also need to be included.

I am not sure this is metadata; this should fit nicely in dataframes that describe various levels of the experiment, and should be amenable to formatting to "tidy data" using join. Am I missing something?
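To illustrate the tidy-data suggestion with the tables from the post above, here is a sketch assuming a recent DataFrames.jl (where `stack` and `leftjoin` are available); the clinical table is renamed `clinical` to avoid shadowing a `metadata` function:

```julia
using DataFrames

# Reshape the taxa table so each row is one (species, sample, abundance)
# observation, then join the clinical table on the sample ID.
taxa = DataFrame(species=["species_$x" for x in 1:10],
                 sample1=rand(10), sample2=rand(10), sample3=rand(10))
clinical = DataFrame(sample=["sample1", "sample2", "sample3"],
                     age_in_years=[30, 25, 57],
                     diagnosis=["Healthy Control", "RA", missing],
                     antibiotics=[true, false, missing])

long = stack(taxa, Not(:species); variable_name=:sample, value_name=:abundance)
tidy = leftjoin(long, clinical; on=:sample)
```

After this, subsetting by any clinical variable is a single filter on `tidy`, and rows can never get out of order relative to their metadata.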