How to add metadata info to a DataFrame?

I agree both solutions are reasonable: store the meta-data in the data frame, or in each column vector. Each has advantages and drawbacks:

  • In the data frame:
    • Advantages: columns are just standard Vector objects, which is simpler both for users (it feels kind of weird to avoid the standard array type in very common use cases) and for the implementation
    • Drawbacks: meta-data is lost as soon as a column is used separately from its data frame
  • In the column vector:
    • Advantages: meta-data is kept with the data it describes and some functions could make use of that even without supporting data frames
    • Drawbacks: requires using a custom array type; can become quite involved if you need a MetaDataVector{CategoricalVector{...}} (which will be quite common). For example, we’ll need custom recode methods which call the efficient CategoricalArray method instead of the fallback AbstractArray one. Any custom array type may need similar tricks, which (currently) requires one package to depend on the other.
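
To make the trade-off above concrete, here is a minimal sketch of both options in plain Julia. Every name in it (MetaTable, MetaVector, FastVector, meta, path_naive, path_forwarded) is hypothetical and made up for illustration; none of this is the DataFrames or CategoricalArrays API. FastVector only stands in for a specialised array type such as CategoricalVector.

```julia
# Hypothetical types for illustration only; not an existing package API.

# Approach 1: meta-data lives in the table, columns are plain Vectors.
struct MetaTable
    columns::Dict{Symbol,Vector}
    colmeta::Dict{Symbol,Dict{String,String}}
end

# Approach 2: meta-data lives in a wrapper array type.
struct MetaVector{T,V<:AbstractVector{T}} <: AbstractVector{T}
    data::V
    meta::Dict{String,String}
end

# Minimal AbstractArray interface, forwarded to the wrapped vector.
Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]
Base.IndexStyle(::Type{<:MetaVector}) = IndexLinear()

meta(v::MetaVector) = v.meta
meta(v::AbstractVector) = Dict{String,String}()  # plain vectors carry nothing

tbl = MetaTable(Dict{Symbol,Vector}(:income => [10.0, 20.0]),
                Dict(:income => Dict("label" => "Household income")))
col = tbl.columns[:income]   # a plain Vector{Float64}: the label stays behind in tbl
wrapped = MetaVector([10.0, 20.0], Dict("label" => "Household income"))

# Stand-in for a specialised array type with its own efficient methods.
struct FastVector{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(v::FastVector) = size(v.data)
Base.getindex(v::FastVector, i::Int) = v.data[i]

# Without a forwarding method, a wrapped FastVector dispatches to the fallback.
path_naive(::AbstractVector) = "generic fallback"
path_naive(::FastVector) = "efficient method"

# With explicit unwrapping, the efficient method is reached again.
path_forwarded(v::AbstractVector) = path_naive(v)
path_forwarded(v::MetaVector) = path_forwarded(v.data)

nested = MetaVector(FastVector([1, 2]), Dict{String,String}())
```

Here meta(col) is empty while meta(wrapped) still has the label, and path_naive(nested) takes the fallback while path_forwarded(nested) reaches the efficient method. The forwarding method is the kind of "trick" mentioned above: whoever defines it has to know about both the wrapper and the wrapped type.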

It seems that dplyr has chosen the second approach (see these issues). In particular, haven::read_dta stores meta-data from Stata as attributes of the column vectors. The memisc, Hmisc and labelled packages do the same. But of course R is much more limited in terms of dispatch, so the situation is quite different.

Something we should also think about is how meta-data could be preserved across streaming operations with Query and DataStreams. CategoricalArray handles this via the special CategoricalValue element type, but a MetaDataArray just contains normal entries, so you can’t retrieve the meta-data from individual entries. I’m not sure which of the two ways of storing meta-data would be easier to handle with Query and DataStreams. This matters in particular when combining Query with plots à la StatsPlots. We should probably resolve this issue before choosing which approach is most appropriate. Maybe extending schemas to include meta-data in addition to column names and types would be useful; in that case either approach would be OK. Cc: @davidanthoff @mkborregaard @quinnj.
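
As a rough illustration of the schema idea, here is a sketch where the meta-data travels with a schema object while the streamed rows stay ordinary NamedTuples. MetaSchema and select_columns are hypothetical names invented for this example; this is not how DataStreams or Query actually represent schemas, and it assumes every requested column exists in the schema.

```julia
# Hypothetical schema type; not the actual DataStreams.jl or Query.jl API.
struct MetaSchema
    names::Vector{Symbol}
    types::Vector{DataType}
    colmeta::Dict{Symbol,Dict{String,String}}
end

# A row-wise operation that keeps only the selected columns.
# Rows carry no meta-data themselves; the returned schema carries it.
# Assumes every symbol in `keep` is present in `schema.names`.
function select_columns(schema::MetaSchema, rows, keep::Vector{Symbol})
    idx = [findfirst(==(k), schema.names) for k in keep]
    kept_meta = Dict{Symbol,Dict{String,String}}(
        k => schema.colmeta[k] for k in keep if haskey(schema.colmeta, k))
    newschema = MetaSchema(schema.names[idx], schema.types[idx], kept_meta)
    # Lazily rebuild each row with only the kept fields.
    newrows = (NamedTuple{Tuple(keep)}(Tuple(getfield(r, k) for k in keep))
               for r in rows)
    return newschema, newrows
end

schema = MetaSchema([:income, :region], [Float64, String],
                    Dict(:income => Dict("label" => "Household income")))
rows = [(income = 10.0, region = "N"), (income = 20.0, region = "S")]

newschema, newrows = select_columns(schema, rows, [:income])
```

With this layout it does not matter whether the source stored its meta-data in the data frame or in the column vectors: either one only needs to produce the schema at the start of the stream, and a sink such as a plotting recipe can read labels from newschema.colmeta.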