I agree both solutions are reasonable: store the meta-data in the data frame, or in each column vector. Each has advantages and drawbacks:
- In the data frame:
  - Advantages: columns are just standard `Vector` objects, which is simpler for users (it feels kind of weird to avoid using the standard array type in very common use cases) and simpler to implement
  - Drawbacks: meta-data is lost as soon as a column is used separately from its data frame
- In the column vector:
  - Advantages: meta-data is kept with the data it describes, and some functions could make use of that even without supporting data frames
  - Drawbacks: requires using a custom array type; can become quite involved if you need a `MetaDataVector{CategoricalVector{...}}` (which will be quite common). For example, we’ll need custom `recode` methods which call the efficient `CategoricalArray` method instead of the fallback `AbstractArray` one. Any custom array type may need similar tricks, which (currently) requires one package to depend on the other.
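
To make the dispatch drawback concrete, here is a minimal sketch in plain Base Julia. Everything in it is hypothetical and for illustration only: `MetaVector` stands in for a meta-data wrapper, `FastVector` stands in for a specialized array type such as `CategoricalVector`, and `recode_path` stands in for a function like `recode` that has both a generic fallback and an efficient method. None of these names come from an existing package.

```julia
# Hypothetical wrapper type: a column vector that carries its own meta-data.
struct MetaVector{T, A<:AbstractVector{T}} <: AbstractVector{T}
    data::A
    meta::Dict{Symbol, Any}
end

# Minimal AbstractArray interface, forwarded to the wrapped vector.
Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]
Base.IndexStyle(::Type{<:MetaVector}) = IndexLinear()

# Hypothetical stand-in for a specialized array type (e.g. a categorical vector).
struct FastVector{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(v::FastVector) = size(v.data)
Base.getindex(v::FastVector, i::Int) = v.data[i]
Base.IndexStyle(::Type{<:FastVector}) = IndexLinear()

# Hypothetical stand-in for a function like recode:
# a generic fallback plus an efficient method for the specialized type.
recode_path(::AbstractVector) = :fallback
recode_path(::FastVector) = :efficient

col = MetaVector(FastVector([1, 2, 3]), Dict{Symbol, Any}(:label => "Age in years"))

# The meta-data travels with the column, independently of any data frame.
label = col.meta[:label]

# Without a dedicated method, the wrapper only matches the generic fallback.
before = recode_path(col)

# A forwarding method unwraps the column so the efficient method is reached.
recode_path(v::MetaVector) = recode_path(v.data)
after = recode_path(col)
```

The forwarding method is the "trick" in question: whichever package defines it has to know about both the wrapper type and the function being forwarded, which is where the package dependency comes from.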
It seems that dplyr has chosen the second approach (see these issues). In particular, `haven::read_dta` stores meta-data from Stata as attributes of the column vectors. The memisc, Hmisc and labelled packages do the same. But of course R is much more limited in terms of dispatch, so the situation is quite different.
Something we should also think about is how meta-data could be preserved across streaming operations with Query and DataStreams. `CategoricalArray` handles this via the special `CategoricalValue` element type, but a `MetaDataArray` just contains normal entries, so you can’t retrieve the meta-data from individual entries. I’m not sure which of the two ways of storing meta-data would be easier to handle with Query and DataStreams. This matters in particular when combining Query with plots à la StatsPlots. We should probably resolve this issue before choosing the most appropriate approach. Maybe an extension of schemas to include meta-data in addition to column names and types would be useful; in that case either approach would be OK. Cc: @davidanthoff @mkborregaard @quinnj.
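
As a rough illustration of the schema idea, the sketch below uses only Base Julia. It is not the Query or DataStreams API: `MetaSchema` and the column names and labels are made up for the example. The point is only that if meta-data lives in the schema, rows can be streamed as plain values and the sink can reattach the meta-data afterwards, regardless of where the source stored it.

```julia
# Hypothetical schema type: column names and types, plus per-column meta-data.
struct MetaSchema
    names::Vector{Symbol}
    types::Vector{DataType}
    meta::Dict{Symbol, Dict{Symbol, Any}}
end

# A source: columns are plain vectors, meta-data is described by the schema.
columns = (age = [31, 45, 27], income = [52.0, 61.5, 48.2])
schema = MetaSchema(
    [:age, :income],
    [Int, Float64],
    Dict(:age => Dict{Symbol, Any}(:label => "Age in years"),
         :income => Dict{Symbol, Any}(:label => "Income (thousands)")),
)

# Stream rows as plain NamedTuples; individual entries carry no meta-data.
rows = ((age = a, income = i) for (a, i) in zip(columns.age, columns.income))

# A row-wise operation (here a filter) that knows nothing about meta-data.
kept = [r for r in rows if r.age > 30]

# The sink rebuilds the columns and takes the meta-data from the schema.
result = (age = [r.age for r in kept], income = [r.income for r in kept])
result_meta = schema.meta
```

This only covers operations that keep columns as they are; anything that renames, drops or creates columns would also need to say how the meta-data in the schema should be updated.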