Storing both data and metadata in the same object or container is a common approach in many research fields, and it has proven very successful in astronomy, for instance, where a few standards have been formalized and are now widely used (e.g. FITS and VOTable).
In Julia, a common way to represent tabular data is by means of a DataFrame, and I (as an astronomer) would like to associate metadata with such an object, both with the table as a whole and with individual columns. But DataFrame has no support for metadata. This topic has already been discussed (e.g. here and here), but no solution has been implemented.
How can I add metadata support to a DataFrame object?
To make the question concrete, let’s assume I wish to add a “source” label to the whole table to specify where the data comes from, and a “unit” label to each column to specify the units of its values. A String type for the label contents is fine, but a more general Any would be preferable.
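One possible shape for such metadata can be sketched with Base Julia only. The TableMeta name, its two fields, and the indexing overloads below are made up for illustration and are not part of DataFrames.jl; the point is just to show table-level and column-level labels holding values of any type.

```julia
# Hypothetical container: table-level labels plus per-column labels.
struct TableMeta
    table::Dict{Symbol,Any}
    columns::Dict{Symbol,Dict{Symbol,Any}}
end

# Start with no metadata at all.
TableMeta() = TableMeta(Dict{Symbol,Any}(), Dict{Symbol,Dict{Symbol,Any}}())

# meta[:label] reads or writes a table-level label.
Base.getindex(m::TableMeta, key::Symbol) = m.table[key]
Base.setindex!(m::TableMeta, value, key::Symbol) = (m.table[key] = value)

# meta[:column, :label] reads or writes a column-level label.
Base.getindex(m::TableMeta, col::Symbol, key::Symbol) = m.columns[col][key]
function Base.setindex!(m::TableMeta, value, col::Symbol, key::Symbol)
    # Create the per-column dictionary the first time a column is seen.
    colmeta = get!(m.columns, col, Dict{Symbol,Any}())
    colmeta[key] = value
    return value
end

meta = TableMeta()
meta[:source] = "www.some.site"
meta[:col1, :unit] = "km / s"
meta[:col2, :unit] = nothing   # values are not restricted to strings
```

This only describes the metadata itself; how to attach it to a DataFrame is the actual question.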
In an OOP language I would simply use class inheritance, but this is not possible in Julia, where concrete types cannot be subtyped. Hence, I can foresee four possibilities, none of which seems optimal to me:
- Use composition to create a new struct as follows:

    mutable struct DataFrame_withMeta
        meta::Any
        data::DataFrame
    end

and use DataFrame_withMeta objects in place of DataFrame ones. However, this will force me to always add .data wherever a DataFrame object is needed, i.e. in the vast majority of cases. In other words, I would lose much of the very simple interoperability between DataFrames and other packages (e.g. Gadfly);
- Use delegation of the above-mentioned struct through TypedDelegation. Although effective, this approach appears quite tricky since it requires me to list all the possible methods accepting a DataFrame object;
- Inherit from the AbstractDataFrame type as follows:

    mutable struct DataFrame_withMeta <: AbstractDataFrame
        meta::Any
        ...
    end

where ... is the actual content of the DataFrame structure. This means that each time the DataFrame structure is changed/updated I will also need to change DataFrame_withMeta accordingly. Moreover, this would only allow interoperability with packages accepting AbstractDataFrame objects, not DataFrame ones;
- Issue a PR to the DataFrames.jl package maintainers where I simply add a meta field to the relevant structures, to be accessed as follows:

    df = DataFrame(:col1=>1:10, :col2=>rand(10))
    df.meta[:source] = "www.some.site"
    df.meta[:col1, :unit] = "km / s"

This is the easiest and most straightforward approach: it does not add any package dependency or breaking change. Moreover, it would allow packages that return DataFrame objects (such as RDatasets, CSV, etc.) to provide metadata, and packages that accept DataFrame objects (such as Gadfly) to exploit that metadata.
The drawback is that this change will not add any new functionality to the DataFrames package itself, since the meta facility will be mainly used by related packages. Hence, there is no point in adding it to the main package.
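For reference, the first two options can be sketched together as follows, assuming DataFrames.jl is installed. The one-argument constructor and the short list of forwarded methods are illustrative choices, not an established API; the forwarding is written by hand here rather than through TypedDelegation, and it is deliberately incomplete to show where the .data suffix is still needed.

```julia
using DataFrames

# Wrapper from the first option: metadata stored next to the table.
mutable struct DataFrame_withMeta
    meta::Dict{Any,Any}
    data::DataFrame
end

# Hypothetical convenience constructor: wrap a table with empty metadata.
DataFrame_withMeta(data::DataFrame) = DataFrame_withMeta(Dict{Any,Any}(), data)

# Hand-written forwarding for a few methods only. Any function not listed
# here still has to be called on `t.data`.
Base.size(t::DataFrame_withMeta, args...) = size(t.data, args...)
Base.names(t::DataFrame_withMeta) = names(t.data)
Base.getindex(t::DataFrame_withMeta, args...) = getindex(t.data, args...)

t = DataFrame_withMeta(DataFrame(:col1 => 1:10, :col2 => rand(10)))
t.meta[:source] = "www.some.site"
t.meta[(:col1, :unit)] = "km / s"   # tuple key for column-level labels

size(t)            # forwarded to the wrapped DataFrame
t[!, :col1]        # forwarded column access
describe(t.data)   # not forwarded, so `.data` is required
```

The forwarding list grows with every function one wants to support, which is the maintenance cost described in the second option.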
Is there any other simple and effective solution?
(Sorry for the very long question…)