Storing both data and metadata in the same object or container is a very common approach in many research fields, and it has proven very successful in, for instance, astronomy, where a few standards have been formalized and are now commonly used (e.g. FITS and VOTable).
In Julia, a common way to represent tabular data is a DataFrame, and I (as an astronomer) would like to associate metadata with such an object, both with the table as a whole and with individual columns. But DataFrame has no support for metadata. This topic has already been discussed (e.g. here and here), but no solution has been implemented.
How can I add metadata support to a DataFrame object?
To clarify the question as much as possible, let’s assume I wish to add a “source” label to the whole table to specify where the data comes from, and a “unit” label for each column specifying the units for the numbers. A String type for the label contents is fine, but a more general Any would be preferable.
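To make this concrete, here is a minimal sketch of the information I would like to attach, using plain dictionaries that are independent of any DataFrame. The names `table_meta` and `column_meta` and the `:retrieved` label are hypothetical and only serve as illustration; they are not part of DataFrames.jl.

```julia
# Hypothetical layout (not a DataFrames.jl API): one dictionary for
# table-level labels, and one dictionary per column for column-level labels.
table_meta = Dict{Symbol,Any}(:source => "www.some.site")

column_meta = Dict{Symbol,Dict{Symbol,Any}}(
    :col1 => Dict{Symbol,Any}(:unit => "km / s"),
)

# Values are typed as Any, so a label is not restricted to a String.
# The :retrieved label below is purely illustrative.
table_meta[:retrieved] = (2018, 1, 1)
```

The open question is how to keep such dictionaries attached to the table they describe.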
In an OOP language I would simply use class inheritance, but this is not possible in Julia, where concrete types cannot be subtyped. Hence, I can foresee four possibilities, none of which seems optimal to me:
- Use composition to create a new struct as follows:
mutable struct DataFrame_withMeta
    meta::Any
    data::DataFrame
end
and use DataFrame_withMeta objects in place of DataFrame ones. However, this will force me to always add .data wherever a DataFrame object is needed, i.e. in the vast majority of cases. In other words, I would lose much of the very simple interoperability between DataFrames and other packages (e.g. Gadfly);
- Use delegation of the above-mentioned struct through TypedDelegation. Although effective, this approach appears quite tricky since it requires me to list all the possible methods accepting a DataFrame object;
- Inherit from the AbstractDataFrame type:
mutable struct DataFrame_withMeta <: AbstractDataFrame
    meta::Any
    ...
end
where ... is the actual content of the DataFrame structure. This means that each time the DataFrame structure is changed or updated I will also need to change DataFrame_withMeta accordingly. Moreover, this would only allow interoperability with packages accepting AbstractDataFrame objects, not DataFrame ones;
- Issue a PR to the DataFrames.jl package maintainers where I simply add a meta field to the relevant structures, to be accessed as follows:
df = DataFrame(:col1=>1:10, :col2=>rand(10))
df.meta[:source] = "www.some.site"
df.meta[:col1, :unit] = "km / s"
This is the easiest and most straightforward approach: it does not add any package dependency or breaking change. Moreover, it would allow packages that return DataFrame objects (such as RDatasets, CSV, etc.) to provide metadata, and packages that accept DataFrame objects (such as Gadfly) to exploit metadata information.
The drawback is that this change will not add any new functionality to the DataFrames package itself since the meta facility will be mainly used by related packages. Hence, there is no point in adding it to the main DataFrames package.
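For reference, here is a minimal sketch of the first two options combined: a wrapper struct built by composition, with a couple of functions forwarded by hand. It assumes the DataFrames package is installed; the choice of `Dict{Symbol,Any}` for the `meta` field and the two forwarded functions (`size` and `names`) are arbitrary choices for illustration, not a recommended design.

```julia
using DataFrames

# Composition: the metadata lives next to an unmodified DataFrame.
mutable struct DataFrame_withMeta
    meta::Dict{Symbol,Any}
    data::DataFrame
end

# Manual delegation: each function that should accept the wrapper
# has to be forwarded explicitly to the inner DataFrame.
Base.size(x::DataFrame_withMeta) = size(x.data)
Base.names(x::DataFrame_withMeta) = names(x.data)

df = DataFrame(:col1 => 1:10, :col2 => rand(10))
wrapped = DataFrame_withMeta(Dict{Symbol,Any}(:source => "www.some.site"), df)

size(wrapped)            # works, because size was forwarded above
describe(wrapped.data)   # not forwarded, so .data is still required
```

This shows both drawbacks at once: any function that is not forwarded still needs `.data`, and the list of forwarded functions has to be maintained by hand.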
Is there any other simple and effective solution?
(Sorry for the very long question…)