Storing both data and metadata in the same object or container is a very common approach in many research fields, and it has proven very successful in, for instance, astronomy, where a few standards have been formalized and are now commonly used (e.g. FITS and VOTable).
In Julia, a common way to represent tabular data is a DataFrame, and I (as an astronomer) would like to associate metadata with such an object, both with the table as a whole and with individual columns. But DataFrame has no support for metadata. This topic has already been discussed (e.g. here and here), but no solution has been implemented.
How can I add metadata support to a DataFrame object?
To make the question as concrete as possible, let’s assume I wish to add a “source” label to the whole table to specify where the data comes from, and a “unit” label to each column specifying the units of its values. A String type for the label contents is fine, but a more general Any would be preferable.
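As a minimal sketch of what I would like to carry around, here is the same information kept in ordinary dictionaries next to the table, using plain Julia only. The names table_meta and column_meta and the :range key are purely illustrative and are not part of any package:

```julia
# Table-level metadata: arbitrary keys mapped to arbitrary values.
table_meta = Dict{Symbol,Any}(:source => "www.some.site")

# Column-level metadata: one dictionary per column name.
column_meta = Dict{Symbol,Dict{Symbol,Any}}(
    :col1 => Dict{Symbol,Any}(:unit => "km / s"),
    # Values need not be strings, hence the Any value type.
    :col2 => Dict{Symbol,Any}(:unit => "dimensionless", :range => (0.0, 1.0)),
)

table_meta[:source]          # where the data comes from
column_meta[:col1][:unit]    # unit of a single column
```

The problem is that nothing ties these dictionaries to the table itself, so they are lost as soon as the DataFrame is passed to another function or package.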
In an OOP language I would simply use class inheritance, but this is not possible in Julia. Hence, I can foresee four possibilities, none of which seems optimal to me:
- Use composition to create a new struct as follows:
mutable struct DataFrame_withMeta
    meta::Any
    data::DataFrame
end
and use DataFrame_withMeta objects in place of DataFrame ones. However, this would force me to always add .data wherever a DataFrame object is needed, i.e. in the vast majority of cases. In other words, I would lose much of the very simple interoperability between DataFrames and other packages (e.g. Gadfly);
- Use delegation of the above-mentioned struct through TypedDelegation. Although effective, this approach appears quite tricky, since it requires me to list all the possible methods accepting a DataFrame object;
- Inherit from the AbstractDataFrame type:
mutable struct DataFrame_withMeta <: AbstractDataFrame
    meta::Any
    ...
end
where ... is the actual content of the DataFrame structure. This means that each time the DataFrame structure is changed or updated I will also need to change DataFrame_withMeta accordingly. Moreover, this would only allow interoperability with packages accepting AbstractDataFrame objects, not DataFrame ones;
- Issue a PR to the DataFrames.jl package maintainers where I simply add a meta field to the relevant structures, to be accessed as follows:
df = DataFrame(:col1=>1:10, :col2=>rand(10))
df.meta[:source] = "www.some.site"
df.meta[:col1, :unit] = "km / s"
This is the easiest and most straightforward approach: it does not add any package dependency or breaking change. Moreover, it would allow packages which return DataFrame objects (such as RDatasets, CSV, etc.) to provide metadata, and packages which accept DataFrame objects (such as Gadfly) to exploit that metadata.
The drawback is that this change would not add any new functionality to the DataFrames package itself, since the meta facility would mainly be used by related packages. Hence, there is no point in adding it to the main DataFrames package.
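To illustrate why the composition and delegation options feel clumsy, here is a rough sketch of a wrapper with a few methods forwarded by hand. MetaTable, its field names and the choice of forwarded functions are hypothetical, and the sketch assumes a reasonably recent DataFrames.jl; it is not meant as a complete solution:

```julia
using DataFrames

# Hypothetical wrapper type; the name and fields are illustrative only.
struct MetaTable
    data::DataFrame
    meta::Dict{Symbol,Any}                   # table-level metadata
    colmeta::Dict{Symbol,Dict{Symbol,Any}}   # per-column metadata
end

# Convenience constructor that starts with empty metadata.
MetaTable(df::DataFrame) =
    MetaTable(df, Dict{Symbol,Any}(), Dict{Symbol,Dict{Symbol,Any}}())

# Hand-written forwarding: every function that should accept a MetaTable
# has to be listed explicitly, which is the tedious part.
Base.size(t::MetaTable, args...) = size(t.data, args...)
Base.names(t::MetaTable) = names(t.data)
Base.getindex(t::MetaTable, args...) = getindex(t.data, args...)

df = DataFrame(col1 = 1:10, col2 = rand(10))
t = MetaTable(df)
t.meta[:source] = "www.some.site"
t.colmeta[:col1] = Dict{Symbol,Any}(:unit => "km / s")

size(t)        # forwarded to the wrapped DataFrame
t[!, :col1]    # column access is forwarded as well
t.data         # still required for anything not forwarded above
```

Any function from another package that dispatches on DataFrame (and was not forwarded) still needs t.data, so the interoperability problem is only reduced, not removed.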
Is there any other simple and effective solution?
(Sorry for the very long question…)