How to add metadata info to a DataFrame?

Storing both data and metadata in the same object or container is a very common approach in many research fields, and it has proven very successful in (for instance) astronomy, where a few standards have been established and are now widely used (e.g. FITS and VOTable).

In Julia a common way to represent tabular data is by means of a DataFrame, and I (as an astronomer) would like to associate metadata with such an object, both with the table as a whole and with individual columns. However, DataFrame has no support for metadata. This topic has already been discussed (e.g. here and here), but no solution has been implemented.

How can I add metadata support to a DataFrame object?

To clarify the question as much as possible, let’s assume I wish to add a “source” label to the whole table to specify where the data comes from, and a “unit” label for each column specifying the units for the numbers. A String type for the label contents is fine, but a more general Any would be preferable.

In an OOP language I would simply use class inheritance, but this is not possible in Julia. Hence, I can foresee four possibilities, none of which seems optimal to me:

  1. Use composition to create a new struct as follows:
mutable struct DataFrame_withMeta
  meta::Any
  data::DataFrame
end

and use DataFrame_withMeta objects in place of DataFrame ones. However, this would force me to always add .data wherever a DataFrame object is needed, i.e. in the vast majority of cases. In other words, I would lose much of the very simple interoperability between DataFrames and other packages (e.g. Gadfly);

  2. Use delegation for the above-mentioned struct through TypedDelegation. Although effective, this approach appears quite tricky since it requires me to list all the possible methods accepting a DataFrame object;

  3. Inherit from the AbstractDataFrame type:

mutable struct DataFrame_withMeta <: AbstractDataFrame
  meta::Any
  ...
end

where ... is the actual content of the DataFrame structure. This means that each time the DataFrame structure is changed/updated I will also need to change DataFrame_withMeta accordingly. Moreover, this would only allow interoperability with packages accepting AbstractDataFrame objects, not DataFrame ones;

  4. Issue a PR to the DataFrames.jl package maintainers where I simply add a meta field to the relevant structures, to be accessed as follows:
df = DataFrame(:col1=>1:10, :col2=>rand(10))
df.meta[:source] = "www.some.site"
df.meta[:col1, :unit] = "km / s"

This is the easiest and most straightforward approach: it does not add any package dependency or breaking change. Moreover, it would allow packages which return DataFrame objects (such as RDatasets, CSV, etc.) to provide metadata, and packages which accept DataFrame objects (such as Gadfly) to exploit metadata information.

The drawback is that this change would not add any new functionality to the DataFrames package itself, since the meta facility would mainly be used by related packages. Hence, one could argue there is no point in adding it to the main DataFrames package.

Is there any other simple and effective solution?

(sorry for the very long question… :wink:)


I think this is a fantastic idea and was going to propose a similar thing myself. There are a lot of use cases where you simply cannot work with a dataset without metadata attached, like World Development Indicators or large surveys.

This is the easiest and most straightforward approach: it does not add any package dependency or breaking change. Moreover, it would allow packages which return DataFrame objects (such as RDatasets, CSV, etc.) to provide metadata, and packages which accept DataFrame objects (such as Gadfly) to exploit metadata information.

I think this highlights the main issue. We wouldn’t explicitly be adding functionality to DataFrame objects but this change would require major updates to existing plotting libraries.

Here is my own use case for metadata in dataframes that I have been meaning to post here.

With impact evaluations based on survey data, you want to use a variety of smaller questions to aggregate into a larger index. For instance, our financial well-being index is composed of current income, durable assets, current consumption, and many other variables. We construct the variables in the variable-construction code; then, when it comes time for the analysis, it's a pain to always go back and forth to figure out what is in each variable.

The way I get around this in Stata is to wrap variable construction in a small program:

program define myStandardize
    syntax varlist, newvar(name)
    gen `newvar' = 0
    foreach var of varlist `varlist' {
        replace `newvar' = ... // standardized `var'
    }
    note `newvar': A standardized additive index of `varlist'
end

This way I can use the command note list to see how each variable was constructed, without having to sort through thousands of lines of cleaning code.
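For comparison, here is a rough Julia analogue of the Stata pattern above. All names (standardized_index!, the notes Dict) are hypothetical, and a plain Dict of columns stands in for a DataFrame so the sketch is self-contained:

```julia
# Build a standardized additive index from a set of columns, and
# record how it was constructed in a `notes` Dict.
notes = Dict{Symbol,String}()

function standardized_index!(notes, cols, names, newname)
    n = length(cols[first(names)])
    idx = zeros(n)
    for name in names
        v = cols[name]
        m = sum(v) / n                          # column mean
        s = sqrt(sum((v .- m) .^ 2) / (n - 1))  # sample std. deviation
        idx .+= (v .- m) ./ s                   # add the standardized column
    end
    cols[newname] = idx
    notes[newname] = "A standardized additive index of $(join(names, ", "))"
    return idx
end

cols = Dict(:income => [1.0, 2.0, 3.0], :assets => [10.0, 20.0, 30.0])
standardized_index!(notes, cols, [:income, :assets], :fwb)
# `notes[:fwb]` now plays the role of Stata's `note list` entry
```

The point is the same as in Stata: the provenance of each constructed variable travels in the notes Dict rather than being buried in the cleaning code.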

This is just one, albeit niche, use of metadata in dataframes. The most pressing reasons to add metadata are super large datasets and automatic creation of tables. If you don’t think metadata belongs in DataFrames, I think you are ignoring a lot of very common use cases from the social sciences.

I think the best place to start would be to submit a PR to DataFrames where the metadata is just a simple Dict from column names to strings. Then we can expand on that.
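To make the starting point concrete, a minimal sketch of that simplest design, a Dict from column names to label strings kept alongside the table (names here are hypothetical, not part of any real API):

```julia
# Per-column metadata as a plain Dict of label strings.
colmeta = Dict{Symbol,String}(
    :income => "Current income (household)",
    :assets => "Standardized additive index of durable assets",
)

# Looking up a label, with a fallback for unlabeled columns:
collabel(meta, name) = get(meta, name, "(no label)")
```

A note-list-style summary then reduces to iterating over the Dict, and expanding later (e.g. to Dict{Symbol,Any}) would not break this access pattern.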


I rather liked this suggestion in that other thread:

Generally speaking, I think in julia it’s better to avoid accessing the internals of an object directly (eg a.meta[:source] = "www.some.site") since a function can be generic and do similar things to objects that have different internal structure.
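As a hedged sketch of that accessor-function style: generic code calls a metadata function instead of reaching for a .meta field, so the internal layout can change freely. The names metadata, setmeta! and MyTable below are hypothetical, not an actual DataFrames API:

```julia
# Generic entry point; any type can opt in by adding a method.
metadata(x) = error("no metadata defined for $(typeof(x))")
setmeta!(x, key, value) = (metadata(x)[key] = value; x)

struct MyTable
    meta::Dict{Any,Any}   # internal detail, never touched directly
end
metadata(t::MyTable) = t.meta

t = MyTable(Dict{Any,Any}())
setmeta!(t, :source, "www.some.site")
```

Code written against metadata/setmeta! works unchanged for any future type that defines a metadata method, whatever its internals look like.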


Well, my proposal no. 4 would not be a breaking change, hence all plotting libraries would keep working seamlessly. An update is required only if a plotting package wants to use the new meta facility.

True. I suppose that in the end, however, we would want an ecosystem where it is very easy to replace :x with the appropriate label from a dataframe wherever it appears in a plot.

Unfortunately I have to agree.
Unfortunately, because I think that Julia would greatly benefit from some kind of access control for struct fields (such as private in C++). Access control is too often mistaken for a way to hide internal details, while it is something completely different: it is a way to organize data structures and clearly distinguish what is supposed to be read/written by users from what is not. But this is of course off topic…

Exactly!

Maybe off topic, and there are certainly people more knowledgeable about this than me, but I'll lightly disagree :smile: . It's just a different way of thinking about access control: you clearly know whether something should be accessible to users by whether there are methods defined for reading/writing it. This makes writing generic code much more pleasant, since I don't have to know about the internals of an Array or a DataFrame; in fact, the internals can be whatever the authors of those objects want, yet I can use filter! on both and know what to expect.

Or if I end up implementing my MetaDatum object, and write a getmeta(MetaDatum, inds...) method, it can be generic with the getmeta(DataFrame, inds...) method if it ever gets written, even if the internals of MetaDatum and DataFrames are wildly different. I think this ends up being one of the best things about julia from a development standpoint…

I think this essentially means that whatever metadata we have has to come in at least two kinds: one for “nice” labels that are meant to be displayed, and another that records things about the variable that the user of a dataset needs to know. Stata handles this via a 'label' metadata category and a 'notes' metadata category.

Today or tomorrow I will put together a PR to get the ball rolling. Unless you beat me to it.


I’m already on it … :wink:

Let me slightly disagree :grin:. It mainly depends on the situation; consider the following:

mutable struct Example
  private_var0::Int
  public_var1::Int
  public_var2::Int
  ...
  public_var1000::Int
end

Would you write two methods (one to read and one to write) for each of the 1000 fields public_var1, public_var2, etc., just to remain agnostic of the private_var0 field?
Or would you simply prefer to use .public_var1, .public_var2, and so on?

Again, this is too off topic; it would be interesting to discuss it somewhere else, but I would keep the focus of this discussion on metadata :wink:

At least as far as I’m concerned, a PR implementing meta-data support in DataFrames would be welcome.


I hope that whatever approach to metadata is taken, it is as generic as possible (which seemed to be the case in some of the above proposals), and that it can keep track of more than one kind of metadata (i.e. one from the database, one for display labels, etc.). Trying to merge them would be a difficult problem; I think it's better to be able to handle the different sorts from the start.

Done: https://github.com/JuliaData/DataFrames.jl/pull/1413

In the PR comments you will find the description and an example.
I hope the approach is sufficiently generic.

There can be several approaches to distinguishing a database label from a plot one, and none is obviously right to me. The current PR implements metadata as a Dict{Any,Any}, so I believe there is enough room for specialization.

Comments and suggestions are more than welcome!

Maybe you could put Dict{Any,Any} into that dict then, i.e. :db => Dict(...), :labels => Dict(...), etc.

Well, a :db entry in the metadata is not necessarily needed in all cases, e.g. if you create a DataFrame directly in Julia from calculated values. The same applies to :labels. Hence, I would suggest the opposite: setmeta!(df, :db, @NT(label="DB label"))
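A small sketch of this suggestion, with nesting added only where it is needed (setmeta! is hypothetical, and a modern named tuple literal stands in for the @NT macro):

```julia
# Attach a whole group of related metadata under one key only when
# that group actually exists for the table at hand.
setmeta!(meta::Dict, key, value) = (meta[key] = value; meta)

meta = Dict{Any,Any}()                      # table built from computed
                                            # values: no :db entry at all
setmeta!(meta, :db, (label = "DB label",))  # added only when loading
                                            # from a database
```

A sink can then simply check haskey(meta, :db) instead of every table carrying empty :db and :labels slots.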

The difficulty with metadata is that it is really hard to find an implementation which is sufficiently general and informative, i.e. one where all possible data sources can provide all the necessary metadata, and all sinks know how to interpret it.

A Dict{Any,Any} ensures no assumption is made about the communication between the source and the sink; it simply provides the “communication channel”.

If I understand that correctly, that's similar to what I was suggesting: a 3-level approach, where the top-level metadata Dict has a :db tag, which then stores all of the metadata from the database. You've represented that as a named tuple instead of another level of Dict, which will be more efficient if the data is fixed (as it probably would be once loaded into Julia from a database).

My database background was mostly with fast distributed persistent multi-dimensional associative arrays,
so handling that sort of thing with multiple levels was a no-brainer.

Can you check out my package Labels.jl at GitHub - mwsohn/Labels.jl: Provides functionality to attach variable and value labels to DataFrames? This is a simple package that implements functionality by which variable labels and value labels can be attached to DataFrame columns and values. Maybe similar functionality could be implemented as part of the metadata. The Labels object can exist separately from a DataFrame object, or it can be part of it.

Yes, the top level should be one Dict{Any,Any} for the table as a whole, and one Dict for each column, in order to accommodate all the possible key/value pairs required in each situation.
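A minimal self-contained sketch of that two-level layout, one Dict for the table as a whole plus one Dict per column (struct and function names are hypothetical, not the PR's actual API):

```julia
# Table-level metadata plus an on-demand Dict per column.
struct TableMeta
    table::Dict{Any,Any}
    cols::Dict{Symbol,Dict{Any,Any}}
end
TableMeta() = TableMeta(Dict{Any,Any}(), Dict{Symbol,Dict{Any,Any}}())

# Fetch (creating on demand) the metadata Dict of a given column:
colmeta!(m::TableMeta, col::Symbol) = get!(m.cols, col, Dict{Any,Any}())

m = TableMeta()
m.table[:source] = "www.some.site"
colmeta!(m, :col1)[:unit] = "km / s"   # values are Any, so a Unitful
                                       # quantity would fit as well
```

Because both levels are Dict{Any,Any}, each situation can store whatever key/value pairs it needs without any schema being imposed up front.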

Nice, thank you for pointing it out.

However, I would like metadata to be part of the DataFrame object, i.e. when I write df2 = copy(df1) I wish to copy all the metadata as well.

Moreover, the metadata value should be stored as Any not String, in order to accommodate all kinds of metadata (such as, e.g., a Unitful object).