How to add metadata info to a DataFrame?

OK, let’s summarize a little bit:

  • Adding metadata to DataFrame (or similar structures) is very useful;
  • There is no point (for now) in trying to attach semantic meaning to metadata. The best we can do is to propagate metadata appropriately through copies/slices/views, and discard it when more complicated transformations are involved;
  • Two approaches can be envisioned: attaching metadata to the DataFrame (or similar structure) as a whole, or attaching it to individual Arrays. Both have pros and cons, and we will likely need both;
  • For now, a reasonable way to store metadata is a Dict{Symbol, Any}, regardless of the approach taken;
  • Concerning the implementation with DataFrames:
    • there is a PR to encapsulate metadata within the DataFrame structure, at both the global and the column level. The flaw in this implementation is that the metadata, stored in the colmeta field of the DataFrame structure, does not add any functionality to the package itself. It would be much better to leave the DataFrames package as it currently is and wrap a DataFrame in a container along with its metadata;
    • I tried the wrapping approach here, but it turns out there is a lot of boilerplate code to be written;
    • the difficulty in extending the DataFrame object may ultimately lie in the way the package has been implemented. E.g., the function Base.getindex(df::DataFrame, col_ind::ColumnIndex) in the package should actually accept an AbstractDataFrame as input, not a DataFrame. Moreover, the SubDataFrame struct is a subtype of AbstractDataFrame, but the SubDataFrame and DataFrame structures do not share the same fields;
    • I am not sure whether these are issues or intentional design decisions for the DataFrames package, but they prevent the DataFrame code from being easily reused (see here for a discussion on code reuse by means of composition).
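
To make the wrapping idea a bit more concrete, here is a minimal sketch of a container that carries a Dict{Symbol, Any} alongside the data. It is not the code from the PR or from my attempt linked above: the names WithMeta, slice and transform_data are hypothetical, and a plain Matrix stands in for the DataFrame so that the snippet runs without any package. It only illustrates the propagation rule (keep metadata on copies and slices, drop it on arbitrary transformations) and the forwarding boilerplate that wrapping requires.

```julia
# Hypothetical wrapper type; none of these names exist in DataFrames.jl.
struct WithMeta{T}
    data::T
    meta::Dict{Symbol, Any}
end

# Convenience constructor: start with empty metadata.
WithMeta(data) = WithMeta(data, Dict{Symbol, Any}())

# Copies keep the metadata; the Dict is copied too, so the two
# objects do not share mutable state.
Base.copy(w::WithMeta) = WithMeta(copy(w.data), copy(w.meta))

# Forwarding boilerplate: every method of the wrapped type that we
# want to use on the wrapper has to be forwarded by hand.
Base.size(w::WithMeta) = size(w.data)
Base.getindex(w::WithMeta, inds...) = getindex(w.data, inds...)

# Row slices propagate the metadata.
slice(w::WithMeta, rows) = WithMeta(w.data[rows, :], copy(w.meta))

# Arbitrary transformations discard the metadata, since we cannot
# know whether it is still meaningful for the result.
transform_data(f, w::WithMeta) = WithMeta(f(w.data))

w = WithMeta([1 2; 3 4; 5 6], Dict{Symbol, Any}(:source => "sensor A"))
s = slice(w, 1:2)                      # metadata kept
t = transform_data(x -> x .* 2, w)     # metadata dropped
```

Even in this toy version, size and getindex had to be forwarded explicitly. Doing the same for the full DataFrame interface is where the boilerplate mentioned above comes from.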