How to add metadata info to a DataFrame?

Interesting package, thanks for pointing to it! However I’m not sure the best approach for Julia is to store this kind of information in the DataFrame: then functions operating directly on vectors don’t have access to it at all, and methods need to be written specially for DataFrame to use this meta-data. I tend to think that custom array types like CategoricalArray are more appropriate in Julia, as they automatically work for any function supporting AbstractArray. Do you think they can suit your use case?

I had run into this necessity as well and I think @nalimilan 's suggestion makes the most sense, so I quickly prototyped a MetadataArrays package. A MetadataArray is simply a combination of an AbstractArray and some metadata, and it can be used as a regular Array.
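
Roughly, the core of it looks like this (a minimal sketch of the idea, not the actual package source): wrap any AbstractArray together with some metadata, and forward the AbstractArray interface to the parent.

struct MetadataArray{T,N,P<:AbstractArray{T,N},M} <: AbstractArray{T,N}
    parent::P
    metadata::M
end

Base.size(A::MetadataArray) = size(A.parent)
Base.getindex(A::MetadataArray, i::Int...) = A.parent[i...]
Base.setindex!(A::MetadataArray, v, i::Int...) = (A.parent[i...] = v)

metadata(A::MetadataArray) = A.metadata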

I guess you could try it and let me know if it misses some features or if it’s already good enough for your use case.

A concern is that as soon as you filter or subselect the data you lose the metadata, so I should probably add some methods to preserve the metadata when slicing / taking a view.

UPDATE: now taking a view, slicing and calling similar all preserve metadata
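
For example, preservation under similar and view can be obtained with a couple of methods like these (again a sketch, assuming the wrapper above):

# reuse the parent's similar, then reattach the metadata
Base.similar(A::MetadataArray, ::Type{T}, dims::Dims) where {T} =
    MetadataArray(similar(A.parent, T, dims), A.metadata)

# views of the parent keep pointing at the same metadata
Base.view(A::MetadataArray, idxs...) =
    MetadataArray(view(A.parent, idxs...), A.metadata)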

For social science work that involves plotting, or if we had the goal of a plotting ecosystem that supported automatic labeling, this would effectively mean using MetadataArray for everything and no longer using normal arrays. It would be two entire systems to maintain.

All plotting packages should accept AbstractArray instead of Array (I think that’s already the case, correct me if I’m wrong). MetadataArray is already fully compliant with the AbstractArray interface.

As an experiment, you could try putting MetadataArrays instead of Arrays in your DataFrame and see whether everything just works; it shouldn’t require anything more than the ~20 lines of code in MetadataArrays. If it doesn’t, I’d say it’s either a bug in DataFrames or in one of my 20 LOC, but it shouldn’t require major work.

As far as automatic labelling in plots goes, I can only speak for StatPlots as it is the example I know best. It would be quite easy to instruct the @df macro (which, again, is just a few lines of code) to look in the metadata for additional info, if MetadataArrays become an accepted solution.
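
For instance (purely hypothetical helper names; this assumes the metadata is a Dict with a :label key, as in the sketch earlier in the thread):

# fall back to a default label for arrays that carry no metadata
labelof(v, default::String) =
    v isa MetadataArray ? get(metadata(v), :label, default) : default

# a macro like @df could then, in effect, do:
# plot(x, y; xlabel = labelof(x, "x"), ylabel = labelof(y, "y"))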

It’s doable. But I don’t fully understand why an ecosystem where people use MetadataArrays exclusively is a better alternative to including a Dict along with a DataFrame.

I’m not sure how generic MetadataArrays is, but it might allow adding metadata to any AbstractArray, not just DataFrames.

It might also be useful to use traits to handle whether or not something has metadata, and whether it has particular types of metadata, to make it more extensible.
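
For example, a Holy-traits-style sketch (all names here are hypothetical):

abstract type MetadataTrait end
struct HasMetadata <: MetadataTrait end
struct NoMetadata <: MetadataTrait end

metadatatrait(::Type) = NoMetadata()   # default: plain types carry no metadata

# dispatch on the trait rather than on the array type itself
getmetadata(x) = getmetadata(metadatatrait(typeof(x)), x)
getmetadata(::HasMetadata, x) = x.metadata
getmetadata(::NoMetadata, x) = nothing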

I agree both solutions are reasonable: store the meta-data in the data frame, or in each column vector. Each has advantages and drawbacks:

  • In the data frame:
    • Advantages: columns are just standard Vector objects, which is simpler for users (it feels kind of weird to avoid using the standard array type in very common use cases) and to implement
    • Drawbacks: meta-data is lost as soon as a column is used separately from its data frame
  • In the column vector:
    • Advantages: meta-data is kept with the data it describes and some functions could make use of that even without supporting data frames
    • Drawbacks: requires using a custom array type; can become quite involved if you need a MetaDataVector{CategoricalVector{...}} (which will be quite common). For example, we’ll need custom recode methods which call the efficient CategoricalArray method instead of the fallback AbstractArray one; see the sketch after this list. Any custom array type may need similar tricks, which (currently) requires one package to depend on the other.
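
To illustrate that drawback, a hypothetical specialized method (recode itself is real and lives in CategoricalArrays; MetadataArray is the wrapper sketched above):

using CategoricalArrays

# without a method like this, recode on the wrapper hits the generic
# AbstractArray fallback instead of the efficient CategoricalArray code path
function CategoricalArrays.recode(A::MetadataArray, pairs::Pair...)
    MetadataArray(recode(A.parent, pairs...), A.metadata)
end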

It seems that dplyr has chosen the second approach (see these issues). In particular, haven::read_dta stores meta-data from Stata as attributes of the column vectors. The memisc, Hmisc and labelled packages do the same. But of course R is much more limited in terms of dispatch, so the situation is quite different.

Something we should also think about is how meta-data could be preserved across streaming operations with Query and DataStreams. CategoricalArray handles this via the special CategoricalValue element type, but a MetadataArray just contains normal entries, so you can’t retrieve the meta-data from individual entries. I’m not sure which of the two ways of storing meta-data would be easier to handle with Query and DataStreams. This matters in particular when combining Query with plots à la StatsPlots. We should probably resolve this issue before choosing the most appropriate approach. Maybe an extension of schemas to include meta-data in addition to column names and types would be useful; in that case either approach would be OK. Cc: @davidanthoff @mkborregaard @quinnj.

One key issue is metadata validity after transformations, especially subsetting. Handling this is difficult without putting some structure on the metadata. Pairing (some) metadata with columns is a specific solution, which would work well with column subsets.

The R solution of allowing arbitrary attributes to all objects is initially appealing, but quickly leads to difficulties where some attributes have semantic consequences, and one always has to check in functions.

Perhaps a generic interface for metadata, which specific metadata implementations would have to implement, plus generic storage for these (e.g. a tuple of such objects), could be worthwhile.

I prefer keeping the metadata with columns. For the common case of a CategoricalVector, we could embed the metadata in the CategoricalArray. A metadata trait could describe whether an AbstractVector has metadata.

Wouldn’t it make sense to just have an appropriate <: AbstractVector that contains the metadata? AFAICT DataFrame can then handle that transparently.

The key issue is that any such type would have to be incredibly generic. If someone were to make their own vector type, with new functions defined for it, and someone else wanted to use a metadata version of that type, would they have to overload the metadata vector type themselves?

# hypothetical wrapper and custom vector types, fleshed out so the example runs
struct MetaDataVector{T,P<:AbstractVector{T},M} <: AbstractVector{T}
    parent::P
    metadata::M
end
Base.size(v::MetaDataVector) = size(v.parent)
Base.getindex(v::MetaDataVector, i::Int) = v.parent[i]

struct MyNewVector <: AbstractVector{Int} end
Base.size(::MyNewVector) = (3,)
Base.getindex(::MyNewVector, i::Int) = 0
Base.sum(::MyNewVector) = 1   # a custom method beyond the AbstractArray interface

customVector = MyNewVector()
d = Dict(:label => "The label for this vector")
a = MetaDataVector(customVector, d)

sum(a) # returns 0 via the generic fallback, not 1. How do we determine
# this behavior without forwarding the function `sum`?

Is it possible to have a wrapper type that automatically incorporates all of the functionality of its “inner” type?
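
A partial workaround that exists today is to forward an explicit list of functions to the wrapped field, e.g. with Lazy.jl’s @forward macro; but this only covers the functions you know to list, not arbitrary custom methods (sketch, continuing the example above):

using Lazy: @forward

# forwards sum(::MetaDataVector) to the wrapped field, so the custom
# MyNewVector method is reached
@forward MetaDataVector.parent Base.sum

sum(a)  # now returns 1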

Further advantages of the “column based” approach:

  1. DataFrames are not the only type of tabular structure: it seems awkward to implement this for all tabular structures when there already is a mechanism to implement it for all AbstractArrays. Plus, it’s unclear to me that this is only relevant for tables; an Array with metadata is probably a useful structure per se.

  2. It’s much easier to pass this metadata around if it’s at the single array level. For example, if you do df[1:3] you get a new DataFrame: I’m assuming you’d want to pass only the metadata concerning the first three columns, which in the Array strategy happens automatically, whereas with the DataFrame + Dict approach you would need custom code for this (and custom code for renaming columns, custom code for adding columns, etc…). This gets even more extreme with IndexedTables, as they are immutable objects and normally you would rarely transform the initial object in place, but rather return a new object that shares several columns with the original. Here you’d really like to automatically keep only the relevant metadata.

  3. The “column based approach” is actually strictly more general than the other, as you could create a column of MetadataVector(fill((), size(df,1)), df_metadata) and keep your metadata there (see the small sketch after this list).
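
A small sketch of point 3, assuming the MetadataArray wrapper from earlier in the thread:

using DataFrames

df = DataFrame(a = 1:3, b = [0.1, 0.2, 0.3])
df_metadata = Dict(:source => "survey 2018", :note => "toy example")

# a column of empty tuples carries no per-row data, only table-level metadata
df[!, :_meta] = MetadataArray(fill((), size(df, 1)), df_metadata)
metadata(df[!, :_meta])  # -> the table-level Dict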

On the more technical side, MetadataArrays should really always use the specialization/optimization of the underlying AbstractArray. In the case of categorical arrays the two options are:

  1. Make one package depend on the other and implement specialized recode methods for MetadataArray{CategoricalArray}

  2. Use the MetadataArrays package to define an AbstractMetadataArray type. CategoricalArray would then be <: AbstractMetadataArray and could also store metadata somewhere (by defining its own specialization of the metadata function). If the subtyping is too much, I agree with @tshort that one could have a hasmetadata trait.

I’d tend to be in favor of option two, as a CategoricalArray is already a sensible place to store metadata, whereas the MetadataArray would only be required for other AbstractArray types.
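
A sketch of option two (type and function names hypothetical):

abstract type AbstractMetadataArray{T,N} <: AbstractArray{T,N} end

# each concrete subtype decides where its metadata lives
function metadata end

# e.g. MetadataArray would store it in a field, while a CategoricalArray-like
# type could keep it alongside its pool:
# metadata(A::MetadataArray) = A.metadata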

I don’t think that’s the case. There’s no reason why you would be more likely to attach meta-data like a description or a unit to a CategoricalArray than to any other kind of array.

I think in theory attaching meta-data to column vectors is a superior approach, but in practice the case is currently much less clear. @pdeffebach hits the nail on the head when noting that Julia doesn’t currently allow delegating all method calls to the wrapped object: we can only delegate methods we know about, or use the AbstractArray interface, but custom methods beyond that won’t work at all. That’s probably something which will improve at some point, as it’s frequently mentioned as a problem. But in the short term the question is: is the current system good enough? Maybe that’s OK if we typically expect people to only wrap Vector and CategoricalVector in MetadataVectors. But if we expect to support many different kinds of arrays, it’s going to be much more problematic (and adding some code in DataFrames to store meta-data will then be much easier).

For reference, the previous discussion on GitHub is here.

It would be great to collect some use cases before proceeding. What are typical examples that people have in mind when thinking of metadata? Are any/some/all operations on the data meaningful without said metadata? If I transform columns using a function into another column, what happens to metadata?

I am very skeptical of the utility of attaching an unstructured associative collection of arbitrary metadata to objects. It has a lower initial cost and allows a very exploratory approach, but I am concerned that at the point metadata starts affecting semantics, conventions will be used for various special cases, not unlike R, which evolved a very convoluted approach to deal with everything based on metadata.

I would imagine that metadata would record only the actions you really want to record; I don’t think anything needs to be automatic. Imagine we implemented a system where all we had was a Dict from Symbol to an array of strings. The following is an example:

df = @> df begin
    @transform(income_normalized = (:income - mean(:income)) / std(:income))
    @addmeta(:income_normalized, "Normalized version of :income") # pushes to the array in the dict
end

Now, your boss wants to play with the dataset you have spent a long time working on. Unfortunately, they don’t want to read through piles of cleaning code. No worries! They can just call getmeta() on variables to see the relevant steps that were taken.
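
Concretely, the whole mechanism could be as small as this (a sketch; addmeta! and getmeta are the imagined helpers, operating on a plain Dict):

meta = Dict{Symbol,Vector{String}}()

# append a note to the column's history, creating the entry if needed
addmeta!(meta, col::Symbol, note::String) =
    push!(get!(meta, col, String[]), note)

getmeta(meta, col::Symbol) = get(meta, col, String[])

addmeta!(meta, :income_normalized, "Normalized version of :income")
getmeta(meta, :income_normalized)  # -> ["Normalized version of :income"]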

Additionally, I don’t have to worry about a separate Dict object and keeping it in sync: all the information is in the DataFrame. That’s just my desired use case (apart from the nice callable labels described above).

EDIT: Another use case is that in Stata, labels are searchable in the browse window. This is really helpful if you forget the name of a variable but know what it’s about. Obviously it would be non-trivial to implement a reverse-search type thing, but having the names stored in a Dict with the DataFrame would enable a lot of cool functionality like that.
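
Continuing the sketch above, a reverse search over the stored notes is nearly a one-liner:

# return all columns whose notes mention the pattern
searchmeta(meta, pattern::AbstractString) =
    [col for (col, notes) in meta if any(occursin(pattern, n) for n in notes)]

searchmeta(meta, "income")  # -> [:income_normalized]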

I think the main example is the label/description of the variable. Others include conditions of measurement, definitions, survey questions (see memisc). If you transform columns, meta-data is lost, that’s all – except if you just select a variable without modification of course.

I don’t think meta-data should really affect semantics. It just provides additional information used for display or to create codebooks. What problems do you have in mind?

A few comments:

I completely agree, and as I said before:

Therefore my suggestion is simply to avoid trying to attach a strong semantic meaning to metadata, and to accept that the usability of the metadata amounts to how good the agreement is between the software that produced the DataFrame and the software that consumes it (e.g. the plotting package).

More complex solutions are clearly desirable, but very hard (I would say almost impossible) to implement.

Concerning the MetadataArrays package, I would like to clarify that the most important feature I was thinking of while writing the first post is the possibility to add metadata to a table as a whole. Column metadata is welcome if it comes for free, but its usefulness is (IMO) rather limited.

To elaborate a little bit, I’ll describe a few typical use cases in my field (astronomy):

  • I have a table of photometry measurements, but I also need to know which source has been observed, the characteristics of the filters, etc.;
  • I have a catalog of source coordinates, but I also need to know the epoch the coordinates refer to, and whether they are equatorial, galactic or ecliptic, etc.;
  • I have a table of point sources in an image, but I also need to know the astrometry solution for the image in order to attach a coordinate to each pixel;
  • I have a table of flux measurements as a function of time, but I also need to know how those fluxes have been measured and calculated;
  • I have a table downloaded from the web, but I also need to know when it was last updated in order to decide whether to download a new one.

All these situations are easily handled by adding metadata to the table as a whole. Besides, the only use cases for column metadata I can think of are specifying a label, a unit, a factor/offset and… what else?

In summary I think of metadata as a way to describe the details of an entire collection of data, like a table. Attaching metadata to smaller entities (such as a column, or even worse to individual numbers) is IMO not that useful. But of course I could be biased by my use cases.

As a final note, in the last commit of the PR I implemented the possibility to copy metadata while copying/slicing/creating a view on a DataFrame.

Hi all,

To add to this discussion, I have developed a simple Schemata.jl package, which I use to provide guarantees about and highlight issues with the data sets I encounter. I intend to clean it up and release it when Julia 0.7 is released.

From the README:

A Schema is a specification of a data set.

It exists independently of any particular data set, and therefore can be constructed and modified in the absence of a data set.

This package facilitates 3 use cases:

  • Read/write a schema from/to a YAML file. Thus schemata are portable, and a change to a schema does not require recompilation.
  • Compare a data set to a schema and list the non-compliance issues.
  • Transform an existing data set in order to comply with a schema as much as possible (then rerun the compare function to see any outstanding issues).

Note also that a Schema allows for metadata at the column level (type, value support, etc), table level (index, row constraints, etc) and multi-table level (joins).

It seems that something like this is suited to the ecosystem approach. I’m not sure how exactly it would fit, but I see there are schema-like types in the DataStreams ecosystem as well as in the JuliaDB ML system. Suggestions welcome; happy to integrate into the data ecosystem if desired.

Cheers

…and the link

https://github.com/JockLawrie/Schemata.jl

I have to disagree here, as my primary use case would be to store column-associated metadata. I typically have tables where each column is a sample, and I need to store metadata about that sample. I often need operations like “view all samples that have diagnosis x”, etc.

Of course, it might end up making more sense for me to store stuff in a table other than a DataFrame, which is why I sort of wonder if it makes sense to have a generic metadata type that multiple packages can use in various ways, even if it’s basically a thin wrapper around a dictionary.
