DataFrames.jl: metadata

The discussion on metadata in DataFrames.jl started 10 years ago, see this issue.

In DataFrames.jl we decided to finally settle the issue. The relevant PR is here.

The fact that the metadata issue has been open for 10 years reflects the complexity of this decision. In the end we have three major options to choose from. In this post I would like to open a vote on which of the approaches the community would prefer most. All comments are welcome.

Let me summarize the options:

  1. Do not add metadata to DataFrames.jl; instead develop some other more generic mechanism for Tables.jl tables that DataFrames.jl could leverage;
    • pros: this will be most composable, most Julian, and easiest to maintain from the DataFrames.jl side, as it will not add any complexity to the package;
    • cons: it is not clear if it will ever happen (cf. the general-purpose Tokazama/Metadata.jl package, “Generic interface for attaching metadata to stuff”, which is not Tables.jl specific; we want something that could seamlessly integrate with persistence options offered by packages like Arrow.jl or Parquet2.jl)
  2. Add metadata to the DataFrame object but do not propagate it under transformations
    • pros: metadata will never be propagated incorrectly
    • cons: if you want to propagate metadata, you have to do it manually as a follow-up to the transformation
  3. Use a complex set of rules for how metadata should be propagated (this is the functionality implemented in PR #3055 to JuliaData/DataFrames.jl, “Metadata on data frame and column level” by bkamins, so you can check the documentation section of the PR to see the rules):
    • pros: metadata gets propagated in the cases where propagating it makes sense most of the time
    • cons: sometimes metadata is propagated incorrectly, in which case you need to call the dropmetadata! function on a data frame to remove metadata from it (fortunately this is a very cheap and easily chained operation, so it is relatively easy to fall back to option 2 if one does not want metadata propagation); essentially this design means that preferably only “very stable” metadata (like a descriptive column label or the source of data for the column) should be stored (some ecosystems store “fragile” metadata like the column mean, and such metadata could be invalidated when e.g. taking a subset of a data frame)

Could you please vote below what you prefer us to do? Thank you!

  • No metadata in DataFrames.jl
  • Add metadata to DataFrame, but never propagate metadata
  • Add metadata to DataFrame, and propagate it (accepting that it will be incorrect sometimes)
  • Other option (please comment :smile:)

Your input will be much appreciated (and will finally help us resolve an issue that has been open for 10 years now, and on which we have already failed several times to reach a decision).

Will the final solution live in DataFrames.jl or an extension package?

Hey @bkamins, my personal vote was for option 3. I was thinking from a greedy user perspective: yes, I would like to have my metadata and store it too (to riff on the “have my cake and eat it too” aphorism). As a question from a development point of view, would this get in the way of DataFrames.jl continuing to be developed easily? I love maximum flexibility, but my only worry is that it might come at the cost of stagnation in package development. Thanks!

If we add metadata to DataFrame, it must live in DataFrames.jl. Option 1 (do not add metadata) is exactly the idea of having it in some other package (but as I have commented, it is not even clear now how it would be done).

No - it would not affect DataFrames.jl development.

The PR I have done now is huge (changing almost everything in DataFrames.jl), but after this PR is merged adding new functionalities will be easy (the only thing will be that each time we will have to decide what to do with metadata when we add new functionality, but this is a similar issue to the “what name to give to some function” question - easy to implement, but sometimes hard to decide).

I just want to note that we should avoid feature creep in general. It’s a spectrum, but I wanted to leave that note here.

I agree with you for sure. I was torn between this and voting for no metadata in DataFrames, to keep metadata separate from DataFrames. My greediness won out, and I chose having metadata all in one place as opposed to having something separate, as it would be nice for me to have one less workflow to worry about. It gets a bit murky, as I feel metadata handling and DataFrames go well together, but I can see an argument for keeping it out. What is your perspective, @jling?

I voted for option 2 but I am also open to option 1.

I don’t see how automatic propagation could work. I can do arbitrary transformations to a column. How can you possibly tell when some transformation invalidates the metadata (which can also be arbitrary information)?

How can you possibly tell when some transformation invalidates the metadata

It cannot, but the benefits of propagation outweigh the costs. Let’s say you do the following

julia> using DataFramesMeta;

julia> df = DataFrame(sales = [100, 300, 200]);

julia> colmetadata(df, :sales)["label"] = "Sales (USD)";

julia> transform!(df, :sales => ByRow(log) => :sales);

julia> colmetadata(df, :sales)["label"]
"Sales (USD)"

then sure, your metadata for :sales is off.

But imagine you have a long chain block of many transformations, one of which is

julia> using DataFramesMeta;

julia> df = DataFrame(sales = [100, 300, missing]);

julia> colmetadata(df, :sales)["label"] = "Sales (USD)";
# Crap, 0 got incorrectly coded as missing somewhere
julia> @rtransform! df :sales = ismissing(:sales) ? 0 : :sales

Ensuring propagation for every “small fix” you do to the data will be a huge pain. Way too much of a pain to deal with.

And note: propagation in transform etc. only happens when

  1. The destination is the same as the source column (like in the @rtransform call above). This is the equivalent of Stata’s replace.
  2. The transformation is of the form :x => :y, i.e. no copy is made.
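
The two conditions above can be written down as a small predicate. This is only a toy sketch in plain Julia, not DataFrames.jl code: the helper name propagates_colmetadata and its arguments are hypothetical, and a plain :x => :y rename is modelled by passing nothing instead of a function.

```julia
# Toy model of the two propagation conditions listed above.
# `propagates_colmetadata` is a hypothetical helper, not a DataFrames.jl function.
# `fun === nothing` stands for a plain `:x => :y` rename with no function applied.
function propagates_colmetadata(src::Symbol, fun::Union{Function,Nothing}, dest::Symbol)
    same_column = src == dest        # rule 1: destination equals source
    plain_rename = fun === nothing   # rule 2: `:x => :y`, data passed through as-is
    return same_column || plain_rename
end

propagates_colmetadata(:sales, log, :sales)        # true: same column name
propagates_colmetadata(:sales, nothing, :revenue)  # true: plain rename
propagates_colmetadata(:sales, log, :log_sales)    # false: new column computed by a function
```

Under this reading, a transformation that writes to a new column name never carries the label over, which is what makes the "make a new variable" advice work.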

I think the current behavior (on the PR as it is now, with propagation) is best.

If you are modifying a variable so much that it needs a new label, make a new variable! And on balance, transformations which preserve the variable name will keep the metadata. This is a simple rule for practitioners to follow.

That is why by default when you specify a transformation like :input_column => some_function in DataFrames.jl the auto-generated column name is not :input_column but :input_column_some_function. The point always was that if the “meaning” of data in a column changes then the column should have a different name.
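
The naming convention can be mimicked in a line of plain Julia. This is a toy reimplementation for illustration only; default_output_name is a hypothetical helper, not the function DataFrames.jl uses internally.

```julia
# Toy reimplementation of the naming convention described above;
# `default_output_name` is a hypothetical helper, not part of DataFrames.jl.
default_output_name(col::Symbol, fun::Function) = Symbol(col, "_", nameof(fun))

default_output_name(:sales, log)   # :sales_log
```

Because the generated name differs from the source name, the result is treated as a new column rather than an edit of the old one.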

Another case to keep in mind when column metadata is kept is row subsetting, e.g. df[some_condition, :] will propagate metadata. Again, it could potentially invalidate some metadata (e.g. if you kept the mean of the column as metadata), but the reasoning was that most of the time row subsetting will not invalidate “very stable” metadata.
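
The difference between “very stable” and “fragile” metadata under row subsetting can be seen in a toy example. This is a sketch that uses a plain vector and a Dict as stand-ins for a column and its metadata; it does not use the DataFrames.jl metadata API, and all variable names are made up for illustration.

```julia
using Statistics

# Toy stand-in for one column plus its metadata; not the DataFrames.jl API.
sales = [100, 300, 200]
meta = Dict{String,Any}("label" => "Sales (USD)", "mean" => mean(sales))

# Row subsetting that carries the metadata over unchanged.
keep = sales .>= 200
sales_subset = sales[keep]
meta_subset = copy(meta)

meta_subset["label"]                        # still a valid description of the column
meta_subset["mean"] == mean(sales_subset)   # false: the stored mean is now stale
```

The label survives the subset unharmed, while the stored mean silently describes rows that are no longer there.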

@pdeffebach - if we decide to keep metadata propagation (and I assume previous Stata users are probably in favor of this option), I think it would be great to prepare a guide with best practices for how metadata should be used (and how it should not be used).

That is a very simple example and already shows how fragile propagation is. In your example, the metadata is basically not needed. Nevertheless, it is easy to invalidate. Suppose you turned it into log growth rates… then is “Sales (USD)” still a good label?

Suppose instead it was “Sales of Top 3 Firms (USD)”. Now I push! or subset to change the number of firms. Should it propagate the metadata? [Wow @bkamins and I must have experienced some telepathy here]

I guess my point is that basically any transform is likely to invalidate the metadata. How wrong it becomes then is up to the user’s tolerance for incorrectness or imprecision. But I’m not sure that really saves any trouble over just using informative column names.

@pdeffebach - if you were willing to - maybe you could write some brief comments, from your experience, why metadata is needed and what are its uses in practice? Maybe this would help other users to understand your experience with using metadata? Thank you!

If you are turning the variable :sales into something that no longer represents the level of sales of a firm, and rather a growth rate, then you are already committing an error by naming it :sales. You should create a new variable entirely. DataFrames.jl will not propagate this metadata and the problem is solved.

Yes.

I have long been an advocate of metadata. For context, much of my job as an RA and a researcher is analyzing household surveys in Stata. I believe column metadata similar to Stata will help adoption of DataFrames.jl for this kind of data analysis.

I use metadata in 3 ways.

  1. Raw survey data. Programs like SurveyCTO etc. export Stata .dta files with column labels attached. When you first read in the data you often see variables like a1, a2, etc. corresponding to survey questions, and the question text is given as the label. You use the labels to re-name the variables to something easier to work with.

In this instance, dropping metadata on subsets would be very annoying. Imagine if I wrote

use raw_data.dta, clear

// Keep only completed interviews
keep if complete == 1

and suddenly all of the variable labels necessary for exploring the data disappeared. This does not happen in Stata. Metadata is kept, and we should do the same.

Analogously, often the survey results come in many different data sets, and you have to merge or join them together. It would be very annoying if every time I merged or joined I had to manually propagate the metadata.

Furthermore, having metadata exist separate from the data frame would make it impossible to track joins and appends. And there may be thousands of columns. The only workable solution is to have the metadata exist within the data frame.

  2. Keeping track of transformations. In Stata, I frequently make variables which are a function of other ones, for example the mean of a binary index. To do this I write a function.

cap program drop mean_binary_index
program define mean_binary_index
	syntax varlist, newname(name)
	egen `newname' = rowmean(`varlist')
	note `newname': "Row mean of `varlist'" 
end

This creates a new variable which is the average of inputted variables. In the last line of the function, I add a note to the new column created. This way, in the future, someone can do

note list myvar

and see how the variable was constructed. It would be a shame if I made a tweak to myvar later on and we lost crucial information about how the variable was constructed.

In Stata “notes” can be longer than “labels”, and you can attach multiple notes to a variable, but it can only have one label. This difference is not super relevant for metadata in DataFrames.jl due to Julia’s flexibility.

  3. Pretty printing. I pride myself on the tables I can make in LaTeX using Stata. And it wouldn’t be possible without variable labels. Propagation is less important here, as I usually assign labels just before the analysis stage when I make tables. Here is an example of a table I made which was published. The row names on the left hand side are variable labels.

[Image: 2022-07-20_14-12, the published table referred to above]

This seems a problem very similar to the one we had in DEDataArrays.jl. See the discussion here: DiffEqs: Hybrid (continuous+discrete) system / periodic callback - #19 by ChrisRackauckas

In this case, we had a structure with a state and a set of parameters. The question of how to propagate those parameters at each operation is hard to answer. @ChrisRackauckas eventually deprecated DEDataArrays.jl because it would be difficult to support every corner case.

Hence, given this past experience, I would vote for no metadata in DataFrames.jl. If the community wants, then we can add things to Tables.jl or another package.

Personally I’d prefer a separate package for that. Maybe the current problem with metadata packages is that they lack the traction to establish a standard, but the moment DataFrames develops (or embraces) a metadata package that plays nicely with DataFrames, this traction is definitely there.

But in any case, I think any of the current suggestions is better than not having them at all.

I want to emphasize my point above: when you consider propagation rules, the fact that a DataFrame may have 10,000 columns, and operations like joining and merging, it’s hard to imagine a third-party package being able to implement a reasonable workflow. For metadata to be usable, it has to live in DataFrames.jl.

I agree that metadata is more about communicating information about the data to the reader. It lets the user of the program keep track of what’s going on. I would not advocate for using it to store, say, intermediate calculations that are read in later in the analysis code. This may also resolve @Ronis_BR’s concerns above. If it were the data, it would not be the metadata.

sales as a variable name could equally apply to the level of sales, log(sales), and diff(log(sales)). That is sort of my point about metadata and automatic propagation. When you are writing code, you know at each point what the column name sales is meant to actually represent. If you encode that context into metadata, it will inevitably (and probably quickly) become incorrect or outdated.

However, I suppose that really my comments are about the usefulness of propagation. I do not see the value, given that I would probably end up manually handling changes to metadata quite often if I were to use it in my pipelines. That doesn’t mean that it shouldn’t do so automatically, if that is what others want. As long as the system is explicit about the behavior, I guess it might not be an issue as long as there are no side-effects from potentially stale metadata. E.g., functions should not dispatch on metadata, in case it is no longer accurate.

The point about using labels to help pretty print is good, but as you say it is unrelated to propagation.

I appreciate the comments. However, let me note that if I were an RA and I came across a data set and code where someone used sales to represent log(sales) or diff(log(sales)), I would be somewhat confused or annoyed. sales means only the level of sales. log_sales and diff_log_sales are appropriate names.
