DataFrames.jl: metadata

The issues raised by @Zach_Christensen are exactly the reason for the current proposal in DataFrames.jl (assuming it gets accepted - I do not want to rush with this decision as the discussion in this thread shows that the issue requires careful consideration):

  1. We wanted to add something that people who need metadata would find useful now (as the general discussion might still take a lot of time to settle). This will for sure require that people understand when the proposed metadata handling approach should be used (as it will not fit all use cases).
  2. As a starting point we defined a DataAPI.jl interface that is implementation independent. This means that if in the long run a better solution emerges in the Julia ecosystem, we can drop the internal details that are now added to DataFrames.jl and switch to the superior solution, keeping just the DataAPI.jl interface.
  3. Also for these reasons we mark metadata propagation rules as experimental, both to allow changes in the future if needed and to highlight the fact that the proposed metadata should not be used to guide users’ program logic, but is rather helper information that should make users’ lives easier.
  4. Indeed the term “metadata” for what we have in the DataFrames.jl proposal might be debatable; we have chosen it to reflect the fact that in e.g. Apache Arrow this term is used for such structures.

Now I see that if we decide to add metadata to DataFrames.jl we need to write much more about how it should be used (and how it should not be used given current design) in the manual (and then probably @pdeffebach and his experience would be of great help).

What I had in mind is that there would still only be one set of propagation rules. And the short, predefined list of allowed metadata fields would all follow that same set of propagation rules. So, only the metadata fields for which that set of propagation rules makes sense, like label and source, would be supported by DataFrames.jl.

In fact, I’ve only seen a case in this thread for one particular metadata field: label (or description). Perhaps that’s the only metadata field that DataFrames needs to support. Or we could stick to a very minimal list, like the following:

  • description
  • source
  • unit

I think description, source, and unit could all follow the same set of propagation rules (e.g., the rules that are specified in the above PR). No need for separate propagation rules for each of them.

As @alfaromartino mentioned, I don’t think it makes sense to include a feature that will return incorrect results in many cases. For example, we should not support a feature that allows a user to add a column_mean metadata field to a column in their DataFrame, take a subset of their DataFrame, and then propagate the column_mean metadata field to the new DataFrame.

In other words, we should only implement metadata fields with associated propagation rules that we know are correct. Some folks have been complaining about correctness issues in Julia… Just wait until we have arbitrary metadata fields in DataFrames. :joy:
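
To make the column_mean concern concrete, here is a minimal sketch in plain Julia. It does not use the DataFrames.jl API; the vector `sales`, the dictionary `notes`, and the key names are assumptions chosen only for illustration.

```julia
# Hypothetical column with two attached metadata entries.
sales = [10.0, 20.0, 30.0, 40.0]
notes = Dict{String, Any}(
    "description" => "Sales (USD)",
    "column_mean" => sum(sales) / length(sales),  # mean of the full column
)

# Take a subset of the rows and copy the metadata over unchanged,
# which is what a blanket "propagate on subset" rule would do.
subset_sales = sales[sales .> 20.0]
subset_notes = copy(notes)

# The description is still accurate for the subset...
description_ok = subset_notes["description"] == "Sales (USD)"

# ...but the stored mean no longer matches the data it is attached to.
actual_mean = sum(subset_sales) / length(subset_sales)
stale = subset_notes["column_mean"] != actual_mean
```

Both entries were propagated by the same rule, yet only the free-text description survives the subset with its meaning intact.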

So do you propose that the current PR limit metadata to a single value, called e.g. label, note, or description, so that it would be clear that such metadata matches the rules we have?

I want to push back on the issue that DataFrames.jl would be doing something “wrong” when a user writes

@rtransform :sales = log(:sales)

and forwards metadata.

Julia is a tool, and it’s always possible for a user to do dumb things with a tool. For instance, I can write

julia> all_west_coast_states = ["Washington", "Oregon", "California"];

julia> popfirst!(all_west_coast_states);

julia> all_west_coast_states
2-element Vector{String}:
 "Oregon"
 "California"

Should we disallow mutating arrays because the name of the array might refer to something different than it was originally?

The variable label should not be thought of as the absolute source of truth about data. The code is the source of truth. Nonetheless, metadata which is usable (propagates without an API that is a pain to work with) is incredibly valuable in staying organized with very large datasets.

@pdeffebach - but maybe what @CameronBieganek proposes makes sense? Instead of defining a super flexible metadata system (what we do now), we might implement a much more limited one, with only one field that a user can use to store metadata (similar to Stata, if I understand correctly)?

Do you think that in your use cases you would need more than one field per column + one field per whole table?

I think @CameronBieganek is right conceptually that labels and notes are the best use for metadata: nothing people can take “too seriously”, and certainly nothing that people rely on for any strict notion of correctness (which should be determined by the code itself).

But having multiple entries per variable is super useful. Here’s something I worked on today, adding many notes to a variable to keep track of things:

profits_4w_val_6m:
  1.  "Nos dijo que sus ventas del mes pasado fueron pesos ...."
  2.  "bus8_6m"
  3.  "Imputed values based on profits_4w_6m if missing. Performed 0"
      "replacements. Including 0 in the top category, which was given the value
      24000" "(The 95th percentile of profits_4w_val_6m)"
  4.  "Winsorized at 99th percentile"

So it would be limiting to have only one option for a single string, especially given that Julia is a “real” programming language while Stata is not, and thus should be more flexible.

Yes, I think that is one good approach. To satisfy the multiple-notes use case above, perhaps the field could be called notes and it could accept a vector of strings.

That was not the example I used. I agree that for your example the user made a mistake and should have written @rtransform :log_sales = log(:sales). The example I was referring to was the propagation of metadata fields after a call to subset. In that case, DataFrames is doing something wrong because the propagation rule is incorrect in many cases. It’s only correct for some metadata fields like description.

Perhaps a better term for what @pdeffebach is referring to would be “annotations” or “descriptors”? Whatever we call it, the description seems very narrow compared to a lot of other metadata out there. I’m not sure whether it actually addresses propagation though. Limiting to specific keys sounds like a foot-gun, unless those keys are truly the only thing you want to support.

I would hardly call it “many cases”. If the column label is "Top 50 countries in the world by GDP per capita" and someone takes a subset, then that’s technically incorrect. But at the same time, that’s a user problem, and users are allowed to make mistakes. DataFrames.jl needs to provide an API that is easy to use and does the right thing in the vast majority of cases. I believe the current proposal does that.

If the column label were "Sales (USD)", then subsetting is fine. Metadata in a data frame which didn’t persist after a simple subset would be unworkable.

I agree that we shouldn’t define an interface based primarily on people using this incorrectly. It would be helpful to provide ways of injecting rules for propagation though. That places the responsibility on the user and supports a more generic interface. On the other hand, if the attitude is that this is experimental and we have a lot of wiggle room in the future, let’s just implement a restricted set of features now and build on it.
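
As a rough illustration of what “injecting rules for propagation” could look like, here is a minimal sketch in plain Julia. Nothing in it comes from the PR or from DataAPI.jl: the rule functions `keep`, `drop`, and `recompute_mean`, the `rules` table, and the `propagate` helper are all hypothetical names.

```julia
# Hypothetical per-key propagation rules: each rule receives the old value and
# the new column data, and returns the value to keep (or `nothing` to drop it).
keep(value, newdata) = value
drop(value, newdata) = nothing
recompute_mean(value, newdata) = sum(newdata) / length(newdata)

rules = Dict{String, Function}(
    "description" => keep,
    "column_mean" => recompute_mean,
    "row_count"   => drop,
)

# Apply the rules to a metadata dictionary after the column data changed.
# Keys without a registered rule are dropped, as a conservative default.
function propagate(meta::AbstractDict, rules::AbstractDict, newdata)
    out = Dict{String, Any}()
    for (key, value) in meta
        rule = get(rules, key, drop)
        newvalue = rule(value, newdata)
        newvalue === nothing || (out[key] = newvalue)
    end
    return out
end

meta = Dict{String, Any}(
    "description" => "Sales (USD)",
    "column_mean" => 25.0,
    "row_count"   => 4,
    "unknown_key" => "no rule registered",
)
newmeta = propagate(meta, rules, [30.0, 40.0])
```

With this design the user who attaches a key is also the one who states how it behaves under row selection, which is the shift of responsibility described above.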

[emphasis added]

This is going to come back to haunt us. I can see the Hacker News thread now. Maybe I’m being melodramatic, but I really don’t think that’s the right path. Also, “the vast majority of cases” is just an assertion that you’ve made. In practice it might be “the vast minority of cases”. It hardly makes sense to write code that works in some unknown fraction of cases.

I want to push back on the whole idea of generic metadata. The “attributes” feature in R is not a feature we want to copy, it’s a feature we want to avoid. The distinction between data and metadata is ambiguous. Really, metadata is just data. You should put all the data relevant to your type in your struct.

Attempting to implement a generic API for metadata is an attempt to make concrete Julia types customizable by the user. But that’s not the right way to customize a concrete type in Julia. The right way to customize a concrete type in Julia is to use composition. In other words, wrap the type in your own type and add the new features that you want.

The developer of a type has full control over the behavior of the type, and should only implement correct behaviors. The DataFrame need not bend to the interface whims of the user—it should only implement correct table behaviors.
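
A minimal sketch of the composition approach, under stated assumptions: `AnnotatedTable`, `addnote!`, and `selectrows` are hypothetical names, and a NamedTuple of vectors stands in for a real table type.

```julia
# Hypothetical wrapper type: the table itself stays untouched, and the extra
# information lives in a field that the wrapper's author fully controls.
struct AnnotatedTable{T}
    table::T                                   # any table-like object
    column_notes::Dict{Symbol, Vector{String}}
end

AnnotatedTable(table) = AnnotatedTable(table, Dict{Symbol, Vector{String}}())

# Attach a note to a column, creating the list of notes on first use.
function addnote!(at::AnnotatedTable, col::Symbol, note::AbstractString)
    push!(get!(() -> String[], at.column_notes, col), String(note))
    return at
end

# The wrapper decides what row selection means for its notes. Here they are
# kept as-is, which is only appropriate for free-text notes.
function selectrows(at::AnnotatedTable{<:NamedTuple}, rows)
    newtable = map(col -> col[rows], at.table)
    return AnnotatedTable(newtable, deepcopy(at.column_notes))
end

at = AnnotatedTable((sales = [10.0, 20.0, 30.0], region = ["N", "S", "N"]))
addnote!(at, :sales, "Winsorized at 99th percentile")
sub = selectrows(at, [1, 3])
```

The cost of this design is that every table operation the user needs has to be forwarded through the wrapper by hand.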

But do you agree that if we changed the name from metadata to notes and made it clear that the functionality is only intended to add notes to the whole table or to columns, and that these notes would be preserved following the rules currently implemented in the PR, then things would be fine?

Yes, I agree with that. :slight_smile:

There should be multiple note fields per column/table. I think you could support a small set of fields and satisfy most people, but I don’t think that restriction really enforces anything so it may as well stay open. I personally would add fullname, and survey people are going to want question.

colname: :T
fullname: "Temperature"
unit: u"°C"
source: "Thermocouple #7"
description: "Temperature measurement from the upper-right corner."

A huge benefit of this feature for me is going to be pretty printing the table at the end. I can just refer to a column as T throughout my code, but all the supplemental information will travel with the column. I can change units and just change one field. Then I can print the table columns as fullname * "(" * unit * ")" and print the table with some metadata footnotes. For this reason, I would argue against limiting metadata to one big notes string.
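
The pretty-printing idea can be sketched in plain Julia as follows. This is not the DataFrames.jl API: `colnotes` and `header` are hypothetical names, the `:t` and `:id` columns are invented for the example, and the unit is stored as a plain string rather than a Unitful unit.

```julia
# Hypothetical per-column notes, stored as plain strings for simplicity.
colnotes = Dict(
    :T  => Dict("fullname" => "Temperature", "unit" => "°C",
                "source" => "Thermocouple #7"),
    :t  => Dict("fullname" => "Elapsed time", "unit" => "s"),
    :id => Dict{String, String}(),
)

# Build a display header: "Fullname (unit)" when a unit is known,
# falling back to the short column name when no fullname is stored.
function header(col::Symbol, notes::AbstractDict)
    name = get(notes, "fullname", String(col))
    unit = get(notes, "unit", nothing)
    return unit === nothing ? name : string(name, " (", unit, ")")
end

headers = [header(col, colnotes[col]) for col in (:T, :t, :id)]
```

Changing the unit then means editing a single entry in `colnotes`, while the code keeps referring to the column by its short name.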

I agree with @CameronBieganek. Any language DataFrames.jl can use to make metadata seem less “serious” is good. So notes is a good name change.

@Nathan_Boyer - the idea, if I understand it correctly, is to change the name of the feature we add from “metadata” to “notes”. All else would be unchanged. You still would be able to store a dictionary mapping note names to note values.

So from the functional perspective nothing would change in the PR, except for one thing: we would not automatically propagate metadata from other tables in the DataFrame constructor, and most likely the functionality would not implement the DataAPI.jl interface (which would be obsolete), but would be DataFrames.jl specific. What would change is the name of the feature, which would more clearly convey the purpose of the provided functionality.

Would that work for your use cases?

I totally agree that the name change is probably all that is needed. I was arguing against some suggestions above and bringing up another use case.

I think we are converging on the right solution. I am not sure if the fields need to be restricted to String or not. I have a Unitful unit in my example that might be useful. There are probably other good use cases for non-strings, but I understand the desire to discourage anything too clever.

Ah - right. The concept evolved based on the discussion :slight_smile:.

I think values in the dictionaries can be allowed to be Any, but we will clearly state (as we do now) that non-strings are discouraged: when saved to disk they will be converted to strings anyway, so non-strings can be meaningfully used only on a “per session” basis.
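
A small sketch of why non-string values are only meaningful per session. It assumes, as stated above, that saving converts every value to a string; the `string` call here merely stands in for whatever a storage format would do, and the dictionary contents are made up.

```julia
# Hypothetical notes dictionary that allows any value type in memory.
notes = Dict{String, Any}(
    "description" => "Sales (USD)",
    "threshold"   => 24000,          # an Int while the session lasts
)

# Stand-in for writing to disk: every value becomes a string.
saved = Dict(key => string(value) for (key, value) in notes)

in_memory_is_int = notes["threshold"] isa Int
on_disk_is_string = saved["threshold"] isa String
```

After such a round trip the reader gets "24000" back as text, so code that relied on the original type would need to parse it again.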

Could multilingual (label/metadata) support be built in by default
(to make it an “all-inclusive” DataFrame)?

  • That is, the ability to define (table/column) metadata per language.

example:

julia> colmetadata(df, :rating)
Dict{String, Any} with 1 entry:
  "label"    => "ELO rating in classical time control"
  "label_en" => "ELO rating in classical time control"
  "label_de" => "ELO-Bewertung in der klassischen Zeitsteuerung"
  "label_fr" => "Classement ELO en contrôle de temps classique"
  "label_pl" => "Ocena ELO w klasycznej kontroli czasu" 
  "label_hu" => "ELO-értékelés a klasszikus időellenőrzésben"
  "label_pt" => "Classificação ELO no controlo clássico do tempo"
  ....
  "wikidata"     => "Q105955"
  "wikipedia_en" => "Elo_rating_system"
  ...
  "description_en" => "The Elo rating system is a method for calculating the relative skill levels of players in zero-sum games such as chess. It is named after its creator Arpad Elo, a Hungarian-American physics professor."
  "description_pl" => "Ranking szachowy, ranking szachowy Elo – metoda obliczania relatywnej siły gry szachistów w punktacji Elo. Nazwa ta pochodzi od nazwiska Arpada Elo, amerykańskiego naukowca pochodzenia węgierskiego, którego prace ukształtowały szachowy system rankingowy oparty na naukowych podstawach"
  "description_it" => "Il sistema di valutazione Elo è un metodo per calcolare i livelli di abilità relativi dei giocatori in giochi a somma zero come gli scacchi. Prende il nome dal suo creatore Arpad Elo, un professore di fisica ungherese-americano."
  "description_ru" => "Система рейтингов Эло, коэффициент Эло — метод расчёта относительной силы игроков в играх, в которых участвуют двое игроков (например, шахматы, шашки или сёги, го). Эту систему рейтингов разработал американский профессор физики венгерского происхождения Арпад Эло "
  ...
  "instance_of" => "rating system"
  "part_of" => "chess terminology" 
  ...
  • It is related to the Julia community’s “Diversity and Inclusion” initiatives (language diversity).
  • It has practical value for creating multi-language reports/tables.
  • I prefer an unlimited key-value tagging possibility.
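
With open key-value tagging, per-language lookup needs no special support and can be layered on top by the user. A minimal sketch in plain Julia, assuming the "label_<language>" key convention from the example above; `localized` is a hypothetical helper, not part of any package.

```julia
# Hypothetical notes for one column, following the "label_<language>" convention.
rating_notes = Dict{String, Any}(
    "label"    => "ELO rating in classical time control",
    "label_de" => "ELO-Bewertung in der klassischen Zeitsteuerung",
    "label_pl" => "Ocena ELO w klasycznej kontroli czasu",
)

# Look up "<field>_<lang>" first, then fall back to the untranslated field,
# and finally to a caller-supplied default.
function localized(notes::AbstractDict, field::AbstractString,
                   lang::AbstractString; default = "")
    return get(notes, string(field, "_", lang)) do
        get(notes, field, default)
    end
end

label_de = localized(rating_notes, "label", "de")
label_fr = localized(rating_notes, "label", "fr")   # no French entry: falls back
```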

Ah, ok, I might have been a little hasty in my response. After taking a closer look at the proposed API and looking at the examples from @Nathan_Boyer and @ImreSamu, I see that changing the function names, e.g. from colmetadata to colnotes, (and changing nothing else from the PR) is merely a cosmetic change. If the function name was changed from colmetadata to colnotes, then the DataAPI docstring should probably be updated to say that colnotes returns an AbstractDict{String, String} rather than just an AbstractDict{String}.

However, I guess it would be ok to stick with the current naming (colmetadata, etc), as long as the documentation makes it abundantly clear that this feature is only meant to be used for column notes (and things of that nature), and is explicitly not meant for attaching arbitrary data to columns*. Other explanations in the documentation, such as “Only attach metadata to a column that still makes sense after subsetting the column,” would also be warranted. It’s not a perfect solution, but I guess I could live with it.

*The difficult part is defining what counts as “column notes” and what counts as “arbitrary data”.

(And maybe I leapt too quickly to the correctness argument. If the implementation does what the documentation says it does, then it’s correct.)