DataFrames.jl: metadata

You don’t read enough papers then! It’ll be right there in the table as sales_{i,t} and you’d have to read the caption to know that it was really log sales. But enough about that.

2 Likes

Let me add one more comment about the design of metadata in DataFrames.jl to help users decide what they prefer (if we decide to keep it):

  • it is easy and cheap to drop all metadata and just not use it;
  • if you do not use metadata it has no impact on the performance of operations or on the memory footprint of objects;
  • metadata is not used for any logic of processing data in DataFrames.jl.

This means that metadata is for users who want it. If someone does not want to use metadata, then the whole discussion is irrelevant for that person, as metadata will not affect their workflow at all.

This helped me realize that I probably should clarify my original question a bit. I am not asking if someone wants to use metadata (because if someone does not want to use it the discussion we have is irrelevant for such a person). The question I asked should probably be read: if you want to use metadata how do you want it to be handled (and there are three options listed).

Of course this is not 100% correct thinking since as @jling commented - adding any functionality to the package leads to “feature creep” to some extent, but with metadata the API is, in my opinion, quite simple, and - what is more important - if someone does not want to use it then the API can just be ignored.

5 Likes

In my opinion metadata for tabular data deserves a more general treatment that works with any Tables.jl table. The definition of metadata itself can vary across communities and increasing the complexity of DataFrames.jl with a specific set of rules doesn’t seem to pay off.

Can you please give examples of metadata considered so far @bkamins? How is this information used in practice?

And this is the approach we take. See https://github.com/JuliaData/DataAPI.jl/pull/48 defining a Tables.jl API for metadata. However, as @pdeffebach noted, it is impossible to define metadata propagation rules at the Tables.jl level, since propagation rules have to be implemented in the functions that perform table manipulation.

This is the point of @pdeffebach’s comments. He has a lot of experience with using metadata, so he can probably comment more on this.

In general there are two types of metadata:

  • table level metadata; here the most common metadata will be things like: table caption, table source, table license;
  • column level metadata; here I expect metadata like: column label (a verbose version of column name) and various comments how column was calculated or some specific issues with its interpretation (this becomes relevant when your data frames have thousands of columns as @pdeffebach commented).
3 Likes

Not similar. The reason it was deprecated was because it was impossible to define the behavior because you can do arithmetic and the value semantics were only defined for the arithmetic. If you do x .+ y with two DEDataArrays, which one do you grab the metadata from? If you choose a rule like “always x”, then the solution of putting a DEDataArray into a program is dependent on the ordering of the operations, meaning every possible change to any program is public API breaking. Thus there is no rule for making DEDataArrays work with broadcast in a general way, period. “Difficult to support every corner case” no, literally impossible.
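To make the ordering problem concrete, here is a minimal sketch. The `Tagged` type and the "always keep the left operand's metadata" rule are hypothetical and exist only to illustrate the argument; this is not how DEDataArray was implemented.

```julia
# Hypothetical value-plus-metadata type; NOT the actual DEDataArray.
struct Tagged
    value::Float64
    tag::String   # stands in for arbitrary metadata
end

# Assumed rule for illustration: the result always keeps the left operand's metadata.
Base.:+(x::Tagged, y::Tagged) = Tagged(x.value + y.value, x.tag)

a = Tagged(1.0, "from a")
b = Tagged(2.0, "from b")

# The numeric result does not depend on the order of the operands ...
(a + b).value == (b + a).value   # true

# ... but the metadata does, so reordering a sum changes what comes out.
(a + b).tag   # "from a"
(b + a).tag   # "from b"
```

Under such a rule, rewriting `a + b` as `b + a` anywhere in a program changes the metadata of the result even though the numbers are the same.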

People aren’t talking about doing arithmetic with dataframes. You’re not doing a linear solve c = A\b with b a dataframe and such, so those issues are not related. And the ones here are easily solvable.

1 Like

Not linear solve, but for example:

(df1 .+ df2) .* df3

is a valid operation involving three data frames. And in this case the rule would be:

  • table level metadata is only kept for key => value pairs that are identical for all three tables
  • column level metadata is only kept for some column for key => value pairs that are identical for this column in all three tables
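
A minimal sketch of that rule, using plain `Dict`s to stand in for the metadata of each table (or of one column across the tables). The `common_metadata` helper is hypothetical and not part of DataFrames.jl; it only shows the "keep what is identical everywhere" logic.

```julia
# Hypothetical helper: keep only the key => value pairs that are present
# and equal in every metadata dictionary passed in.
function common_metadata(first_meta::AbstractDict, rest::AbstractDict...)
    return Dict(k => v for (k, v) in first_meta
                if all(m -> haskey(m, k) && isequal(m[k], v), rest))
end

# Assumed example metadata for df1, df2 and df3.
m1 = Dict("source" => "survey 2021", "caption" => "wave 1")
m2 = Dict("source" => "survey 2021", "caption" => "wave 2")
m3 = Dict("source" => "survey 2021")

# "caption" differs between m1 and m2 and is missing in m3, so only "source" survives.
common_metadata(m1, m2, m3)
```

With a rule like this the result does not depend on the order of the operands, since a pair is kept only when all inputs agree on it.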
7 Likes

I think it is very similar. What should happen if you concatenate two columns that have metadata with the same key? That is a very similar problem to x .+ y.

A little off-topic, but it can be done. We have a proprietary solver here that solves this problem by requiring you to manually define a set of operations for each parameter you add to the structure (which is similar to the DEDataArrays). It is a set of “instructions” that tell the solver what to do with those parameters in all possible cases. It would be very difficult to implement and very restrictive in DEDataArrays, given how wide the DifferentialEquations ecosystem is. My opinion is that it would also be very bad to require the user to write such definitions for every piece of metadata in DataFrames.

I never knew that :sweat_smile:

Of course the example above is contrived, and probably not used in practice. The practical examples are e.g. coalesce.(df, 0) or clamp.(df, 0, 1) (and such data cleaning operations are quite common). In these cases metadata on both the table and the column level would just be kept. What I understand from @pdeffebach’s comments is that in such cases he wants metadata to be kept.

1 Like

I already voted and I like the idea. Anyway I just wanted to give a shout-out to the way you develop and drive DataFrames.jl. Very commendable :pray:t2:

17 Likes

I haven’t followed the full discussion across all the repos but I haven’t seen a very thorough discussion of intended use cases. I’d want to know that before deciding how any such feature should behave.

“Metadata for columns and/or DataFrames” (Issue #35 in JuliaData/DataFrames.jl on GitHub) makes me think it’s mainly for adding survey questions to survey response tables. What is your vision of how this will be used?

If the issue has been open for ten years, doesn’t it imply that it’s not an important issue? So alternative 4 would be to just close the issue.

1 Like

Would a more generic solution (to column level metadata) be to have a type for vector+metadata? Then if your dataframe had this type of column you’d have metadata for free with no changes to the code. And it can be reused in other table implementations or other data structures.
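
A rough sketch of what such a type could look like in plain Julia. The `MetaVector` name and its fields are hypothetical and do not refer to an existing package.

```julia
# Hypothetical vector-plus-metadata wrapper; not an existing package type.
struct MetaVector{T} <: AbstractVector{T}
    data::Vector{T}
    meta::Dict{String,Any}
end

# Minimal AbstractVector interface: forward size and indexing to the wrapped data.
Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]
Base.IndexStyle(::Type{<:MetaVector}) = IndexLinear()

sales = MetaVector([1.0, 2.5, 4.0], Dict{String,Any}("label" => "log sales"))

sales.meta["label"]   # the metadata travels with the column itself
sum(sales)            # generic code sees an ordinary AbstractVector

# Generic operations fall back to plain arrays unless the wrapper defines
# its own rules, so the metadata is dropped here:
shifted = sales .+ 1
typeof(shifted)       # Vector{Float64}
```

In this sketch the wrapper can be stored anywhere an `AbstractVector` is accepted, but whether metadata survives an operation still has to be decided by whoever implements that operation for the wrapper.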

What is the cure for cancer or Alzheimer?

An important change of the context is that, as I have written in the initial post, Arrow.jl and Parquet2.jl packages support metadata persistence (and both these packages are also part of interoperability between Julia and other ecosystems that support Arrow/Parquet, which is essentially every ecosystem).

Before such packages existed, metadata would be lost between Julia sessions, which made it much less useful.

3 Likes

Nobody is going to write “various comments how column was calculated or some specific issues with its interpretation” for each column if there are thousands of columns. So I don’t think that’s a very good line of reasoning. More generally speaking, I think if some code involves programmatic metadata for each column, it should be handled separately from the table, since the logic for handling that metadata will almost certainly vary from application to application.

How about a middle path where only specific predetermined metadata fields are supported, like label and source? The metadata fields that would be supported would be those “very stable” metadata fields for which the propagation rules are deemed to be nearly always correct.

1 Like

My two cents. I never like a package feature that “works as intended in 90% of the cases”. I find this unsafe. It also tends to mean that you need to guess how a user has used that feature and/or decided about borderline cases, even if you decide not to use the feature. In my view, providing information that is not necessarily what you expect is way more dangerous than not providing information. I’d end up thinking “has the user forgotten to re-label this variable or is this actually an accurate description?”

I think that DataFrames.jl should have all the features that are safe, correct, and general. Any other feature should be added to a different package. Overall, I want to know as a user that any feature of DataFrames is valid and unambiguous.

5 Likes

Regarding interoperability: Would the metadata capability of DataFrames.jl match all features of Parquet and Arrow? Would it be 100% round-trippable?
In the past I once had the feeling I wanted something for metadata in tables, but nowadays I don’t see the use case anymore.