DataFrames.jl: metadata

Calling them “notes” is shorter, implies using strings, and also avoids this:

My thinking behind allowing Any values is that one could, for example, want to store an md"" object as a note (to put it differently, notes can have types other than strings).

Indeed we will have to insist in the manual on the fact that metadata is designed for information that is general enough that it does not become incorrect after subsetting rows or columns. As long as users are provided with a definition of what metadata is expected to be and how it behaves, it doesn’t make sense to say that propagation rules would be “correct only most of the time, and sometimes incorrect”: they are always correct; we just have to ensure users clearly understand what kind of information they should store in metadata (and that they are not too tempted to misuse the feature, which I don’t think will be the case).

That’s an interesting extension. I’d leave this out for now as we have enough issues to tackle, but I imagine a special naming scheme could be used, probably with a more specific character than “_”, e.g. "label#LANG:de" => "ELO-Bewertung in der klassischen Zeitsteuerung". Actually as long as multiple packages agree on a common convention to set such fields and/or consume them, no change is needed in DataFrames. This convention could even extend to other languages as the metadata key names can be exchanged via Arrow, Parquet, etc. Stata already supports this, I’ll check whether/how it can be imported in Julia (see this PR). I don’t know whether there are existing naming conventions in Arrow or Parquet.
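The naming convention sketched above can be illustrated with plain Base Julia. Everything here is hypothetical: the `"label#LANG:de"` key format is only the suggestion from this post, and `label_for` is a made-up helper showing how a consuming package could resolve such keys.

```julia
# Hypothetical convention from the post above: translated labels live under
# "label#LANG:<code>" keys inside an ordinary Dict{String,Any} of column notes.
const colnotes_elo = Dict{String,Any}(
    "label"         => "ELO rating in classical time control",
    "label#LANG:de" => "ELO-Bewertung in der klassischen Zeitsteuerung",
)

# Made-up helper: look up a label for a requested language code,
# falling back to the default "label" key when no translation exists.
function label_for(notes::Dict{String,Any}, lang::Union{Nothing,String}=nothing)
    lang === nothing && return get(notes, "label", nothing)
    return get(notes, "label#LANG:$lang", get(notes, "label", nothing))
end
```

With this convention, `label_for(colnotes_elo, "de")` returns the German label, while `label_for(colnotes_elo, "fr")` falls back to the default one; no change in DataFrames.jl itself would be required, only agreement between packages on the key format.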

Let me summarize my current thinking, for clarity about the state of this discussion (let me stress that this does not reflect the JuliaData maintainers’ point of view; it is my personal perspective, but hopefully it will be convincing). I would like to thank all the people who voted and commented, as this is a really constructive thread; the Julia community is amazing:

  1. The “metadata” term seems too general. People might expect to use metadata for program logic, while what we wanted to provide in DataFrames.jl was a way to store notes. This is especially relevant in the context of the Metadata.jl work (which might in the future lead to a general metadata ecosystem) and of what Arrow.jl and Parquet2.jl provide (where metadata is relevant, e.g. for storing type information or summary statistics of columns for quick access to them).
  2. At the same time there is a strong need to have “notes” support. This is not needed by all users but there are large communities for which this is an essential functionality (and we do not want to introduce a barrier for migration of such users to Julia because we lack “notes” support; remember - if someone does not need “notes” this functionality can be just ignored and will not affect anything in existing codes). This is especially visible in domains that are close to traditional statistics (like economics, medicine etc.). In particular ecosystems like Stata or SAS provide a way to store “notes” in their source files and it is essential to be able to conveniently import and work with such data in DataFrames.jl.
  3. Apart from the storage formats I have listed above, @nalimilan has pointed to the Dublin Core Metadata standard, which is indeed well thought out. Here is a set of examples: DCMI: Using Dublin Core. It turns out that essentially what Dublin Core calls metadata is “notes”. I also think Dublin Core answers many of the questions raised in this thread about what is meant by metadata in the originally intended sense (the standard is very precise in specifying what it considers metadata and how it should be codified).

Therefore my conclusion is:

  1. we need to add some kind of support for “notes” to DataFrames.jl
  2. we also need to establish interoperability standards for these “notes” (i.e. decide how these notes should be saved/loaded in Arrow/Parquet/SAS/Stata/SPSS etc. formats)

The interoperability point is crucial for convenience of use of the system and requires agreement between maintainers of different packages.

My proposal is therefore as follows:

  1. The functionality designed in Metadata on data frame and column level by bkamins · Pull Request #3055 · JuliaData/DataFrames.jl · GitHub is suitable for “notes” (i.e. notes should be propagated, merged when matching). It is not suitable for other metadata in general.
  2. Therefore in DataFrames.jl we will call this information “notes” and the relevant functions will be notes, colnotes, hasnotes and hascolnotes, based on Dict{String, Any} just like currently in the PR (values are allowed to be Any because, as long as we understand that these are notes, this is sometimes convenient, e.g. to store a copyright year as a number).
  3. “notes” are a subset of the possible metadata space (in the future Metadata.jl might support more general mechanisms for various types of metadata). The defining property of metadata that can be considered “notes” is that it is the kind of information that is 100% correctly handled under the proposed metadata propagation rules. The most important, and possibly debatable, rule here is: if you have column :some_col and you transform it but keep the same column name :some_col in the output, then you accept that notes are propagated. As @pdeffebach commented above, and I agree with this, it assumes that users re-use a column name under “data cleaning” operations, but change the column name if the column significantly changes its meaning. One can write :x => ByRow(x -> rand()) => :x and nothing can stop one from doing this, but the feeling is that if one has “notes” for :x and performs such a transformation, most likely the target column name should not be :x.
  4. in the future we might add to operations specification syntax a way to add metadata propagation hints as @Nathan_Boyer suggested (but I would leave this decision for later after we have some experience with using “notes”)
  5. Importantly, if one does not want to use “notes” it will not have any performance impact on operations and it will be easy and cheap to drop “notes” in-place with dropnotes!. (so, in other words, users that do not need “notes” are not affected by this PR)
  6. All this should be very precisely documented and explained (with references to resources such as Dublin Core and examples of interoperability with Arrow.jl, Parquet2.jl, SAS/Stata/SPSS) so that users have a clear understanding of what “notes” are for (fortunately, it seems that the users who might naturally want “notes”, like Stata users, will intuitively understand them in the way they are currently designed).
  7. The functions notes, colnotes, hasnotes and hascolnotes should be added to the DataAPI.jl interface to allow for interoperability. This is the crucial difficulty, as it is the only point where some agreement needs to be reached: we need to decide which metadata, e.g. in Arrow/Parquet/SAS/Stata/SPSS etc., is considered notes.
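To make the proposed accessor names concrete, here is a toy Dict-backed model of the semantics described in point 2. Note that `notes`, `colnotes`, `hasnotes` and `hascolnotes` are only the names proposed in the PR, not an existing DataFrames.jl or DataAPI.jl API, and `ToyTable` is an invented stand-in for a data frame.

```julia
# Toy model of the proposed interface; nothing here is real DataFrames.jl code.
struct ToyTable
    table_notes::Dict{String,Any}               # table-level notes
    col_notes::Dict{Symbol,Dict{String,Any}}    # per-column notes
end

notes(t::ToyTable) = t.table_notes
hasnotes(t::ToyTable) = !isempty(t.table_notes)
colnotes(t::ToyTable, col::Symbol) = get(t.col_notes, col, Dict{String,Any}())
hascolnotes(t::ToyTable) = any(!isempty, values(t.col_notes))

# Any values are allowed, so a copyright year can be stored as a number:
t = ToyTable(
    Dict{String,Any}("source" => "FIDE ratings export", "copyright year" => 2022),
    Dict(:elo => Dict{String,Any}("label" => "ELO rating")),
)
```

Here `hasnotes(t)` is true, `colnotes(t, :elo)["label"]` yields the column label, and a column without notes simply returns an empty Dict.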

For example, for statistical software files the general definition of metadata is (GitHub - junyuan-chen/ReadStatTables.jl: Read data files from Stata, SAS and SPSS into Julia tables):

struct ReadStatMeta
    labels::Dict{Symbol, String}
    formats::Dict{Symbol, String}
    val_label_keys::Dict{Symbol, String}
    val_label_dict::Dict{String, Dict{Any,String}}
    filelabel::String
    timestamp::DateTime
    fileext::String
end

and in Expose metadata via `DataAPI.metadata` by nalimilan · Pull Request #6 · junyuan-chen/ReadStatTables.jl · GitHub a mapping of all of these values to table- or column-level notes is proposed.
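The mapping idea could look roughly like the sketch below. This is hand-written for illustration, not the actual code from that PR: file-level fields become table-level notes, per-column labels and formats become column-level notes, and the key names are assumptions.

```julia
using Dates  # DateTime, used by the ReadStatMeta struct quoted above

# Hypothetical mapping sketch (not the code from PR #6): convert a
# ReadStatMeta-like object into table-level and column-level notes.
function to_notes(m)
    table_notes = Dict{String,Any}(
        "filelabel" => m.filelabel,
        "timestamp" => m.timestamp,
        "fileext"   => m.fileext,
    )
    col_notes = Dict{Symbol,Dict{String,Any}}()
    for (col, lbl) in m.labels
        col_notes[col] = Dict{String,Any}("label" => lbl)
    end
    for (col, fmt) in m.formats
        get!(col_notes, col, Dict{String,Any}())["format"] = fmt
    end
    return table_notes, col_notes
end
```

Since `to_notes` only reads fields, it works on any object with the same field names, e.g. a NamedTuple mimicking ReadStatMeta in a test.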

What we would need is a similar consensus with Arrow.jl (CC @quinnj) and Parquet2.jl (CC @ExpandingMan) on which metadata is considered to be “notes”. My thinking is that all non-technical metadata (i.e. not something that is in reserved metadata namespaces or things like column statistics) can be safely considered “notes”, as these formats only allow string values there (so such values are essentially unstructured text).

I hope this proposal will be acceptable to the community. I know there are various cons, but we need to reach some consensus respecting the different requirements that users have, and I believe the proposed solution both gives the functionality that is currently needed and leaves the “metadata” term for a more general usage in the future, where we will understand that “notes” is one of the kinds of metadata that allows the propagation rules currently proposed in DataFrames.jl.

Hard issues…

I’m hesitant about the “notes” terminology, as “metadata” seems more established in similar implementations (and more generally for this kind of notion). “notes” would just be a subset of “metadata”, and we need to be able to import/export from one to the other, so having two entirely different mechanisms and terminologies could be annoying. If other packages start implementing metadata support, will DataFrames have to support both notes and metadata? It would be weird to be able to attach e.g. a label using a metadata function to many types of objects, except for DataFrame.

Maybe it would be better to provide a way to distinguish different kinds of metadata, so that by default only metadata keys that are known to be safe are propagated? Either we expect that DataFrames will only ever support “notes” and never other kinds of metadata, in which case the problem of incorrect propagation could be handled by ensuring that when constructing a DataFrame from a Tables.jl object by default we only import metadata keys which are known to correspond to “notes” (e.g. labels, units, comments). Or we expect that DataFrames will also support metadata in general, and in that case it looks like we need to design a more general system which allows a fine-grained definition of propagation rules (like @Zach_Christensen proposed). The former option is of course simpler to implement, and it could be extended to the latter, as long as we use the “metadata” terminology so that we don’t have to rename functions later.

The easiest way to move forward now is to just use a more specific name so we can look up “notes”/“descriptors”/“annotations” within whatever metadata(x) returns in the future. I don’t have strong opinions about what that name ends up being, but I greatly appreciate the open discourse and the responsibility being taken in moving forward here.

If it is proving too difficult to settle on a single name for the public interface now we could have a set of distinct methods for accessing specific common “notes” (colsource, colunits, etc.). If you want highly specific behavior for propagating colunits that differs from colsource then you can build it in manually until we have a more general approach. This has the disadvantage of requiring manually defining each method in DataAPI or DataFrames, but might be more desirable if we want to ensure we can move forward now without hindering work later.

summary of my thoughts on metadata (it’s either premature or the wrong approach)

I will summarize some of the statements I have made in github since I don’t think I have given them here in discourse:

In my opinion it’s premature to add metadata functionality into DataFrames.jl. While I think we could all imagine our own uses for this in principle, in practice it seems that picking an implementation imposes arbitrary constraints such that it immediately fails for a large class of important cases. The problem is primarily that it does not seem possible to come up with a universally acceptable way for the metadata to transform under transformations of the column itself. There is clearly a huge set of such transformations and a wide variety of ways different types of metadata should behave under each of them.

For example (almost) any non-trivial operation on a column will invalidate all statistics which immediately precludes a large (and arguably the most important) class of metadata from any reasonably general implementation. What we are left with is largely formatting and “engineering” metadata. This is indisputably important, however I’m afraid matters don’t get much better from here. Important implementation and typing details are already stored by the AbstractVector type of the column itself, so it’s not clear to me why metadata should ever include any of this.

What we are left with looks suspiciously application-specific and it is no more clear how any of it should behave under operations. This screams out for specific implementations for specialized use-cases. We should take the hint:

The issue we are seeing is not the need for generalized “metadata” per se but that DataFrames are not extensible enough. Currently, if a user wants to implement functionality such as the Stata or SAS examples that @bkamins mentioned, there’s a very high labor cost to doing so. This is because, while we have gotten a lot of mileage out of the extensibility of AbstractVector, there aren’t practical ways of either defining an AbstractDataFrame or extending the functionality of DataFrame so that new behavior can be defined under tabular operations (I do not want to insist that what is needed here is an AbstractDataFrame with easily definable subtypes or a Tables.jl-like abstract interface; the best solution might be neither of these).

In my opinion, the question we should be asking at this point is not how to add metadata and define general rules for it, but how to allow users to easily define behavior that accompanies tabular operations that are not easily expressible in terms of AbstractVector alone (i.e. mainly joins and groupby). This would allow us to ultimately have metadata without worrying about improper behavior that silently breaks a huge class of use cases, and to actually facilitate the useful transformation of included metadata in specific cases (which the current proposal doesn’t seem to provide any path to).

comments on compatibility with parquet format

The parquet format currently allows the following types of metadata:

  1. string-string key-value metadata associated with the entire table
  2. string-string key-value metadata associated with each column
  3. limited statistical metadata for each column (min value, max value, number of nulls).

Type 3 is clearly precluded by the preceding discussion. My current implementation allows users to wrap the constructed AbstractVector column objects with ones that return the statistics directly when the appropriate functions from Base are called.

I believe that types 1 and 2 are both compatible with @bkamins’ proposal, with the important caveat that parquet is an on-disk format and doesn’t have any conventions for table operations. Therefore, even with parquet (which in the static case has a nearly identical idea of what metadata is as this proposal), one immediately runs into the aforementioned problem where it is very easy for a user to get a parquet file with metadata which is inappropriately conserved under table operations that invalidate it.

I don’t believe I would have any trouble implementing the DataAPI functions for fetching the metadata from the parquet table to make Parquet2.jl fully compatible with this proposal, so I think we are good; though I should give the warning that I have not yet tried to implement this, so I don’t want to rule out the possibility of some ambiguity I haven’t understood. If the DataFrames devs want me to do a quick implementation of this for Parquet2.jl against a development branch, I’d be willing to do that.

naming conventions

In light of my above arguments I like the idea of having naming conventions which seem more specific to the implementation in this proposal. Indeed calling it “metadata” seems a bit aggressive since, as we have seen, much (most?) of what one might colloquially call “metadata” is precluded from this implementation in one way or another. I like the term notes.

appreciation for DataFrames devs

As much of the preceding discussion was rather negative, I feel compelled to show my appreciation for @bkamins, @nalimilan and others who have done an amazing job on DataFrames.jl. I truly believe this is the best in-memory dataframes package in any language, and the authors have done a fantastic job getting it to that point especially by carefully surveying existing implementations and keeping their best features. I think DataFrames.jl has now matured and it seems like development on the core functionality is more-or-less done, and I fear we are now getting into a phase of feature bloat. It’s ok to slow down at this point.

Given the post by @ExpandingMan, I want to insist on a point I made before: introducing changes today without respecting some basic principles means that 1) you can’t change them in the future, and 2) you potentially start compounding issues, creating a snowball effect. Since Julia is a new language, it can avoid the mistakes of other languages.

The issue with metadata is not about potential bad uses of a tool by specific users, but ambiguities inherent to its implementation. Just like @ExpandingMan indicates, this tool can be really useful, and even some users can find it almost critical. But, as he also indicates, rushing to add this tool to DataFrames seems an unnecessary burden.

I think that when you have tools that are either problematic or for which you can’t reach a consensus regarding their implementation, the tool should go through a trial stage. This means introducing them in a different package, so that 1) users that desperately need the feature have it available, 2) you learn and get more information about its implementation, and 3) you keep flexibility, in case there are new developments.

More generally, I think there should be a few principles that should be respected in DataFrames, no matter what. In my head, there are three rules that a data analysis package should always respect.

  1. Do not directly introduce features that are problematic/lack consensus. First, implement them in a different package, until you learn the best path to follow.
  2. Do not abbreviate names. This seems a quibble and innocuous if you do it once. However, once this is done in the main package, everyone starts to do the same in their own code/package. It’s like, unconsciously, users feel that doing this is allowed/encouraged. So, I find na.rm=TRUE in R bad practice, relative to skipmissing in Julia.
  3. If possible, do not add a multiplicity of methods for doing the same thing. At least, don’t do it on the grounds that it’s more convenient for some users. This becomes important when you have to read code written by other users: otherwise, you have to learn how you’d write it plus how other users would write it. See for instance Plots, where you have aliases for attributes, and hence one million ways to write color=:black (including c=:black, which goes back to my point 2). Likewise, I’d think twice about pushing many row-wise operations into DataFrames. I can understand their relevance, especially for those coming from R or Stata. But if this is the reason, I’d make this part a separate package or part of DataFramesMeta.

In fact, I think there could be some value in making these rules explicit. I’d find it a little bit confusing/contradictory for DataFrames to be so strict about, for instance, the use of missing values, while being so lenient about metadata.

I completely trust the developers of DataFrames, who are doing an excellent job, and leave this as a suggestion. I’m relatively new to coding, and actually I’d say that I learned the importance of these principles by using Julia. In fact, I use this opportunity to thank @bkamins, because his answers and blog have taught me a lot about coding.

Overall, I want to emphasize that I understand the potential relevance of the tool for some users. However, that shouldn’t be the discussion here. The discussion is whether the tool should already be part of DataFrames.

While I agree with the comments from @ExpandingMan and @alfaromartino that pushing this feature into DataFrames.jl could be delayed because of its much larger impact, implementing the feature in another package may not work for some of us. For example, TSx uses a DataFrame as its only property, which was a critical design decision; hence there is no way to store metadata/notes (I am unbiased towards the name) in a TS object even if one uses another package.

Metadata, in itself, is not a very complicated thing for time series data: just a few strings containing the last-updated timestamp, the source of the data, the license, etc. at the table level. Maybe some users want this at the column level, but I am not sure as of now. For now, a regular Dict{String, String} would work at the table level. Even if this information is not propagated at the DataFrame level, the propagation rules can be written inside the TSx methods, but some storage is definitely needed.

What could be a possible alternative solution to this problem if DataFrames.jl doesn’t get metadata functionality right now?

Are you sure this would be hard to implement?

Shouldn’t the designer of a package — say RichDataFrames.jl — be able to have a wrapper type RichDataFrame <: AbstractTable (a struct with a DataFrame and some metadata) and be able to implement the tables interface with a series of methods like:

function select(rdf::RichDataFrame, args...)
    original_metadata = metadata(rdf)
    df = stripmetadata(rdf)
    new_df = select(df::DataFrame, args...) # use predefined method
    handle_and_add_metadata!(new_df, original_metadata)
end

function join(rdf1::RichDataFrame, rdf2::RichDataFrame, args...)
    ... # strip metadata
    join(df1::DataFrame, df2::DataFrame)
    ... # handle and reattach metadata
end

# etc etc

…? (I may be optimistic here!)

On this note, as a user, I would much prefer having DataFrames.jl and RichDataFrames.jl be separate packages in this way. (Modularity is better than a monolith.)

Responding to @chiraganand and @alfaromartino and @Jollywatt:

TLDR (and my understanding):

  1. We will most likely add metadata to DataFrames.jl. This functionality is needed in many end-user workflows, and packages that want to build extensions using a composability approach need metadata, as @chiraganand commented (and we badly need more work in the time series and panel data domains in particular to improve user adoption; this work will happen outside DataFrames.jl, but DataFrames.jl must be ready to support it).
  2. The storage mode (Dict{String, Any}) on table and column level does not seem controversial and allows us to support persistence/retrieval in Arrow/Parquet/SAS/SPSS/Stata.
  3. The only debatable part is if and how metadata should be propagated. Here @nalimilan and I are thinking about what is best and will communicate it when we are ready, as the design is not simple.

More detailed comments in response to @alfaromartino:

  1. if we introduce something, we will provide a metadata propagation mechanism that most likely we will not have to change in a breaking way in the future (but we will still label it as “experimental” to warn users that metadata must not, at this stage, be used in program logic; it should only be used as “helpful notes” until we reach the final design)
  2. We are not rushing to add this tool to DataFrames. First, the topic has been open for 10 years, resurfacing every several months. Simply, until someone made a PR and put something on the table there was not much traction in the community to make a final decision. Once a PR with a proposal was on the table, I opened this discussion here exactly in order not to make a rushed decision, but rather to reach consensus before we merge and release anything.
  3. Why is it highly problematic to add metadata propagation in a separate package? Because it is overly optimistic to assume that you can take the output of a vcat, innerjoin or select and make a decision about how metadata should be propagated post hoc. The logic, unfortunately, is complex. The crucial issue is column names. These functions have various ways to rename columns that would be hard to trace back after the operation, e.g. the makeunique kwarg in many functions or renamecols in joins. It would of course be doable to re-create the metadata propagation logic in some wrapper functions, but it would be quite complex and hard to maintain. Therefore, if we want reliable metadata propagation, it has to be hardcoded into the functions in DataFrames.jl.
  4. Since, for the reason mentioned above, we cannot go with another package handling this, we need to add it to DataFrames.jl. Therefore, let me stress it again, 1) the design that we will introduce will be such that it will be very unlikely that we will have to break it in the future, 2) we will mark the functionality experimental; this is a standard approach to signal to the users that things might change based on the experience with working with them (see Threads module in Base Julia for an example of such an approach).
  5. You might then wonder why we do not just work out the best rules now and implement them once we know what we want. This is unfortunately infeasible in practice. Unless users are exposed to some functionality they are very unlikely to complain (see e.g. warnings in Julia 0.7 that were turned into errors in Julia 1.0, and guess when people started to report problems).
  6. Do not get the impression that we are not strict about metadata. We are 100% strict about it, and if we implement something it will be codified (it already is codified in my PR, and based on the discussion we will update the contracts to be precise after the changes). Currently there are two metadata propagation modes: no propagation, and propagation following the rules that are currently implemented in the PR. As @Zach_Christensen noted, there might be other rules designed in the future (we do not know this yet). For now these two sets of rules are on the table: “no propagation” is the default as it is 100% safe, and the “current propagation” was designed to support the “notes” kind of metadata. In particular, we will clearly document how the decision is made about which rule is used for which metadata, and we will document how we are making DataFrames.jl ready for other propagation rules (changing what is propagated and when) in the future in a non-breaking way (even if we do not implement this now).
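The column-name-based rule discussed throughout this thread (notes survive an operation only when the column keeps its name) can be condensed into a small sketch. `propagate_colnotes` is a hypothetical helper written for illustration, not DataFrames.jl code, and the `name_map` argument (output column => source column) is an invented representation of what the real implementation would know internally during an operation.

```julia
# Hypothetical helper condensing the propagation rule discussed above:
# a note attached to :x survives a transformation only if the output
# column is still called :x.
function propagate_colnotes(src::Dict{Symbol,Dict{String,Any}},
                            name_map::Dict{Symbol,Symbol})  # output => source
    out = Dict{Symbol,Dict{String,Any}}()
    for (outcol, srccol) in name_map
        if outcol == srccol && haskey(src, srccol)
            out[outcol] = copy(src[srccol])  # same name: keep the notes
        end
        # renamed columns (outcol != srccol) intentionally drop their notes
    end
    return out
end
```

So with notes on `:x`, a `:x => f => :x` style transformation keeps them, while `:x => f => :y` drops them; the point made in item 3 above is that only the functions inside DataFrames.jl reliably know this output-to-source map.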

As a side note, I really appreciate all these comments as implementation of a hard functionality (staying open for 10 years) is not easy and requires discussion. Even having to write down what we think helps both us (to implement things) and the community (to understand what are our intentions/underlying concepts).

Related to 5. and 6., I want to make it clear that I’m completely aware that you’re making these decisions after you thought about all the possibilities. As I said, I completely trust the judgment of the contributors of this package, and I know you’ll make the best choice. You, @nalimilan, and @pdeffebach are making a terrific contribution to this package, including your answers on sites like this!
All my posts only have the goal of providing some ideas/thoughts, which may or may not be relevant. Ultimately, only the people that work on a package know what’s feasible and what’s not, as well as the tradeoffs involved.

This is exactly what we need at this stage. Thank you!

My two cents concerning the “metadata” functionality (without worrying about propagation):

  • the way to use and propagate “metadata” is definitely application-specific, and is in no way related to the DataFrames.jl package (quoting @bkamins: “metadata is not used for any logic of processing data in DataFrames.jl”), hence the functionality should be provided by a separate package;
  • however, such an external package would hardly integrate transparently with DataFrames.jl, since the latter exposes a complex interface based on several hundred methods:
julia> length(methodswith(DataFrame, supertypes=true))
308

The consequence is that in order to exploit the “metadata” functionality we would likely be forced to use an alternative interface for the DataFrame object. Taking as an example the above-mentioned Metadata.jl package:

julia> using Metadata, DataFrames
julia> df = DataFrame(id=[1,2]);
julia> mdf = attach_metadata(df, (x = 1, y = 2));
julia> names(mdf)  # <-- can't use DataFrame methods on the wrapped object!
ERROR: MethodError: no method matching names(::Metadata.MetaStruct{DataFrame, NamedTuple{(:x, :y), Tuple{Int64, Int64}}})
Closest candidates are:
   ...

(the purpose here is of course not to blame Metadata.jl, but just to provide an example…).

In summary: “metadata” should be implemented in an external package, but implementing such a package would be unreasonably hard. This issue has been discussed in several posts on Discourse, e.g. here.

I’m afraid this whole discussion highlights some kind of limitation in the Julia type system :sob:, or at least some difficulty in implementing a facility which is conceptually very simple.

Concerning propagation: I believe having something working in 90% of cases (or even 99%) is a very dangerous option which I really hope will be avoided. Hence my choice is #2: Add metadata to DataFrame [(because it is very useful and would hardly be used if implemented in a separate package)], but never propagate metadata.

I think that this proposal, like all pieces of code, aims to work as described 100% of the time. It may happen that 1% or 10% of the time the user does not find the described behavior useful or intuitive, but that is hardly the fault of the code.

I’m a bit doubtful that users will really avoid relying on the new feature in their program logic. I’m not sure what code there is that doesn’t have program logic. I think once it’s there, people will definitely build on it, even if that’s not intended.

I am still too unclear on the use cases, and would like to see concrete examples of where this is useful and where it isn’t.

To allow people to develop those examples, perhaps this feature could be released as experimental and subject to removal, either within DataFrames.jl or as another package pirating or forking DataFrames.jl.

Where it would be useful is clear: for example, be able to use column labels, or save information about when, where or how the data was observed; the former is already loaded from e.g. Stata, SAS and SPSS data files by StatFiles.jl/ReadStatTables.jl. Where the feature isn’t useful: well, in lots of cases, but that’s not the question as long as it doesn’t hurt these use cases.

Is it fair to say the main use of that would be social survey/administrative data analysis?

I can confirm it would be useful there since that’s my research field. But I suspect metadata can be useful in many more areas as it’s common to have to note information about which experiment a dataset refers to, or how a variable was measured. Maybe others can tell.
