I think @bkamins as usual did a fantastic job. It doesn’t seem that any of us have generated a better proposal of what we are here defining as metadata.
So the fact that we have what seems like the best implementation of this but that it is still so problematic should give us serious pause.
That said, my opinion on this whole issue is much softened by the metadata being opt-in. I still have significant concerns about “getting stuck with it” and what will and will not be considered breaking for the API, but to be honest I agree that merging this as opt-in is unlikely to lead to any real catastrophes.
style - This reads to me like the “style” of the metadata which you could support with values like :note, :statistic, etc. However, I think it is good to directly set the propagation method like you have it, so I’d call the keyword argument prop or propagation or propmethod.
:no - This sounds like a Boolean option to me, so I’d call it :none instead.
Maybe there can be a function to overwrite the default propagation per session or per data frame (in addition to per metadata)? set_metadata_propagation(:pass[, df::DataFrame])
After reading the back-and-forth here, I think I’ve returned to the opinion that the feature should be named notes instead of metadata. As has been discussed, metadata fields which contain any function of the data are quite unlikely to be correctly propagated by the proposed rules. Therefore, we should be more explicit in our restriction of the feature to only metadata fields that contain notes about a column.
Of course it is possible to encode arbitrary data in a note string (let alone in a Dict{String,Any}), so to be extra explicit, the documentation and/or docstrings should contain a statement like the following:
Column notes should not contain any values that are a function of data in the DataFrame.
For what it’s worth, as a longtime Stata user, I’d like to caution against trying to model DataFrames too much after the tabular data in Stata and similar software. The point is that the tabular data there is much more than just a type: it’s a distribution format, a set of things that you can keep in memory at one time, and a mini-documentation at the same time. None of that applies to our discussion here, I believe.
My own experience with metadata (in particular variable labels) in Stata is that I’ve learned to distrust them, particularly because they’re likely to be wrong at the time when they’re being read: they were often set by a well-meaning research assistant, but then underwent a series of transformations that (perhaps?) invalidated them. I can certainly see the benefits of using them diligently, but I’ve also learned that they’re most often not used diligently. So I’d be either in the “no metadata” camp or in the “never propagate” camp.
If @bkamins thinks that metadata are important to get DataFrames more widely adopted, then I would support that movement. But I cannot help but feel that whatever format for metadata (or notes) we choose, there will be people that will be unhappy with them and will want the format to change.
The names are temporary. Once we finalize this discussion, I will ping you in the DataAPI.jl PR to discuss final naming (as the discussion might give guidance on naming).
We will add convenience functions in DataFramesMetadataTools.jl (or similar package).
It would be easy to do this in the future if one wanted to (for now we are just settling on two styles of metadata; many more styles can be defined later). Let me give an example. We could define a style :mean, and if DataFrames.jl encounters this style then it computes the mean after any transformation and stores it as the metadata value. We could even imagine allowing a function to be passed as a style in the future, so one could pass x -> quantile(x, 0.25) as a style to get the 25% quantile.
I do not want to decide now if we will allow such things in the future. What I want to say is that the design proposed now does not preclude such metadata.
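To make the idea of a "computed" style concrete, here is a minimal sketch in plain Julia. It is a toy model under stated assumptions, not the DataFrames.jl or DataAPI.jl API: the types `MetaEntry` and `ToyColumn`, the function `transform_column`, and the style names `:pass` and `:none` are all hypothetical placeholders chosen for illustration.

```julia
using Statistics  # standard library: mean, quantile

# Hypothetical toy types for illustration; not the DataFrames.jl API.
struct MetaEntry
    value::Any
    style::Union{Symbol,Function}   # :none, :pass, or a function of the column
end

struct ToyColumn
    data::Vector{Float64}
    meta::Dict{String,MetaEntry}
end

# Apply `f` to the data, then decide per entry what happens to its metadata.
function transform_column(f, col::ToyColumn)
    newdata = f(col.data)
    newmeta = Dict{String,MetaEntry}()
    for (key, entry) in col.meta
        if entry.style isa Function
            # "Computed" style: recompute the value from the new data.
            newmeta[key] = MetaEntry(entry.style(newdata), entry.style)
        elseif entry.style === :pass
            # "Pass" style: carry the value over unchanged.
            newmeta[key] = entry
        end
        # Any other style (e.g. :none) is dropped.
    end
    return ToyColumn(newdata, newmeta)
end

q25(x) = quantile(x, 0.25)

col = ToyColumn([1.0, 2.0, 3.0, 4.0], Dict(
    "unit"   => MetaEntry("kg", :pass),
    "mean"   => MetaEntry(2.5, mean),
    "q25"    => MetaEntry(1.75, q25),
    "source" => MetaEntry("survey wave 1", :none),
))

doubled = transform_column(x -> 2 .* x, col)
```

In this sketch, `doubled.meta["mean"].value` is recomputed to `5.0`, while `"unit"` is carried over unchanged and `"source"` is dropped. Note that passing `"unit"` through a doubling is exactly the kind of case where blind propagation may be wrong, which is why such a style would have to be opt-in.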
I plan to document, as a hint to users, that if one uses the :pass style (or whatever name we give it; my initial idea was to call this :note, but @nalimilan felt that :pass is better, and we can decide the name later) then such metadata should not be a function of the data.
This is my biggest fear in the whole design. That is why I propose that :pass (or whatever other name) is not set by default. And I hope that if someone sets style=:pass then this is done on purpose when propagation is desirable.
Sorry for going off on a bit of a tangent here. For those looking into using an external tool for metadata, I have found that the combination of FileTrees and DataFrames fulfills all my (admittedly quite modest) needs for metadata.
The key is that FileTrees itself is quite powerful when it comes to querying and selecting which parts of the tree to apply functions to, so I can easily keep the metadata inside whatever type I need and have the DataFrames on the side, while still maintaining whatever coupling between them I need through the tree structure. I frequently use it for attaching metadata to sets of DataFrames in the same manner as well, and I don’t think this is something which would be in scope for a native DataFrames solution.
There is also DataSets which seems to take a more rigorous approach, but I have not used it myself.
Disclaimer: I have not used any table tool other than DataFrames, so there is a chance I don’t know what I’m missing w.r.t. metadata.
Despite this post not showing many signs of it, I have read the whole thread and I have full confidence that whatever is put into DataFrames will be great. Don’t treat this post as a “no” vote.
Question: I saw discussion about attaching “textual” (or, possibly, Any) metadata/notes. What about attaching a function (e.g. to recompute some statistics, or correct some column when another is modified)?
I understand this could be problematic to read from file or store to it, but if the function is “live”, could it be useful?
Apologies for possibly disturbing the main discussion.
Could you elaborate? Does that happen because Stata propagates metadata in too many cases? In the DataFrames proposal, metadata wouldn’t be propagated after an arbitrary transformation is applied (unless the column name remains the same; that’s the topic of a sub-debate). Would the kind of metadata you’re thinking about even be incorrect to propagate e.g. after taking a subset of rows/columns or joining datasets?
No specific use case, just wondering about interest and feasibility, notably e.g. by including a new “style=:reactive” in your proposal as a starting point for user implementations. My first two ideas were:
automatic statistic update when subsetting
automatic additional column update when subsetting or merging
Note that this would be useful if there were an automatic way to register those callbacks, but I don’t see how in the current framework. At least, maybe as manually user-initiated?
I understand it would be more complicated than a Dict of textual values, which seems the most useful usage (e.g. for units). My point was just about keeping this in mind when specifying your API, as I did not see any mention of the possible “function” aspect in the previous discussion. But I would not like to derail or lengthen the main discussion with this side idea.
The answer is that all this is technically possible and the proposal is ready for adding such extensions in a non breaking way.
However, I would prefer not to discuss them now (apart from stating that this is possible in the future, i.e. the design is flexible enough to allow for handling them).
The reason is that, as you see in this thread, it is very hard to reach a consensus about a functionality that many users say they want.
Therefore, to summarize without going into details, the approach we propose to take right now is:
design the DataAPI.jl + DataFrames.jl API in a way that allows for virtually any extension in the future in a non-breaking way;
initially implement only two kinds of metadata handling (“no propagation”, and propagation in “notes” style), as “no propagation” is a kind of null model for metadata propagation, and “notes” propagation is something that users say they need;
for this reason, the initial API is low-level and includes only what must be included to support metadata in general.
There is a plan to have an extension package (tentatively called DataFramesMetadataTools.jl) that will add high-level API for users actively working with “notes” style metadata to make using such metadata convenient for them (but we will not add this high-level API to DataFrames.jl nor to DataAPI.jl - these packages will only have a minimal set of functionalities that ensure that it is possible to work with metadata).
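The two initial kinds of handling summarized above can be sketched as a toy model in plain Julia. Everything here is an assumption made for illustration: `ToyTable`, `setmeta!`, `getmeta`, `subset_rows`, and the style names `:none` and `:note` are hypothetical and are not the DataAPI.jl or DataFrames.jl API, whose names were still open at this point in the discussion.

```julia
# Hypothetical toy model; names and styles are placeholders.
struct ToyTable
    columns::Dict{Symbol,Vector{Float64}}
    meta::Dict{String,Tuple{Any,Symbol}}   # key => (value, style)
end

# Opt-in: metadata is not propagated unless the caller asks for the :note style.
function setmeta!(t::ToyTable, key::String, value; style::Symbol = :none)
    style in (:none, :note) || throw(ArgumentError("unknown style: $style"))
    t.meta[key] = (value, style)
    return t
end

getmeta(t::ToyTable, key::String) = first(t.meta[key])

# Row subsetting: only :note-style entries survive the operation.
function subset_rows(t::ToyTable, rows)
    cols = Dict{Symbol,Vector{Float64}}(name => v[rows] for (name, v) in t.columns)
    kept = filter(p -> last(p.second) === :note, t.meta)
    return ToyTable(cols, kept)
end

t = ToyTable(Dict(:wage => [10.0, 20.0, 30.0]), Dict{String,Tuple{Any,Symbol}}())
setmeta!(t, "source", "payroll export")                  # default style: not propagated
setmeta!(t, "wage_unit", "EUR per hour"; style = :note)  # explicitly opted in
sub = subset_rows(t, [1, 3])
```

After the subset, `sub` still carries `"wage_unit"` but not `"source"`. The low-level surface is deliberately small (set, get, and a propagation rule); convenience helpers would live in a separate package, as described above.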
I’d like to suggest an alternative option: wait a while longer, maybe another 2~3 years or so.
I’d expect that in the meantime some “effect system” (like the one in Koka) will find its way into Julia. Then operations involving DataFrames (arithmetic and others) could be invoked in a context-dependent manner, so that metadata propagation could be implemented with “effectful” semantics in the larger data manipulation context, and thus function well in all expected cases.
In the future in general yes. My initial thought is that e.g. DataFramesMeta.jl could be extended to store every macro call expression on a data frame in some metadata key.
It is out of scope of the current proposal because, as I have commented above, we want to have something more basic to get started with.
Can you please point me to active discussions on adding such functionalities to Julia where I can learn what and when is planned to be added? Thank you!
Indeed, it’s because the metadata is propagated in too many cases. I agree that it’s more conservative to not propagate (and I’d prefer that). My concern is that many people will decide to propagate, but more out of laziness (and then we’re back to the old problem). But maybe I’m too pessimistic here.
Possibly, depending on what people put in there. Imagine you have data on employees with a variable gdr that contains the metadata/note "gender", and you merge them to their employer firms, keeping only the employee with the highest wage, then keeping that metadata as is would probably be confusing (whose gender? The unit of observation is the firm…?). One could argue that it’s still better to keep it, because at least it tells you that gdr stands for “gender”, even if it doesn’t tell you whose gender. So it’s a gray area.
All that said, my own preference would be to not allow propagation, because that minimizes the risk of getting information that’s wrong (but I can see that many people could be unhappy about that). But I think you guys are in a better position to judge the various tradeoffs.
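The employee/firm example can be sketched in plain Julia to show where the gray area comes from. The data, the `notes` dictionary, and the helper `top_earner_per_firm` are hypothetical and only model verbatim ("pass"-style) propagation; they do not use any DataFrames.jl functionality.

```julia
# Hypothetical data and helper names, for illustration only.
employees = [
    (firm = "A", wage = 30.0, gdr = "f"),
    (firm = "A", wage = 45.0, gdr = "m"),
    (firm = "B", wage = 50.0, gdr = "f"),
]
notes = Dict(:gdr => "gender")   # column note attached at the employee level

# Collapse to one row per firm, keeping the highest-paid employee.
function top_earner_per_firm(rows)
    best = Dict{String,eltype(rows)}()
    for r in rows
        if !haskey(best, r.firm) || r.wage > best[r.firm].wage
            best[r.firm] = r
        end
    end
    return [best[f] for f in sort!(collect(keys(best)))]
end

firms = top_earner_per_firm(employees)
firm_notes = copy(notes)   # "pass"-style propagation: the note is carried over verbatim
```

The note on `gdr` still reads "gender" after the collapse, which is not false, but the unit of observation is now the firm and the note no longer says whose gender it is.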
This is a hard choice indeed. I was considering it a lot, and my conclusion is:
1. If someone does not want to propagate, then that person will not turn on propagation (so everyone who does not want propagation should be happy with the current proposal).
2. If someone wants propagation and knows its consequences (like e.g. @pdeffebach), then they also get what they want.
3. Then there are some users who think they want propagation but do not understand its consequences. What we can do for them is: a) provide appropriate documentation, b) wait until they turn into type 1 or type 2 above after gaining experience.
If the Julia core development team had wanted to rule out functionalities that could lead to type 3 users, then we would never have gotten multi-threading support, @inbounds, @fastmath, or integer overflow in Julia. Therefore my conclusion is that it is OK to add this functionality.
In this example, metadata still sounds like a “win”. You have the uninformative variable name gdr and a label that tells you it means "gender".
I understand metadata should never be blindly trusted or assumed to communicate everything there is to know about a dataset. I hope people do not use it that way. But at the same time I don’t think it’s DataFrames.jl’s job to limit functionality because people may misuse it. It’s DataFramesMeta.jl’s job to provide a broadly useful feature and document what are good and not-good uses for it.
Yeah, I think part of the problem is the habit that some people have of treating metadata (variable labels in this case) as a substitute for appropriately documenting datasets.
On many occasions, I have had people hand me just a flat Stata file with a couple of vague notes in the labels and then expect me to know what to do with it. Of course, you’re then stuck wondering which of the variable labels correctly describe the data that you’ve been handed, and which ones were set earlier in the processing of the data and were simply propagated through. That is precisely because there’s a strong correlation between people who don’t bother to document the data before handing it off and people who don’t bother updating the variable labels after they make a metadata-invalidating transformation of it.
Of course, this is more a problem with other people being sloppy and not wanting to spend their time writing documentation, not with the utility of propagating the metadata.