After another round of brainstorming the current proposed approach is as follows:
when creating a metadata entry, it will be required to pass its “propagation style”.
For now, two propagation styles are supported in DataFrames.jl: :no and :pass, where :no means no propagation and :pass means propagation following the rules implemented in the current PR. If the user passes an unrecognized propagation style, it is treated as :no (this allows other table types to define propagation styles beyond these two, even when those styles are not also implemented in DataFrames.jl).
AFAICT this approach meets all requirements presented so far, as there is fine-grained control over how the user wants metadata propagated (in particular, it will not be propagated by default; you have to choose :pass to have it propagated). The reason style is a Symbol is that it puts the least burden on the compiler and does not use up namespace.
The API for working with metadata will be as follows:
metadata(table, key, [default]; full=false): get table-level metadata for key. If full=false, return just the value; if full=true, return a tuple (value, propagation_style). If metadata for key is not defined, throw an error; if default is passed, return default instead of throwing an error when key is missing.
metadatakeys(table): get an iterator of table-level metadata keys (in DataFrames.jl it will be a KeySet); if table-level metadata is not supported, return ().
metadata!(table, key, value; style::Symbol): set table-level metadata for key to value, with style as the propagation style (DataFrames.jl currently supports :no and :pass; an unrecognized style is treated as :no).
colmetadata(table, col, key, [default]; full=false): get column-level metadata of column col for key. If full=false, return just the value; if full=true, return a tuple (value, propagation_style). If metadata for key is not defined, throw an error; if default is passed, return default instead of throwing an error when key is missing. If column col does not exist, an error is always thrown.
colmetadatakeys(table, col): get an iterator of column-level metadata keys for column col (in DataFrames.jl it will be a KeySet); if column-level metadata is not supported, return ().
colmetadata!(table, col, key, value; style::Symbol): set column-level metadata of column col for key to value, with style as the propagation style (DataFrames.jl currently supports :no and :pass; an unrecognized style is treated as :no).
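To make the contract above concrete, here is a minimal sketch in Python of the proposed table-level semantics (Python strings stand in for Julia Symbols, and the MetaStore class and its method names are hypothetical illustrations, not part of the proposal):

```python
# A minimal model of the proposed table-level metadata contract:
# each entry is stored as (value, propagation_style), lookup honors
# `default` and `full`, and unrecognized styles degrade to "no".

SUPPORTED_STYLES = {"no", "pass"}

class MetaStore:
    """Table-level metadata: key -> (value, propagation_style)."""

    _MISSING = object()  # sentinel so that None can be a legal default

    def __init__(self):
        self._meta = {}

    def set_metadata(self, key, value, *, style):
        # Per the proposal, an unrecognized style is treated as "no".
        if style not in SUPPORTED_STYLES:
            style = "no"
        self._meta[key] = (value, style)

    def metadata(self, key, default=_MISSING, *, full=False):
        if key not in self._meta:
            if default is self._MISSING:
                raise KeyError(f"no metadata for key {key!r}")
            return default
        value, style = self._meta[key]
        return (value, style) if full else value

    def metadatakeys(self):
        return self._meta.keys()

store = MetaStore()
store.set_metadata("source", "survey2022.csv", style="pass")
store.set_metadata("cache", b"...", style="frobnicate")  # unknown -> "no"
print(store.metadata("source"))               # survey2022.csv
print(store.metadata("cache", full=True)[1])  # no
print(store.metadata("absent", "fallback"))   # fallback
```

The column-level functions follow the same pattern with an extra col lookup in front (and an unconditional error when the column does not exist).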
As a laboratory scientist who has already implemented a crude workaround to keep metadata alongside a DataFrame, I appreciate this proposal to add metadata to DataFrames.jl. The proposal in #3055 would work really well. The transformations in :pass reflect the way I currently combine and process metadata. Two thumbs up!!
we can just define additional style values, and they can do whatever we decide we need without conflicting with what we have now. It is important that we require explicit passing of style (so that future changes will be non-breaking for existing code).
A major motivation for propagation by default is to reduce the mental overhead of dealing with metadata. Defining propagation rules every time you set the metadata is still quite annoying, but better than dealing with it for every join etc.
I’m still in favor of propagation by default but if this PR is what gets merged, and we can change propagation rules in the future then go for it.
It is required for non-breaking extensibility of metadata in the future.
I assume that a DataFramesMetadataTools.jl package will be created soon to simplify this process (exactly following the recommendation in this thread: put in DataFrames.jl only what must be in DataFrames.jl and put the rest in an extension package). I invite you to collaborate on its design/implementation. In this package, I assume, we will define more user-friendly convenience methods, like label (or tools for saving metadata to CSV). The point is that in DataAPI.jl and DataFrames.jl we need maximally generic definitions so that we do not need to introduce breaking changes.
I agree with pdeffebach. The latest proposal is fine if deemed necessary to move this forward, but I don’t see the benefit of using this API and choosing :no.
Maybe someone in the :no-by-default camp can explain why the proposed API is better than separate dictionaries?
I can explain this with an example. Some table storage formats, e.g. Parquet2.jl, store various metadata: some of it is style=:no metadata, some is style=:pass metadata, and for some it can even be unclear how it should be propagated, so the user should decide.
Therefore, for example, I want to leave @ExpandingMan, who is the maintainer of Parquet2.jl, the freedom to decide how metadata style will be tagged (I am even OK if Parquet2.jl decides to tag all metadata as :no-style) and then to add convenience functions in DataFramesMetadataTools.jl that make it easy to change metadata style (it will be only one extra line in the code).
The point is:
I want source formats to be sure they can safely expose all metadata they store.
I want to minimize the discussions in these packages (like Arrow.jl, Parquet2.jl, SAS/SPSS/Stata readers/writers) about how metadata should be tagged. Instead, I want to allow the respective maintainers to make the decision they like best (assuming, of course, that it is documented) and then give users an easy tool to change it.
The point is that if we go this way we will be able to move forward with metadata in a constructive way relatively fast. Otherwise we will have super long discussions (like this one) for every new package that decides to support metadata.
My objective is to have one long discussion (the one we are having now in this thread) and make sure that afterwards package maintainers can move forward fast, because they know the general design rules, which can be easily followed and do not put too many restrictions on them.
The problem with the “dictionaries” you propose is that they (i.e. persistence-package maintainers) would need to decide upfront which metadata is propagated and goes into a data frame, and which metadata is not propagated and goes into a separate dictionary. I prefer to tell them “just pass all metadata and tag it with the style you like” and later allow users, once the metadata is already in a data frame, to easily update the style however they like.
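As a sketch of this division of labor (Python stands in for Julia, and all names here, such as read_file_metadata and setstyle, are invented for illustration, not actual APIs):

```python
# Division of labor sketched above: a file reader hands over ALL of its
# metadata, conservatively tagged "no", and a one-line user-level helper
# re-tags chosen keys as "pass" afterwards.

def read_file_metadata():
    # A reader like Parquet2.jl can safely expose everything it stores,
    # tagged as non-propagating by default.
    raw = {"units": "USD", "row_group_stats": [1, 2, 3]}
    return {k: (v, "no") for k, v in raw.items()}

def setstyle(meta, key, style):
    # The kind of convenience function a DataFramesMetadataTools.jl-like
    # package could provide: change the style of already-loaded metadata.
    value, _ = meta[key]
    meta[key] = (value, style)

meta = read_file_metadata()
setstyle(meta, "units", "pass")  # the user opts in for this key only
print(meta["units"])             # ('USD', 'pass')
print(meta["row_group_stats"])   # ([1, 2, 3], 'no')
```

The reader never has to guess propagation semantics; the user makes that call once the data is loaded.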
The propagation_style suggestion seems like an improvement to me.
Indeed, :no by default seems essential for on-disk formats such as Arrow and Parquet; without it we might be stuck not even extracting the metadata by default (which would defeat the purpose of having it in a DataFrame). The problem is that, e.g., a Parquet file can have metadata which is not well-behaved under the transformations defined in @bkamins' proposal. It is entirely reasonable for it to do so, since Parquet is a static format. Therefore one must assume that propagating the metadata from the static file is erroneous.
This is essentially why I still don’t think somehow tacking metadata onto dataframes is a good idea.
We can’t figure out a consistent set of metadata transformation rules that is not likely to constantly break.
To solve this disable propagation.
So what’s the point?
I again emphasize that even if you enable propagation it will only be useful in a relatively narrow subset of cases (i.e. those for which mutation does not invalidate the metadata).
Really, the more I think about this, the worse it gets. Another thing to consider is that the columns themselves are mutable, and the interface does not currently “gatekeep” their mutation (which I consider a really good feature of the existing DataFrames API). So even if you could come up with some way of telling the DataFrame when it should invalidate metadata, you could not implement it without spoiling one of the best features of the interface.
I respectfully suggest that some of the desire for this feature is motivated by myopic consideration of specific cases. There are all sorts of data structures which are not tables and Julia is already quite generous with the tools it gives you for associating initially separate objects (e.g. struct, NamedTuple, Dict).
I think it would be productive to get into exhaustive detail of specific examples for which the current proposal might be useful. For example, I would like to hear more about
Considering the preceding conversation I very much suspect that this has in mind a very specific implementation that only works because of the specific transformations being performed. Details about why this may or may not be the case would be helpful.
Essentially, I have raw data in several CSV files. I load them, join into a single data frame, then do some processing like subset, change units, compute other quantities (columns), etc. It would be nice to attach metadata about the data collection near where I load the CSV and have it stick around for when I eventually save/print a final focused/polished table.
Maybe you can expand, just for the sake of discussion: which part of this workflow goes wrong if you try to store the metadata separately in a Dict{String, Dict{String, Any}}? If the fields are static, then you can still add to this dict as you load CSVs, and then at the end do something like fullname * " (" * metadict[fullname]["unit"] * ")"
This seems draconian. Metadata is still going to represent the data correctly under the vast majority of modifications to a DataFrame. Something like "Sales (USD)" is going to survive a subset, adding or dropping columns, a join, an hcat, fixing missing values, etc.
Also, as discussed above, the term “break” is not quite correct. It’s a tool and has defined behavior. It’s up to the user to work with metadata responsibly. But on balance, it’s easier for the user if DataFrames.jl propagates.
To be honest, I haven’t tried using a separate Dict yet, because that is an idea I only just got from @jules in this thread. It might work fine. I guess my workflow is more of a reason that I want propagation to work easily if built into DataFrames. The built-in functionality could definitely save some boilerplate over the Dict method and provide extra conveniences. If it favors safety so strongly over convenience, I might be better off with the separate Dict; I’m not sure yet.
That is the point of the current proposal. You only get propagation if you set style=:pass, and if you do, you know what behavior you subscribed to. You can safely just set style=:no and then be sure nothing bad will happen.
And to answer @Nathan_Boyer about what he is not sure about:
a Dict will fail in the following cases:
if you rename columns, you need to update the dictionary as well (easy, but it requires two steps that users will not want to do manually)
if you do any operation involving two or more tables (join, hcat, vcat), you need to think hard about how to merge the metadata (not easy)
in general, it is cumbersome to repeatedly do operations like:
do some transformation of a data frame
do the equivalent operation on the matching metadata dictionary
if it is only one operation, that is fine, but imagine a 10-step @chain of operations.
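For illustration, a small Python sketch of the two-step bookkeeping and the merge problem described above, using a plain side dict instead of attached metadata (all names here are invented, and the merge policy shown is just one possible choice):

```python
# Bookkeeping burden of a separate metadata dict: every table operation
# needs a mirrored, manual operation on the dict.

table = {"price": [1.0, 2.0]}
meta = {"price": {"unit": "USD"}}

# Step 1: transform the table (rename a column)...
table["price_usd"] = table.pop("price")
# Step 2: ...and remember to mirror it in the metadata dict by hand.
meta["price_usd"] = meta.pop("price")

# Combining two tables means merging their metadata too, and conflicts
# need a policy (here: keep only entries on which both sides agree).
meta_a = {"price_usd": {"unit": "USD"}}
meta_b = {"price_usd": {"unit": "EUR"}}
merged = {
    col: {k: v for k, v in meta_a[col].items()
          if meta_b.get(col, {}).get(k) == v}
    for col in meta_a.keys() & meta_b.keys()
}
print(merged)  # {'price_usd': {}} -- the conflicting unit is dropped
```

One such pair of steps is tolerable; a long @chain of them is exactly the boilerplate the proposal automates.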
So in short:
it is fully possible to have a separate Dict for metadata (this is essentially what we do internally; a reference to such a Dict is stored in the data frame, and the operations on it happen automatically if you opt in);
however, the point of keeping metadata attached to a DataFrame is that:
you do not have to think about moving both the data frame and its metadata object around in every operation you do (imagine working with 10 different tables, some of which are temporary)
you do not have to constantly remember to properly propagate metadata (after you have written 100 transformations in a session, it is simply something you wish were automated, as it is pure boilerplate code that is error-prone)
I emphatically disagree with this point. The only reason you can even claim this is that we have already implicitly excluded a huge class of metadata (including essentially all statistics). Considering that many crucial implementation details are already covered by the underlying AbstractVector, this leaves precious little. You are picking out an incredibly narrow set of use cases and then claiming that the system works well in general because it works for those use cases. My point is that you have already excluded too much.
My assertion is not that metadata (by some appropriate definition) is not useful. It certainly is. Instead my problem is that it seems to have no place in DataFrames because we’ve quickly ruled out that the DataFrame can do you many favors in the general case.
We are seeing a very real problem here, but again, I believe that problem is that DataFrames are still not extensible enough, it’s still very hard to define any behavior that survives joins and groupbys (the operations that distinguish dataframes). I’m convinced the problem is real, I’m not remotely convinced that tacking some dicts inside of DataFrame is a good solution to it.
I am one of those who thinks statistics are data not metadata. I don’t really see the point of storing statistics as metadata, since they so often need to be recomputed and are easy to recompute. If an operation destroys my source or unit information without a way to recover, then I am in big trouble. If an operation destroys your mean, then your data is still there and you would have needed to recompute it anyway.
I agree with that. With no system I can imagine would it be possible to store statistics that are stable under transformations. That seems like an oxymoron. However I also agree with the above post that it currently seems too difficult to extend DataFrames with third party packages and we should be wary of feature creep.
I’m not all that comfortable with calling statistics “metadata” either but it doesn’t much affect the discussion. It would still be nice to store them in a system that knows how they should transform and whether they should be invalidated. Besides that “statistics” includes pretty much anything that depends on the values in the column. That doesn’t leave a lot, again considering that there’s already a lot of important implementation stuff that’s just part of the AbstractVector.
I’m not suggesting that they would transform trivially, I’m suggesting you’d need the flexibility to define rules for their transformations. That the metadata proposal provides no route to doing this to me merely demonstrates why we don’t need it.
That’s certainly not true in general. It is very common to have to use some statistic very frequently and need to avoid the O(n) (or worse) cost of constantly recomputing them from scratch. Any functional of the data is technically a statistic.
I think as long as propagation is opt-in and there is some guidance about what is appropriate metadata and what is not, then it is fine. That defines a mostly clear set of rules for people to follow, which is no different than in any other programming context. We have plenty of cases in Julia, outside of DataFrames.jl, where bad/unexpected behavior can arise as a result of something being used improperly. I see metadata as falling in that camp.
To expand, earlier in the thread I also had the belief that the system would be rather hopeless because metadata is necessarily arbitrary. But perhaps the system that we can implement right now isn’t for arbitrary metadata. Perhaps it is only for metadata that satisfies some particular criteria. That’s ok, isn’t it?