TBH I don’t see propagation as a problem in that case. Like @pdeffebach, I consider it a strict improvement to provide the column label “gender” and not only “gdr”. That doesn’t tell you whose gender this refers to, but it doesn’t tell you anything incorrect or misleading either.
Actually we could imagine automatically adding details to column labels about where the column comes from when performing joins. Something like “$column_label ($source_df_label)”. Of course this would only be practical for short data frame labels, so maybe we should have a metadata key for long labels and one for short labels/names. This is the kind of thing we can discuss later, once the basics are settled.
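As a sketch of the idea, a tiny helper like the hypothetical `decorated_label` below (not part of any existing API; the label values are assumed metadata) could produce such combined labels:

```julia
# Hypothetical helper: decorate a column label with its source table's
# short label, so that after a join the origin of the column is visible.
# `column_label` and `source_df_label` would come from metadata entries;
# neither name is part of an existing API.
decorated_label(column_label::AbstractString, source_df_label::AbstractString) =
    string(column_label, " (", source_df_label, ")")

decorated_label("gender", "patients")  # "gender (patients)"
```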
Today, one of these things was reviewing a forecasting system, all implemented in R (and this is OK; all the methods used are very well supported in R).
The system consists of many forecasting models, and each of them has many features. You probably know where I am going with this. In an interactive review session it was hard to discuss the models and their implications by looking at variable names only. There were constantly extra questions about the data (those pesky “notes” we discuss in this thread). I wish we had had a convenient way to look up “notes” metadata in the process (and, incidentally, one of the issues was units, which @Nathan_Boyer mentioned: some values were counted in units, others in thousands of units, and this was not reflected in the variable names).
Now, let us even assume that the person preparing the data was not super careful and initially not all the metadata was correct. Upon inspection we would fix it, and the next time we checked that metadata it would be correct. The crucial point here is that when you have a massive number of variables you keep asking about the same variable many times (it is simply impossible to remember everything).
Now, I can imagine keeping a “data dictionary” decoupled from the data frame. The problem is that, in my experience, it is even more prone to getting out of sync with the data than metadata attached to a data frame.
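For illustration, a units note like the ones described above can travel with the column itself; this sketch assumes the :note-style column metadata API that later landed in DataFrames.jl 1.4:

```julia
using DataFrames

df = DataFrame(requests = [1.2, 3.4])

# Attach a human-readable note to a column. The :note style means the
# metadata is propagated through most transformations (DataFrames.jl >= 1.4).
colmetadata!(df, :requests, "note", "requests, in thousands of units"; style=:note)

colmetadata(df, :requests, "note")  # "requests, in thousands of units"
```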
In my world, statistics about the data are not metadata. Metadata is only the descriptive information that facilitates the interpretation of the data and cannot otherwise be derived from the data. Global metadata covers instrument settings, sample information, experimental method, when, where, why, how, by whom, etc. Column-wise metadata helps someone else understand what the data in a column means. If I’m combining two DataFrames and the columns have the same name but different meanings (different metadata), there is something unholy going on.
I’d prefer merging DataFrames to maintain only the global and column-wise metadata that both DataFrames hold in common (meaning the same name and the same value). This way, if the metadata disagrees, it isn’t maintained.
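The “keep only what both tables agree on” rule can be sketched with plain Dicts (illustrative only; this is not the actual DataFrames.jl join behavior):

```julia
# Metadata of two tables being merged (illustrative values).
meta_a = Dict("source" => "survey_2021", "units" => "thousands")
meta_b = Dict("source" => "survey_2021", "units" => "ones")

# Keep a key only when both tables hold it with the same value.
merged = Dict(k => v for (k, v) in meta_a
              if haskey(meta_b, k) && meta_b[k] == v)

merged  # Dict("source" => "survey_2021"); the disagreeing "units" key is dropped
```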
In the BI (Business Intelligence) world there is some interesting metadata we could implement as core metadata. For example: an aggregation rule.
" ~ … To resolve this, they need to define an aggregate rule for the semi-additive measure “Total Requests”. A semi-additive measure is a measure that is to be summed for some dimensions, but should not be summed across some other dimensions. For the dimensions over which the measure is not additive, a different aggregation rule must be specified." ( via )
The proposed design is now described at the DataAPI.jl level in add `metadata` by bkamins · Pull Request #48 · JuliaData/DataAPI.jl · GitHub, along with an example reference implementation in the tests (as discussed, the API is minimal and maximally generic, as we try not to introduce any restrictions for the future). After this PR is finalized I will move the API to DataFrames.jl (which will contain DataFrames.jl-specific solutions, especially the :note-style implementation).
The third step will be to add higher-level convenience functions for working with metadata (I will discuss with @pdeffebach where to put them).
Koka is more of a research language; e.g., cyclic data structures are not supported. I expect scientific/engineering requirements will involve complex integration of software components, as the requirements of www infrastructure have demonstrated, so at some point the Julia community will become aware of Dependency Injection / Inversion of Control techniques, which will eventually lead to effect systems.
Metadata implementation in DataAPI.jl and DataFrames.jl is out on main. You can read here about the rules. Some testing before the DataFrames.jl 1.4 release is welcome.
Note that, following the discussion in this thread, the metadata API allows specifying a style for each metadata entry that affects how it is handled under transformations. Currently the :default style (no propagation, except when copying a whole table) and the :note style (metadata is propagated) are implemented.
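A minimal sketch of the two styles, assuming DataFrames.jl ≥ 1.4 (the exact propagation rules are described in the documentation linked above):

```julia
using DataFrames

df = DataFrame(x = 1:3)

# :note-style metadata is propagated through transformations;
# :default-style metadata is kept only when the whole table is copied.
metadata!(df, "caption", "toy data"; style=:note)
metadata!(df, "tmp", "scratch")  # style=:default is the default

df2 = select(df, :x => ByRow(string) => :x_str)
collect(metadatakeys(df2))  # only "caption" survives the transformation
```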
However, because of this flexibility the API is relatively complex. For this reason I will create the TableMetadataTools.jl package, which will make the most common metadata operations easier.
Nice, informative post, just as I’m running into metadata transformation considerations.
By the way, I’m surprised you didn’t do more pretty-printing of your variable values, to include units and a controllable number of significant digits.