DataFrames.jl: metadata

The examples brought up so far for which metadata would be needed don’t sound convincing enough to me to justify deep changes in almost every function in DataFrames, as Bogumił said would be required. The “complex rules” sound a bit like: as long as a column keeps the same name, or its name is deliberately changed, the metadata should go along. But that doesn’t sound much more useful than starting out with a separate dictionary and keeping it updated for all operations where deliberate changes are made. So far the most useful case brought up was labeling a table, but does that information really need to travel through all DataFrames manipulations, where it’s prone to desync with the data anyway? Either you make lots of changes to the data frame and the metadata becomes useless, or you don’t and a separate dict would have sufficed. The couple of operations where it’s a bit inconvenient to drag a separate dict along are joins and explicit variable renames.

Wouldn’t a compromise be something along the lines of adding a method to each DataFrames function so that it works on a tuple of a data frame and an AbstractDict? In such cases the same kind of tuple is returned, with the appropriate metadata operations applied to the dict depending on the function. That would be a “sidecar” metadata approach.
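As a rough sketch of that sidecar idea (hypothetical, not part of any proposal; it commits type piracy on select and is only meant to illustrate the shape of the API):

using DataFrames

# Hypothetical extra method: accept a (data frame, metadata dict) tuple and
# return the same kind of tuple, with the dict pruned to the surviving columns.
function DataFrames.select((df, meta)::Tuple{DataFrame,<:AbstractDict}, args...; kwargs...)
    result = select(df, args...; kwargs...)
    kept = Dict(k => v for (k, v) in meta if String(k) in names(result))
    return (result, kept)
end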

2 Likes

It would be more flexible. The difference is that Arrow.jl requires both keys and values of metadata to be strings, while in the current implementation in DataFrames.jl we allow values of metadata to be any objects (but we recommend them to be strings to ensure lossless saving).

As @bkamins already noted, people who don’t use metadata or have concrete use cases for it are not really concerned by this discussion since nothing will change for them. So please detail your use cases and how the design would affect them rather than make general comments.

The idea is that you would never store this kind of information as metadata. TBH the example you give is quite contrived. It would be more natural to just use “Sales of top firms (USD)” as the label, or even “Sales (USD)”. The fact that these are the top 3 firms is already indicated by the number of rows in the dataset.

Being able to propagate metadata automatically is such an important feature that it justifies making some assumptions about what type of metadata we want to support. IMO metadata should at least be propagated when taking subsets of rows or of columns, otherwise it’s too inconvenient to be useful. Users who want to use a different kind of metadata can handle it manually via a separate dict, custom vector types, a separate AbstractDataFrame type…
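For concreteness, a minimal sketch of what subsetting could look like, assuming the metadata!/metadata accessors of the eventual DataAPI design (the PR’s exact signatures may differ):

julia> using DataFrames

julia> df = DataFrame(a = 1:4, b = 5:8);

julia> metadata!(df, "caption", "Table 1"; style=:note);  # :note-style metadata propagates

julia> metadata(df[1:2, [:a]], "caption")  # subsetting rows and columns keeps the caption
"Table 1"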

That said, I agree that propagating metadata with transform(df, col => f => col) is a bold decision which requires users to adopt a particular workflow where they rename variables instead of replacing them when their meaning changes. I think experience will tell us whether that’s a good idea or not (we don’t want to consider the metadata propagation rules stable, precisely so that we can adapt them in the future).

That’s a possibility, but having different propagation rules depending on the key would be super hard to document. Already the documentation of a single behavior is relatively complex to write. And actually having metadata that never propagates doesn’t sound super useful: you can store it in a separate dict. The uncertainty is not so much on the definition of the kinds of metadata, but on what users mean when they write e.g. transform(df, col => f => col): does f return the same variable, just fixing some outliers or replacing missing values? or does it return a completely different one? Note that this is just a particular case, we could perfectly propagate metadata in other cases but not in this one if we think it’s too unsafe.

Actually we have considered this, but that would be problematic for custom vector types. For example, a CategoricalArray wrapped in a MetadataArray would no longer behave as a CategoricalArray. Anyway we could switch to this kind of approach later if we want as it wouldn’t break the API.

There’s no reason that transforming a data frame many times would make the metadata get out of sync. When doing transform(df, ...), variables that are not modified should keep their metadata. Many kinds of metadata also still apply if you take a subset of rows. And so on. Keeping track of which columns have been retained after a transformation to update a separate dict would be doing the work twice (or even worse, as Dicts do not use the same API as DataFrames).
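A small sketch of that point, again assuming the colmetadata!/colmetadata accessors (signatures may differ from the PR):

julia> using DataFrames

julia> df = DataFrame(sales = [100, 200], region = ["N", "S"]);

julia> colmetadata!(df, :sales, "label", "Sales (USD)"; style=:note);

julia> df2 = transform(df, :region => ByRow(lowercase) => :region_lc);

julia> colmetadata(df2, :sales, "label")  # :sales was untouched, so its label survives
"Sales (USD)"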

3 Likes

On a side note: newcomers will get confused by some naming conventions if DataFrames.jl starts supporting metadata while DataFramesMeta.jl has nothing to do with metadata.

3 Likes

Arrow.jl uses column metadata for storing type information so that when it goes to deserialize the table later on, it can recover the Julia types:

julia> using Arrow, Dates

julia> input_table = (; col = [Nanosecond(1)])
(col = [Nanosecond(1)],)

julia> table = Arrow.Table(Arrow.tobuffer(input_table))
Arrow.Table with 1 rows, 1 columns, and schema:
 :col  Nanosecond

julia> getfield(table, :columns)[1].metadata
Base.ImmutableDict{String, String} with 2 entries:
  "ARROW:extension:metadata" => ""
  "ARROW:extension:name"     => "JuliaLang.Dates.Period"

It seems like this kind of metadata is likely not very interesting, e.g. if I do DataFrame(table), I’m not sure I’m interested in seeing JuliaLang.Dates.Period stuff. However, one could add other metadata that is more interesting. I guess maybe there should be some way for Arrow.jl to communicate which metadata is internal and which isn’t.

Indeed there is, and you have just used it: the ARROW: prefix marks the reserved metadata namespace; see Arrow Columnar Format — Apache Arrow v14.0.1.

1 Like

I voted for option 3 and have read the documentation here: Metadata on data frame and column level by bkamins · Pull Request #3055 · JuliaData/DataFrames.jl · GitHub

I find this an excellent proposal, and very well thought out.
E.g. the rule that column transformations keep the metadata if the column name is kept: it is a good heuristic that if the user finds the transformation does not warrant a change of column name, the metadata is still valid.

It would be great in the future to have some way to easily update metadata in the mini-language similarly to how column names are updated.
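Until then, a hypothetical helper (nothing like this is in the proposal) could at least bundle the two steps, assuming the colmetadata! accessor from the PR:

using DataFrames

# Hypothetical convenience: transform a column in place and update its label
# metadata in one call.
function transform_with_label!(df, col, f, label)
    transform!(df, col => ByRow(f) => col)
    colmetadata!(df, col, "label", label; style=:note)
    return df
end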

My main use case for this will be to add descriptions to columns.
In documentation of an analysis I find it important to give definitions of column names.
My current ad-hoc approach is something like this:

julia> ph = DataFrame(time = [0,12,23], pH = [7.0, 6.0, 6.2])
3×2 DataFrame
 Row │ time   pH      
     │ Int64  Float64 
─────┼────────────────
   1 │     0      7.0
   2 │    12      6.0
   3 │    23      6.2

julia> ph_desc = DataFrame(column=["time", "pH"], Description=["Time in hours since start of incubation", "pH in fermentation broth"])
2×2 DataFrame
 Row │ column  Description                       
     │ String  String                            
─────┼───────────────────────────────────────────
   1 │ time    Time in hours since start of inc…
   2 │ pH      pH in fermentation broth

This has many shortcomings compared to a solution inside DataFrames.
With metadata on DataFrames I could write (and even share) a function to print the description close to the data in a report.
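A sketch of such a function, assuming a colmetadata accessor that takes a default value (signatures may differ from the PR):

using DataFrames

# Print each column name next to its "label" metadata entry, if any.
function print_descriptions(io::IO, df::AbstractDataFrame)
    for name in names(df)
        println(io, rpad(name, 10), colmetadata(df, name, "label", ""))
    end
end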

4 Likes

One more point with regards to sales and log_sales. Without metadata, you still don’t know what sales refers to. If people aren’t renaming variables after transformations (which they should!), metadata gives the option to keep a “simplistic” naming scheme while also letting users keep track of which variables are logged and which are not (see my Stata workflow above about transformations).

With regards to overall utility, I want to also emphasize that in Economics, Stata .dta files are the de-facto standard for distributing large datasets with many variables with short, opaque names. The following are variable names from the Compustat data.

  [1] "gvkey"         "datadate"      "fyear"         "indfmt"       
  [5] "consol"        "popsrc"        "datafmt"       "tic"          
  [9] "cusip"         "conm"          "acctchg"       "acctstd"      
 [13] "acqmeth"       "adrr"          "ajex"          "ajp"          
 [17] "bspr"          "compst"        "curcd"         "curncd"       
 [21] "currtr"        "curuscn"       "final"         "fyr"          
 [25] "ismod"         "ltcm"          "ogm"           "pddur"        
 [29] "scf"           "src"           "stalt"         "udpl"     

When working with this data in Julia, I’ve tried the Dict approach, and while it works okay, it can easily get out of sync with joins and merges. Adding new variables with their own labels means modifying a separate Dict. It’s unworkable.

People using Stata don’t have this problem, as variable labels provide all the information they need.

1 Like

I think we can learn from other languages.

Base R has general “attributes” that can be added to all objects including data.frames and columns in data.frames. I find this similar to option 1 above.

This has led to problems with other packages working on data.frames, as they sometimes will not propagate the attributes.

As an example mutate() started out dropping all attributes, but now preserves all attributes: mutate() drops attributes · Issue #1984 · tidyverse/dplyr · GitHub
In the current proposal, DataFrames is much more fine-grained: keeping metadata on the DataFrame and on preserved columns, but dropping it on columns that get renamed.

Another example is the interaction between group_by and mutate (data.frame attributes are preserved on `mutate()` but dropped on `group_by |> mutate` · Issue #6100 · tidyverse/dplyr · GitHub), giving fun stuff like this:

df <- data.frame(a=1)
attr(df, "caption") <- "Table 1"
library(dplyr)
library(purrr)
> df %>% attributes() %>% pluck("caption")
[1] "Table 1"
> df %>% mutate() %>% attributes() %>% pluck("caption")
[1] "Table 1"
> df %>% group_by(a) %>% attributes() %>% pluck("caption")
[1] "Table 1"
> df %>% group_by(a) %>% mutate() %>% attributes() %>% pluck("caption")
NULL

I think this argues for option 3 (well-designed heuristics inside the DataFrames package) and against option 1 (a generic solution that also applies to DataFrames).
It could also argue for option 2: always drop metadata, but I think that will devalue the metadata feature, as metadata will have to be explicitly re-added after each transformation.

As the feature as designed is fully opt-in, I find it reasonable that users who use it are aware of the risks of incorrect metadata, and drop or update metadata accordingly.

1 Like

Thanks for doing this. DataFrames has a lot of influence on what becomes the norm, but there are a lot of other places where we want to use something similar. I have thoughts about how this can be implemented in a general way to benefit more packages, but that conversation may take a while. In the meantime, I think the best way to keep progressing without hindering future progress is to just drop metadata eagerly.

So option 2 with a future 3.

1 Like

The effective use of metadata is key in many processing scenarios in addition to DataFrames processing. Some that come to mind:

  • Units can be metadata
  • Axis information for multidimensional data sets
  • Which dims are the image (x,y)
    • What type of pixel/channel data (RGB, Temp, time,…)
    • Calibration information
  • HDF5 has attribute metadata

I think more general support for metadata (as an interface, …?) would be good and might inform further development across the Julia package ecosystem.

There are many types of metadata, but I think what we are all talking about in this situation is some in-memory data that can have multiple metadata values bound to it, accessed and contextualized by a key. So we can agree that metadata(obj) should return a collection whose individual values can be accessed with a key. But that’s where things begin to diverge. A DataFrame is an in-memory dynamic data analysis tool; metadata should likely reflect that relationship by strongly avoiding dispatching on types. If we have an image (represented as a multidimensional array), then we often want to annotate dimensions/axes with some additional data that should be inferrible at compile time. For example, I think we usually want the return type of ImageCore.pixelspacing to be inferrible because it can play a role in high-performance image transformations. On the other hand, we may have metadata that provides a comment, like “dog”. We could probably manage to create and infer cats-and-dogs labels, but is that really worth the extra compilation time and complexity when all we want is metadata(obj)[:comment] == "dog"?
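To illustrate the tradeoff without any wrapper types (plain Julia, no assumptions about the metadata API):

inferrable = (pixelspacing = (0.5, 0.5),)           # NamedTuple: type known at compile time
dynamic    = Dict{Symbol,Any}(:comment => "dog")    # values are Any at compile time

inferrable.pixelspacing     # inferred as Tuple{Float64,Float64}
dynamic[:comment] == "dog"  # works fine, but the compiler only knows Any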

So far we’ve discussed propagation (drop, copy, or share metadata across operations) and inferrability/parametric typing that is permissive to runtime and compile-time optimizations. The approach we take to solving this determines how complex things become. I think we could start by compartmentalizing the rules into metadata and operations based on indices, each value, and the entire data instance. But then we need to figure out how we provide that context and at what point in the call graph it should come up. I’m open to discussing how to solve this, but I want it to be clear how complex this problem space is before anyone assumes that this discussion will solve the problem quickly and in a way that is universally pleasing.

That’s why I think the best approach for people to get metadata in DataFrames right now, without future problems, is just implementing metadata(df::DataFrame) = getfield(df, :metadata), with the assumption that it can be some key-value collection. Eagerly dropping metadata will allow us to make future solutions opt-in and not break the things we do now.

Just to clarify, metadata is already opt-in with the current proposal. No one has to use metadata, and there is zero performance cost if one chooses not to use it.

My point is that if we copy now, then later on we might switch to a more flexible solution, and people will wonder where their metadata went.

Maybe someone can elaborate on the limitations of adding metadata to Tables.jl rather than DataFrames.jl? I would think Tables.jl could attach metadata to both tables and columns, then DataFrames could decide what to do with that metadata during transformations. DataFrames could provide conveniences for the user to be explicit.

transform(df, operation; metadata=:keep, colmetadata=Dict(colname1.desc => "New column description."))
join(df1, df2; on=:colname, metadata=:append, colmetadata=:drop)
1 Like

This is more or less what the current implementation is, with a few caveats.

Tables.jl is an interface, not an implementation. It serves as a way for an object to communicate what functions it does and does not support. Tables.jl is designed to be a minimal interface: getting columns, iterating through columns and rows, etc. DataAPI.jl is a similar package which complements Tables.jl and provides a set of “foundational” data-related functions that other packages can extend. For example, every table-like object probably wants to implement nrow for the number of rows a table has.

What Bogumil’s PR does is define a set of functions that objects can “opt into” for accessing metadata. So new table-like objects which want to use metadata have a minimal API that they know to support. This includes metadata, colmetadata, hasmetadata and hascolmetadata.
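As an illustration, a hypothetical minimal opt-in for a new table type might look like the following (the function names are those listed above; the exact signatures in DataAPI.jl may differ):

import DataAPI

struct MyTable
    cols::NamedTuple
    meta::Dict{String,Any}
    colmeta::Dict{Symbol,Dict{String,Any}}
end

DataAPI.metadata(t::MyTable) = t.meta
DataAPI.hasmetadata(t::MyTable) = !isempty(t.meta)
DataAPI.colmetadata(t::MyTable, col::Symbol) = t.colmeta[col]
DataAPI.hascolmetadata(t::MyTable, col::Symbol) = haskey(t.colmeta, col)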

So when you say

I would think Tables.jl could attach metadata to both tables and columns

You are correct in that a 3rd package is providing an API to organize the attaching of metadata. You are also correct when you say

then DataFrames could decide what to do with that metadata during transformations

DataFrames.jl is exactly this implementation, and propagation is an implementation detail which is not defined in the DataAPI.jl metadata API.

Yes, DataFrames.jl could provide this kind of functionality to control the propagation of metadata. My argument is that explicit propagation of metadata for every operation would be very annoying. Also, more limited metadata propagation doesn’t make code any easier to write or maintain.

Rather, the current proposal in DataFrames.jl is very good: propagate on joins and subsets, and on transformations where the source and destination columns match. It’s a good API and follows Stata closely, which I appreciate a lot.

1 Like

I agree with this whole-heartedly. That’s why we eventually want a more elaborate interface to handle this.

We can’t know this statement is true without knowing what the final interface will be. For example, we might have should_copy_metadata get info from the table, the meta-datum, or the contextualizing key. Depending on how this is implemented we may need some default value. If the default is to drop metadata and we start off assuming that metadata is copied, then pipelines will stop working without notice. If we do it the other way around, then people will get unexpected propagation of metadata, and copying will slow things down but won’t break pipelines.

Most of these scenarios are already well-covered in Julia, though:

  • Units can be metadata

Use Unitful.jl.

  • Axis information for multidimensional data sets
  • Which dims are the image (x,y)

Use a keyed/named array package such as AxisKeys.jl or NamedDims.jl.

  • What type of pixel/channel data (RGB, Temp, time,…)

RGB - Colors.jl
Temp, time - builtin DateTime or Unitful.jl

I think what’s meant by metadata in the DataFrames context is a different beast.

Those are explicitly part of the type because they are common cases. Creating a new array type for metadata that behaves similarly to those use cases is inefficient at the very least. My example of pixelspacing shares many behaviors with those arrays. If we had a metadata interface we wouldn’t need to create a new array type for this. We’d just change the propagation rules.

Since this discussion now has a wider audience than the PRs on GitHub, it may be helpful to provide a superficial implementation of an interface and describe why it is difficult to settle on one. Let’s put together some basic types and traits that would provide support for the features we’re discussing.


# Trait describing what part of the data a metadatum is attached to.
abstract type MetaStyle end

struct MetaUnknown <: MetaStyle end  # no known association
struct MetaAxes <: MetaStyle end     # attached to the axes of the data
struct MetaDims <: MetaStyle end     # attached to the dimensions
struct MetaSelf <: MetaStyle end     # attached to the data instance as a whole
struct MetaValues <: MetaStyle end   # attached to individual values

struct MetaDynamic <: MetaStyle      # style resolved at runtime
    style::MetaStyle
end

# Trait describing how a metadatum propagates across operations.
abstract type PropagationStyle end

struct PropagateCopy <: PropagationStyle end   # copy metadata into the result
struct PropagateDrop <: PropagationStyle end   # drop metadata from the result
struct PropagateShare <: PropagationStyle end  # share (alias) metadata with the result

struct PropagateDynamic <: PropagationStyle    # propagation resolved at runtime
    style::PropagationStyle
end

People can add new types for special interactions but this provides basic support for additional contextualization of metadata and both runtime and compile-time optimizations. However, we need to figure out how this information is stored and accessed in relation to the data-metadata object. The following is one approach we could take that permits the traits above to provide specialized code and dynamic code.

struct MetaDatum{D,S,P}
    data::D           # the metadata value itself
    style::S          # a MetaStyle: what the value is attached to
    propagation::P    # a PropagationStyle: how the value propagates
end

struct MetaData{K,V,S,P,D<:AbstractDict{K,MetaDatum{V,S,P}}} <: AbstractDict{K,V}
    data::D
end

# dictionary interface for accessing metadata values (assume we have the full
# interface defined); unwrap the datum so lookups return values of type V,
# honoring the declared AbstractDict{K,V} supertype
Base.getindex(@nospecialize(x::MetaData), key) = getfield(getfield(x, :data)[key], :data)

propagate_metadata(::PropagateDrop) = false
propagate_metadata(x::PropagateDynamic)  = !(getfield(x, :style) isa PropagateDrop)
propagate_metadata(@nospecialize x::PropagationStyle) = true
propagate_metadata(@nospecialize x::MetaDatum) = propagate_metadata(getfield(x, :propagation))

# methods for managing propagation of metadata when indexing
index_metadata(::MetaAxes, ::PropagateShare, @nospecialize(data), inds::Tuple) = map(view, data, inds)
index_metadata(::MetaAxes, ::PropagateCopy, @nospecialize(data), inds::Tuple) = map(getindex, data, inds)
function index_metadata(::MetaAxes, p::PropagateDynamic, @nospecialize(data), inds::Tuple)
    if p.style isa PropagateCopy
        map(getindex, data, inds)
    elseif p.style isa PropagateShare
        map(view, data, inds)
    else
        error("unsupported PropagationStyle $(p.style)")
    end
end

index_metadata(::MetaStyle, ::PropagateShare, @nospecialize(data), inds::Tuple) = data
index_metadata(::MetaStyle, ::PropagateCopy, @nospecialize(data), inds::Tuple) = copy(data)
function index_metadata(::MetaStyle, p::PropagateDynamic, @nospecialize(data), inds::Tuple)
    if p.style isa PropagateCopy
        copy(data)
    elseif p.style isa PropagateShare
        data
    else
        error("unsupported PropagationStyle $(p.style)")
    end
end
function index_metadata(md::MetaDatum, inds::Tuple)
    s = getfield(md, :style)
    p = getfield(md, :propagation)
    MetaDatum(index_metadata(s, p, getfield(md, :data), inds), s, p)
end
function index_metadata(@nospecialize(md::MetaData), inds::Tuple)
    # filtering a Dict iterates key => datum pairs, so test propagation on the datum
    kept = Iterators.filter(kv -> propagate_metadata(kv.second), getfield(md, :data))
    MetaData(Dict(k => index_metadata(v, inds) for (k, v) in kept))
end

Note the (potentially excessive) use of @nospecialize. We want to ensure that we can have MetaDatum{Any,MetaDynamic,PropagateDynamic} and not create new method instances when dispatching on the individual fields. Yet we can still get new method instances for new types so that inference won’t fail us when it’s important. This is a bit oversimplified, but I think it still serves the purpose of illustrating how simple it is to support metadata types like this:

struct MetaTable{P,M}  # assume the Tables.jl API is properly defined for this:
    parent::P
    metadata::M
end

Base.parent(x::MetaTable) = getfield(x, :parent)

metadata(x::MetaTable) = getfield(x, :metadata)

function Base.getindex(x::MetaTable, r, c)
    MetaTable(parent(x)[r, c], index_metadata(metadata(x), (r, c)))
end
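A quick usage sketch of these toy types (assuming the elided parts of the dictionary interface are filled in):

meta = MetaData(Dict("label" => MetaDatum("Sales (USD)", MetaSelf(), PropagateShare())))
mt = MetaTable([1 2; 3 4], meta)
mt2 = mt[1:1, 1:2]        # indexing propagates each datum per its PropagationStyle
metadata(mt2)["label"]    # "Sales (USD)"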

Most operations that need special reference to metadata can be categorized as some combination of indexing, reduction, joining/concatenation/merging, or dimension permutation. A set of simple rules like this slowly generalizes once we have dedicated graph, table, and array types for metadata.

I want to be clear that I’m not trying to sell any particular approach here because there are pros and cons to all implementations. It might be better to store the style and propagation information in parallel collections:

struct MetaData{K,V,D<:AbstractDict{K,V},S,P} <: AbstractDict{K,V}
    data::D
    styles::S
    propagations::P
end

But then there’s the issue of mapping styles and propagations to the keys of data.

Another approach is to put all of this on the value type, requiring unique meta-datum types to explicitly define their methods:

index_metadata(datum, inds::Tuple) = datum

index_metadata(datum::MyAxesType, inds::Tuple) = datum[inds...]

function Base.getindex(x::MetaTable, r, c)
    MetaTable(parent(x)[r, c], map(Base.Fix2(index_metadata, (r, c)), metadata(x)))
end

However, the loss of traits here makes it a bit difficult to explicitly specialize and de-specialize on datum.

Another solution is to just accept that the term “metadata” has a definition throughout the Julia ecosystem that is too inclusive for what is trying to be accomplished with DataFrames here. I’ve taken some time to look at LLVM’s metadata, Clojure’s metadata, R’s attributes, and (to a lesser extent) Haskell’s metadata. The way this is being described seems a lot more like R’s attributes, where everything is dynamic and requires manual dispatch. I’ve seen the term “properties” used similarly in Julia (see ImageMetadata.jl).

I’ve rambled long enough here, so I’ll just end this by suggesting that the issue of propagation will be difficult to resolve if we don’t establish a common set of qualities that metadata has/permits.

2 Likes