The discussion on metadata in DataFrames.jl started 10 years ago; see this issue.
In DataFrames.jl we decided to finally settle the issue. The relevant PR is here.
The fact that the metadata issue has been open for 10 years reflects the complexity of this decision. In the end we have three major options to choose from. In this post I would like to open a vote on which of the approaches the community would prefer most. All comments are welcome.
Let me summarize the options:
- Do not add metadata to DataFrames.jl; instead develop some other, more generic mechanism for Tables.jl tables that DataFrames.jl could leverage:
  - pros: this would be the most composable, most Julian, and easiest to maintain from the DataFrames.jl side, as it would not add any complexity to the package;
  - cons: it is not clear if it will ever happen (cf. the general-purpose Metadata.jl package by Tokazama, a "generic interface for attaching metadata to stuff"; it is not Tables.jl specific, and we want something that could seamlessly integrate with persistence options offered by packages like Arrow.jl or Parquet2.jl).
- Add metadata to the `DataFrame` object but do not propagate it under transformations:
  - pros: metadata will never be propagated incorrectly;
  - cons: if you want to propagate metadata, you have to do it manually as a follow-up to the transformation.
- Use a complex set of rules for how metadata should be propagated (this is the functionality implemented in PR #3055, "Metadata on data frame and column level", in JuliaData/DataFrames.jl; you can check the documentation section of the PR for the rules):
  - pros: most of the time, metadata gets propagated in the cases where it makes sense to propagate it;
  - cons: sometimes metadata is propagated incorrectly, in which case you need to call the `dropmetadata!` function on a data frame to remove metadata from it. Fortunately this is a very cheap and easily chained operation, so it is relatively easy to fall back to option 2 if one does not want metadata propagation. Essentially, this design means that preferably only "very stable" metadata (like a descriptive column label or the source of data for the column) should be stored. Some ecosystems store "fragile" metadata like the column mean, and such metadata could be invalidated when e.g. taking a subset of a data frame.
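To make the difference between options 2 and 3 more concrete, here is a minimal sketch in plain Julia. It is only a toy model: `ToyTable`, `subset_drop`, and `subset_propagate` are hypothetical names invented for this illustration. They are not part of DataFrames.jl, and the "propagate" variant simply copies everything, which is much cruder than the rules proposed in the PR.

```julia
# Hypothetical toy type for illustration only; NOT the DataFrames.jl API.
struct ToyTable
    x::Vector{Float64}
    meta::Dict{String,Any}
end

# Option 2 (sketch): a transformation returns a table with empty metadata.
subset_drop(t::ToyTable, keep) = ToyTable(t.x[keep], Dict{String,Any}())

# Option 3 (heavily simplified sketch): a transformation copies metadata unchanged.
subset_propagate(t::ToyTable, keep) = ToyTable(t.x[keep], copy(t.meta))

t = ToyTable([1.0, 2.0, 3.0, 4.0],
             Dict{String,Any}("label" => "height in cm",  # "very stable" metadata
                              "mean" => 2.5))             # "fragile" metadata

keep = t.x .> 2.0            # boolean mask selecting the rows 3.0 and 4.0
a = subset_drop(t, keep)
b = subset_propagate(t, keep)

println(isempty(a.meta))             # option 2: nothing is carried over
println(b.meta["label"])             # option 3: the label is still valid
println(b.meta["mean"])              # option 3: the stored mean was copied as-is
println(sum(b.x) / length(b.x))      # the actual mean of the subset differs
```

In this sketch the label survives the subset correctly, while the stored mean becomes stale. That is the kind of incorrect propagation option 3 accepts and option 2 avoids at the cost of manual work.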
Could you please vote below on what you would prefer us to do? Thank you!
- No metadata in DataFrames.jl
- Add metadata to `DataFrame`, but never propagate metadata
- Add metadata to `DataFrame`, and propagate it (accepting that it will be incorrect sometimes)
- Other option (please comment)
Your input will be much appreciated (and will help us finally resolve an issue that has been open for 10 years now and on which we have already failed several times to reach a decision).