From my database background, it seems natural to me to have both metadata associated with the table as a whole, and with each column.
Why is there a debate over one or the other? They are not at all mutually exclusive.
I can see metadata that makes sense for individual values (e.g. units of measurement), some for a whole column ("all counties in the US") and some for a whole table. For quite a lot of stuff it is not super clear to me at which level that info belongs…
File formats seem to support varying types of metadata. Some support table level metadata, some column level metadata, and it is not clear to me whether any support value level metadata.
I think my gut reaction to this would be to support something that allows us to support the stuff that can be stored in the various file formats, and then stop there… Except for column selection, it seems to me that any column level metadata really can't be preserved in a meaningful way through query operations. Even something like a filter operation should probably not preserve the metadata of a column (e.g. if the column metadata says "complete list of US counties").
Apart from the question about the logical data model here, there is then of course the question of implementation. It seems to me that for units of measure one would probably want to encode that in the value type. Beyond that, I'm not really sure…
I think for Query.jl and TableTraits.jl maybe the following would most make sense for now:
- Query.jl operators just ignore any metadata, i.e. you lose your metadata when you pipe it through any operator, except if the metadata is embedded in the column type (say via a number type that encodes units). It just seems almost impossible to figure out what the right semantics would be otherwise.
- For TableTraits.jl I could imagine adding an optional interface for metadata. That would mainly enable two scenarios: loading and saving from disc could support metadata (if a file format supports it), and conversion between different table types could preserve metadata. Oh, and I guess a plotting library could in theory use the metadata, as long as it uses a table type or something from disc directly…
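To make the second bullet concrete, here is a purely hypothetical sketch of what such an optional interface could look like. None of these names (table_metadata, column_metadata, collect_metadata, ToySource) exist in TableTraits.jl; they are invented for illustration only.

```julia
# Hypothetical sketch of an optional metadata interface; NOT an actual
# TableTraits.jl API -- all names here are made up for illustration.

# Fallbacks: sources that don't opt in simply report no metadata.
table_metadata(source) = nothing
column_metadata(source, colname::Symbol) = nothing

# A toy source (think: a reader for a file format with metadata) opts in
# by defining the two functions for its own type.
struct ToySource
    tablemeta::Dict{String,Any}
    colmeta::Dict{Symbol,Dict{String,Any}}
end
table_metadata(s::ToySource) = s.tablemeta
column_metadata(s::ToySource, c::Symbol) = get(s.colmeta, c, nothing)

# A sink (a writer, a table constructor, a plotting package) can then
# gather whatever the source provides, without caring about its type.
function collect_metadata(source, colnames)
    out = Dict{Any,Any}(:table => table_metadata(source))
    for c in colnames
        out[c] = column_metadata(source, c)
    end
    return out
end
```

A source that defines neither method just hits the `nothing` fallbacks, so a sink can query any source unconditionally and forward only what it understands.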
@nalimilan: If it is just labels / explanations, I don't see any problems. But in that case, why not just create a label field?
@pdeffebach: Dropping metadata when in doubt (eg after a transformation) is a reasonable approach.
The question is not one or the other, it's how to implement column-specific meta-data.
On the contrary, column meta-data should always be preserved when selecting a subset of rows (this is what e.g. Stata does). Without this, meta-data wouldn't be very useful, as it wouldn't be available in many actual data sets people work with, which are often subsets of a larger original dataset. I don't think the counter-example "complete list of US counties" applies: it isn't column-specific meta-data, it describes the whole data set (i.e. you have one row for each county). Typical meta-data describes the contents of a column ("US county"), and we can make this a rule if we want.
I think it would make sense to be able to preserve column meta-data when selecting columns without transforming them. Of course in terms of implementation, that's not trivial. But in terms of semantics I think it's clear what the behavior should be. Otherwise we're going to get complaints from people who wonder why their meta-data was lost when subsetting using Query while it is preserved when doing the same operation with getindex.
Because labels are not the only kind of meta-data you may want to store. As I noted, survey questions or notes are also frequently useful, and for internal use one may want to define special keys. So having a more general system sounds useful.
One thing I still don't understand is why this needs to be implemented by DataFrames, not just another <: AbstractVector type that wraps another <: AbstractVector + metadata, and implements a metadata(::AbstractVector) = nothing (or similar) fallback. It is my understanding that

- this would work with DataFrames out of the box, like all <: AbstractVector types do,
- subsetting either columns or rows (with Base.getindex etc implemented) would just work, too,
- transformations and maps would automatically drop the metadata.

Maybe Base.similar should propagate the metadata though for this vector type, unless element types differ.
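The wrapper sketched in words above might look roughly like this. It is a toy version for discussion, not MetadataArrays.jl itself; the metadata function, the Dict{Symbol,Any} payload, and the choice to rewrap on row subsetting but drop metadata in similar are just one possible set of semantics.

```julia
# Toy metadata-carrying vector: wraps any AbstractVector plus a Dict.
struct MetaVector{T, A<:AbstractVector{T}} <: AbstractVector{T}
    data::A
    meta::Dict{Symbol,Any}
end

# Fallback: plain vectors have no metadata.
metadata(::AbstractVector) = nothing
metadata(v::MetaVector) = v.meta

# Minimal AbstractArray interface, delegating to the wrapped vector.
Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]

# Row subsetting rewraps the result, so metadata survives row selection.
Base.getindex(v::MetaVector, I::AbstractVector{Int}) =
    MetaVector(v.data[I], v.meta)

# `similar` returns a plain vector, i.e. metadata is dropped; transforms
# like `map` go through `similar`, so they drop it automatically too.
Base.similar(v::MetaVector, ::Type{T}, dims::Dims) where {T} =
    similar(v.data, T, dims)
```

With these definitions, a vector built as MetaVector([1.0, 2.0], Dict{Symbol,Any}(:label => "GDP per capita")) keeps its label through row subsetting, while similar and map return plain vectors without metadata.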
I think that the need for metadata becomes compelling when you read data from a source (a website, a file, a database, etc.), pass them to another environment (e.g. a data analysis package, plotting package, etc.), and want to ensure that all the relevant information is delivered, not just the numbers.
Therefore, an array-based approach (such as MetadataArrays) seems to me at best limited, and my proposal was indeed to attach metadata to "complete" data structures such as a DataFrame. Clearly other data structures can benefit as well from metadata facilities, but DataFrame is the first that pops into my mind.
I agree with @davidanthoff:
For instance, supporting the FITS file format amounts exactly to what I'm proposing here and can be readily implemented with the metadata support implemented in my PR.
Finally, besides the obvious unit/plot label, it seems to me that no important use case for column metadata has been illustrated here. Only @kevbonham provided a use case:
But I didn't understand what the rows of such a table contain…
Just to clarify: I was talking about an implementation of adding metadata to DataFrames by putting it in the columns as an <: AbstractVector wrapper.
I see this as a more modular and generic approach, which would work for vectors contained in and outside of DataFrames.
@Tamas_Papp: I think it is maybe a good time to decide on the semantics of this <: AbstractVector type. I've implemented more or less exactly what you describe in MetadataArrays.jl, with two differences:

- It errors instead of returning nothing as a fallback. I'd tend to agree with you that returning nothing makes more sense.
- Base.similar always keeps metadata in my implementation. What do you think is best? Always keep it, never keep it (and put nothing instead?), or only keep it if element type is preserved, and put nothing otherwise?
I have given a few potential reasons above. Have you missed them? I'm not saying they are totally decisive, but at least they show things are more complex than that.
Obviously the array-based approach would only be useful to store column meta-data. Something else would still be needed for global meta-data. But that doesn't mean array-based meta-data isn't useful. Let's not mix these two design decisions, which are completely orthogonal.
I've posted several examples of use cases and implementations in R, and Stata also supports column labels, so I think it's clear it's considered useful by a lot of people. Anyway there's no need to oppose column meta-data to global meta-data, so if the former isn't useful for you, you can just concentrate on the latter.
I guess it depends on what metadata returns, but if it returns a Dict as in MetaDataArrays currently, then it should either throw an error or return an empty dict as a fallback. But really that's a secondary issue that would better be discussed in the GitHub project; we already have too many questions in this topic.
I'd say similar should drop meta-data. "Similar" isn't "identical": it's used to create a vector of the same type and shape, but it can be filled with anything. That the input is "GDP per capita" doesn't mean the output will also be "GDP per capita" (or you'd just call copy). As a data point, similar(::CategoricalArray) does not preserve levels, because you are likely to put completely different data in the resulting array.
- It should return what can be construed as valid metadata in the API. Eg if metadata is a Dict{Any,Any}, it should return Dict{Any,Any}().
- I use Base.similar for constructing empty containers when an algorithm is inconvenient to express otherwise, so I would keep the metadata.
If you are referring to this, I did not miss it, I just had the impression that you were arguing that while the implementation poses some challenges, it has the advantage of allowing metadata outside dataframes.
It is likely that I did not express myself clearly, but I was trying to ask about metadata tied to the whole dataframe, not to columns.
Fair enough!
Moreover, in both cases we deal with metadata, but the implementation details can be quite different.
Hence I propose to clearly state whether a given comment/suggestion/proposal aims to deal with column or global metadata or both.
Mine deals with both, but is admittedly limited to DataFrames objects.
I'm concerned that having the metadata attached to the columns themselves is not generic: it only makes sense for a column-major store (and most databases store rows, except for a few such as Vertica).
For data coming from a database, you'd have metadata about the database, the tables in the database, and the columns in each table (which you'd get back as a vector, with the type information, labels etc. about each column in the rows).
If you do want to select some subset of the columns, it's very easy in Julia to index that vector of metadata with the range or a vector of indexes to get that subset, just as you'd do with the columns themselves.
Maybe what's needed is an "AbstractTable" (or is there already one?), which has an API for metadata, not limited to DataFrames, handling both the table- and column-level metadata.
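In the absence of an existing AbstractTable, here is one purely hypothetical shape such an API could take (every name below is invented for illustration): a wrapper holding one dict of table-level metadata plus one dict per column, where column selection just indexes into the latter, as described above.

```julia
# Hypothetical table-metadata API sketch; no such interface exists yet.
abstract type AbstractMetaTable end

struct MetaTable{T} <: AbstractMetaTable
    data::T                                  # any table-like object
    tablemeta::Dict{String,Any}              # table-level metadata
    colmeta::Dict{Symbol,Dict{String,Any}}   # per-column metadata
end

tablemetadata(t::MetaTable) = t.tablemeta
colmetadata(t::MetaTable, c::Symbol) = get(t.colmeta, c, Dict{String,Any}())

# Selecting a subset of columns keeps only the matching column metadata,
# echoing the vector-of-metadata indexing described above.
selectmeta(t::MetaTable, cols::Vector{Symbol}) =
    Dict(c => colmetadata(t, c) for c in cols)
```

Whether the wrapped data is row- or column-oriented is invisible at this level, which is the point of keeping the metadata API separate from the storage layout.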
Having a delegation mechanism like:
mutable struct MetadataContainer{T}
meta::Dict
data::T
end
# add metadata to DataFrame
@delegate DataFrame MetadataContainer{DataFrame} ".data"
# add metadata to arrays
@delegate AbstractArray MetadataContainer{AbstractArray} ".data"
# add metadata to Int numbers
@delegate Int MetadataContainer{Int} ".data"
would be the best choice. After all, it's just a matter of adding appropriate entries in the dispatch table, i.e.:
showcols(p1::MetadataContainer{DataFrame}) = showcols(p1.data)
This can be done through TypedDelegation, although all methods must be listed one by one.
A macro which automatically does this job, based on the methods returned by methodswith, would be really useful. Does something similar already exist? Or will core Julia ever support delegation?
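Short of a fully automatic macro, the forwarding can at least be generated in a loop with @eval over an explicit list of functions. The MetadataContainer definition from the post above is restated so the snippet runs on its own; a methodswith-based macro would additionally have to cope with keyword arguments and with methods that take several wrapped arguments, which is where it gets hard.

```julia
# Restates the wrapper from the earlier post so this runs standalone.
mutable struct MetadataContainer{T}
    meta::Dict{String,Any}
    data::T
end

# Generate simple forwarding methods with @eval: each listed Base
# function is redirected to the wrapped object, with extra positional
# arguments passed through untouched.
for f in (:length, :size, :getindex, :lastindex, :iterate)
    @eval Base.$f(m::MetadataContainer, args...) = Base.$f(m.data, args...)
end
```

This is essentially what TypedDelegation automates per-function; the open question in the thread is generating the list itself.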
Sorry but I don't follow. Storage is orthogonal to the concept of "columns" (values with homogeneous types along one coordinate). Column-specific metadata would make sense even if one stored rows as tuples.
Yes, that would be nice, but currently no such thing exists AFAIK.
Note, however, that it wouldn't magically solve all problems: if you want getindex to preserve meta-data, you need to implement custom methods which delegate to the wrapped vector/data frame, and rewrap the result in a MetadataContainer. But at least by default all methods would work (and discard meta-data).
Let me be concrete here. I work with human biological data, specifically microbiomes. A typical dataset is a series (on the order of hundreds) of samples, each of which has thousands to tens of thousands of measurements (the relative abundances of various microbial species). The way the data is currently structured is as a sparse matrix with samples as columns and rows as microbial features. Eg:
using DataFrames
taxa = DataFrame(species=["species_$x" for x in 1:10],
sample1=rand(10),
sample2=rand(10),
sample3=rand(10))
Many of these samples also have a ton of relevant clinical metadata associated with them, such as patient age, disease diagnosis, medication history etc. Usually, this comes to me in a separate table where rows are samples and columns are the type of information:
metadata = DataFrame(sample=["sample1", "sample2", "sample3"],
age_in_years=[30, 25, 57],
diagnosis=["Healthy Control", "RA", missing],
antibiotics=[true, false, missing])
The metadata is often incomplete: I only have certain information for certain samples, and some samples don't have any metadata. Some of my analyses don't require any metadata, or only require some of it, so I often have a complex series of dicts and comprehensions that I have to do in order to subset data in various ways. This throws the order of things out of whack (I've been bitten by not keeping track of the order of arrays), so I often use dicts of dicts etc., but I have to deal with missing values all the time and it's an utter mess. I'd say trying to cope with this stuff is between 70 and 85% of my analysis time.
So to be clear and restate things: I need column-level metadata. The implementation details (whether it's stored with the DataFrame or attached to the vectors themselves) matter less to me, but I would need the metadata to follow columns in subsetting and views (I've recently started mixing in Query, but I mostly do it with indexing and view()).
Another wrinkle to consider: I rather like the idea of tying this stuff to a generic table implementation, since SpatialEcology (which my package is now based on) uses CommMatrix objects, wrappers around sparse matrices that have special functions associated with them. Having a way to use metadata generically across many types of data representations (including DataTables, SparseMatrices… basically everything supported by IterableTables) would be really lovely (though I'm also sure it's a lot of work; it would save me a heap of time, but I don't have the time to implement it well).
Yeah, this would be amazing. @mkborregaard wrote a macro in SpatialEcology that seems similar to the type delegation you linked to, but also requires explicitly pointing out the methods you want forwarded.
I suspect it's slightly (or much) more complicated, because methods that operate on multiple arguments also need to be included.
I am not sure this is metadata; this should fit nicely in dataframes that describe various levels of the experiment, and should be amenable to formatting to "tidy data" using join. Am I missing something?
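For what it's worth, the join-based reshaping suggested here can be sketched with the two tables from the earlier post. This assumes the current DataFrames.jl reshaping and join functions (stack, leftjoin, Not); the column names match that example.

```julia
using DataFrames

# Wide abundance table: one row per species, one column per sample.
taxa = DataFrame(species=["species_$x" for x in 1:10],
                 sample1=rand(10), sample2=rand(10), sample3=rand(10))

# Per-sample clinical table, possibly incomplete.
metadata = DataFrame(sample=["sample1", "sample2", "sample3"],
                     age_in_years=[30, 25, 57],
                     diagnosis=["Healthy Control", "RA", missing],
                     antibiotics=[true, false, missing])

# Melt to long ("tidy") form: one row per (species, sample) pair ...
long = stack(taxa, Not(:species);
             variable_name=:sample, value_name=:abundance)
long.sample = String.(long.sample)

# ... then attach the clinical columns by joining on the sample id.
tidy = leftjoin(long, metadata, on=:sample)
```

Subsetting tidy by any clinical variable then keeps abundances and clinical data aligned in one table, with missing propagating naturally for samples that lack it.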