How to add metadata info to a DataFrame?

Added an attempt at metadata here. I would be interested in people’s feedback and, since people will always be worried about performance, any help making the code as performant as possible, especially for those who do not intend to use metadata in their work.

I definitely like that you seem to have kept all of the new metadata stuff completely out of the existing column stuff and the underlying AbstractVectors.

That said, wouldn’t it make more sense to have another AbstractDataFrame for this, like DataFrameWithMetadata (or a less horrible name)? I’m a bit wary of tacking new features onto DataFrame which are only relevant for a small minority of use cases; to me the extreme simplicity of DataFrames is a huge part of the appeal, and I might even argue one of the reasons the package will still have for existing once JuliaDB matures. Another reservation I have about this concept is that it’s a bit hard to foresee the particular details of metadata use cases, so I’m not sure whether any of these attempts are both general enough to be applicable and specific enough to be useful. I wonder if energy is better spent making user-defined AbstractDataFrames easier to implement.


There is a lot of discussion about this already in this thread and others. I hope I have been able to convince even a few people that support for metadata living in data frames would have a wide user base. Everyone who uses Stata uses metadata, and everyone who works with survey data or large datasets from the World Bank or genomics needs metadata to stay organized. It’s also a useful tool for documenting large amounts of data cleaning. I hope that my implementation adds as small a footprint as possible for the physics-based and other communities that use the package.

I wasn’t arguing with that, I’m just wondering what the best way of going about it is (and not claiming to have the answer). Thanks for doing this, btw.


Though I do think that metadata should be an essential part of DataFrames, you bring up a good point, in that it is currently impossible to implement a new type of AbstractDataFrame. join(::AbstractDataFrame, ::AbstractDataFrame), for example, actually calls the DataFrame constructor. I’m not sure it’s a priority now to clarify a small set of functions that constitute the whole AbstractDataFrame API, but it should probably be on the radar.

I think it makes sense to add this to DataFrame, given that the overhead should be negligible when there is no meta-data (and actually even when there is). A custom AbstractDataFrame type would have to duplicate a lot of code just to add this feature.


One argument is that if you use a separate AbstractDataFrame, there’s no limit on what metadata-related features you might add to it, whereas with a DataFrame that may or may not be true. (I suppose that depends on what people are thinking of adding.) Remember that in some cases it’s appropriate to create these things in the millions (e.g. returns from by operations) so even constructors should be kept as simple as possible.

You should take a look at my PR, because constructors are still something to work out. So far all new constructors just have metadata as an empty dict.

Another alternative is to use sparse vectors to store metadata, if it works out that they take less space. I’d be interested to hear your comments on it on GitHub.

I don’t think that’s a problem. If we set the metadata field to nothing by default, it will have a really low cost. And even if we set it to an empty Dict, it will be quite cheap. (BTW, for by, we should allow returning named tuples or arrays thereof, which will be much more efficient.)
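The low-cost default described above can be sketched with a toy wrapper (this is purely illustrative; `TableWithMeta` and `hasmeta` are made-up names, not the actual DataFrames internals):

```julia
# A wrapper whose metadata field defaults to `nothing`, so tables that
# never touch metadata pay essentially nothing for the feature.
struct TableWithMeta{T}
    table::T
    metadata::Union{Nothing, Dict{Symbol, Any}}
end
TableWithMeta(table) = TableWithMeta(table, nothing)  # cheap default: no Dict allocated

hasmeta(t::TableWithMeta) = t.metadata !== nothing
```

With this shape, constructing millions of metadata-free tables (e.g. from `by`) only stores a `nothing` reference per table; a `Dict` is allocated only for tables that actually carry metadata.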


I’ve recently been thinking about another reason why I think this should in some part depend on other AbstractDataFrame implementations rather than just adding metadata to DataFrame (perhaps some of both).

I often find it very useful to dispatch on particular “datasets”. Lately I’ve been developing a pattern of having dataset Designators. Functions then dispatch on those and provide some metadata for me, among other things. I was thinking that it would be really cool if it were trivially easy to implement AbstractDataFrames: another abstract type T <: AbstractDataFrame, or a parametric type, that I can then dispatch on. I’ve been finding patterns like this to be very useful for writing code with easy-to-follow “pipelines” for reformatting badly formatted data.


Can you explain the Designator idea a bit more? I’m intrigued, but don’t think I actually grok what you mean.


In a previous life my data was almost always hierarchical, and sometimes I made them into C++ objects equipped with useful methods. Which methods were useful would depend on what the structure is like.

Now, I often have several different tables representing different data. I always have to do a significant amount of processing to get these into a useful form, and what I have to do depends on what the table represents. In the best case scenario, these tables ultimately become more pedestrian Julia objects with some numeric and Array fields with appropriate methods. Part of the reason I have found it so hard to get away from tabular formats is not just that that’s what I’m given, but that it’s usually so difficult to get a straight answer on what the data actually represents and what I can do with it; the ability to do relational database operations can often partially mitigate the pain.

Anyway, just because the various different datasets aren’t exactly the same doesn’t mean that no code can be shared between them, but in Julia the fact that they are all DataFrames can make it more difficult than it normally would be to make code generic. For example, a typical pattern I often have is to just define lots of functions for loading and pre-processing that look eerily like they would in Python

function load_dataset_1()
    # code here
end

function load_dataset_2()
    # other code here
end

function do_stuff_to_dataset_1(df::AbstractDataFrame)
    # more code here
end

function do_stuff_to_dataset_2(df::AbstractDataFrame)
    # some other things here
end

This pattern is terrible because it is very non-generic and comes with a very high risk of changes not propagating correctly. However, I’m increasingly replacing this pattern with something like

abstract type Designator end
struct DataSet1 <: Designator
    # some metadata fields
end
struct DataSet2 <: Designator
    # some other metadata fields
end

function load(des::Designator)
    df = loadraw(des)        # loads data from somewhere
    df = prep(des, df)       # initial 1-to-1 transformations
    df = aggregate(des, df)  # many-to-1 aggregations (extending DataFrames function)
    df = postprep(des, df)   # final 1-to-1 transformations
end

I find that this is usually much more conducive to writing generic code. In practice I’ve been noticing that I can typically share a great deal of code between loadraw, prep and postprep while aggregate is usually quite different for each dataset (other than always using by). (I’ve also left out joins here which can get a little more complicated.) I’ve also been finding this sort of pattern makes it much easier for me to keep track of what is happening to my data sets and to make fewer mistakes. Exactly what metadata the Designator holds depends on the application, but for example it may contain information needed to locate a file to load, some sort of date range or a database connection string.
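To make the shared-code point concrete, here is a stripped-down, runnable toy version of the pattern (placeholder function bodies stand in for real loading and cleaning; all names besides `Designator` are illustrative):

```julia
# Dispatch-on-designator pattern: generic fallbacks shared by all
# datasets, with only `aggregate` specialized per dataset.
abstract type Designator end
struct DataSet1 <: Designator end
struct DataSet2 <: Designator end

# shared defaults, written once for the abstract type
prep(::Designator, rows) = rows      # 1-to-1 transformations (no-op here)
postprep(::Designator, rows) = rows  # final transformations (no-op here)

# dataset-specific step, overridden per concrete type
aggregate(::DataSet1, rows) = sum(rows)
aggregate(::DataSet2, rows) = maximum(rows)

function process(des::Designator, rows)
    rows = prep(des, rows)
    x = aggregate(des, rows)
    postprep(des, x)
end
```

Adding a third dataset then only requires a new struct plus whichever steps actually differ; the shared fallbacks keep changes propagating automatically.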

If it were easy to create subtypes of AbstractDataFrame, it might be nice to combine this all into a single object. A rough example that I haven’t thought too carefully about:

abstract type ProjectDataFrame <: AbstractDataFrame end
struct DataSet1 <: ProjectDataFrame
    # I don't have any great suggestions yet about how this should work,
    # but presumably this would hold a regular DataFrame among other things
end
struct DataSet2 <: ProjectDataFrame end

function load(df::ProjectDataFrame) # perhaps start with an empty dataframe
    df = loadraw(df)  # populate with actual data
    df = prep(df)
    df = aggregate(df)
    df = postprep(df)
end

# this isn't a real suggestion, just a demonstration of the pattern
colmetadata(df, :col1) # how metadata works can now depend on the dataset

This wouldn’t be the only way of doing it. Another possibility would be to have a parametric type like DataFrameWithMetadata{D<:Designator} or some combination thereof. I don’t have strong opinions about what the right way of doing it is as I’ve been developing this pattern relatively recently.
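The parametric alternative could look something like this (a sketch only; `Survey`, `Genomics`, and `units` are hypothetical names invented for the example):

```julia
# The designator becomes a type parameter, so behavior can be attached
# to a dataset via ordinary method dispatch.
abstract type Designator end
struct Survey <: Designator end
struct Genomics <: Designator end

struct DataFrameWithMetadata{D<:Designator, T}
    table::T  # in practice this would hold a regular DataFrame
end
DataFrameWithMetadata{D}(table::T) where {D<:Designator, T} =
    DataFrameWithMetadata{D, T}(table)

# per-dataset behavior, dispatched on the type parameter
units(::DataFrameWithMetadata{Survey}) = "responses"
units(::DataFrameWithMetadata{Genomics}) = "base pairs"
```

Here the metadata lives partly in the type itself, so how it is interpreted can differ per dataset without any runtime branching.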

So the gist of what I’m saying is this: Yes, it’s nice to have the ability to keep metadata with DataFrames, but it’s even more important to define behavior for tables differently depending on their metadata. Simply keeping some Dicts with your columns is fine, but it doesn’t address this issue, and I fear it may prove to be far too inflexible. As I tried to demonstrate a little bit here, yes, I do always have metadata that I have to keep track of, but it’s hard for me to say a priori how I will use it for any particular data set. I suspect this is part of the reason why including metadata within a data frame object of some kind isn’t a more common pattern.

Again, this approach also has the added benefit of keeping the DataFrame object itself as simple as possible, which has a lot of appeal for me: ultimately I work on numeric problems, and I found the simple Julia DataFrames a joy to work with compared to the unwieldy pandas equivalents, which held a whole bunch of numpy arrays that never seemed to be correctly typed, unhackable C methods, and the inability to use generic stdlib tools because they are just too slow. (That’s not to say we’d suddenly have all these problems in Julia if we were to just add metadata; of course we wouldn’t.)


Wow! Do you really think this is the Pythonic way to do what you want to do?

It seems to be a common type of pattern that I see in Python code, I’m not sure if I’d call it “Pythonic”. It probably would have been better just to say that it is far less generic than Julia code often is. A better way to do things in Python would probably be to create classes that inherit from pandas dataframes, which I have occasionally seen done. I suppose with Python being an OO language inheritance would probably be more “Pythonic” than just writing some functions, whatever that means.


This is a great explanation, thanks! Still trying to imagine the implications, but I think I might try to steal the idea for my own stuff :laughing: . Really clever way to use julia’s unique strengths…


Yes, please do this. I haven’t really been doing things this way long enough to work out all the implications myself, so having other people doing it and comparing notes would be really good.

My main point was that (for me at least) metadata is as much about behavior as it is about being able to access it!


You’ll really see the same in Julia once it becomes more popular and average coders use it. :stuck_out_tongue:

Yes. The solution you are proposing here is easily implementable in Python’s OO too. (I’m not saying that is true for other problems or solutions.)

Dear all,
this topic has been inactive for a few months, but I just wanted to add that I implemented a package to solve the first problem posted, namely how to add metadata information to a DataFrame.

The trick is to use composition and the ReusePatterns package to automatically forward all method calls from one type to another. The relevant code is:

using ReusePatterns
struct DataFrameMeta <: AbstractDataFrame
    meta::Dict{String, Any}
    DataFrameMeta(args...; kw...) = new(DataFrame(args...; kw...), Dict{Symbol, Any}())
    DataFrameMeta(df::DataFrame) = new(df, Dict{Symbol, Any}())
meta(d::DataFrameMeta) = getfield(d,:meta)  # <-- new functionality added to DataFrameMeta
@forward((DataFrameMeta, :p), DataFrame)    # <-- reuse all existing functionalities

while the whole example can be found here.

With the above code we can use an object of type DataFrameMeta as if it was a simple DataFrame, but taking advantage of the new meta method to gain access to the associated metadata dictionary.
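For readers who prefer to see the composition trick without any package dependency, the same idea can be hand-rolled in a few lines (illustrative only; `WithMeta` and `meta` are made-up names, and a named tuple of columns stands in for a real DataFrame):

```julia
# Composition: wrap a table plus a metadata dict, and forward property
# access to the wrapped object. ReusePatterns' @forward generates this
# kind of forwarding for every DataFrame method, not just getproperty.
struct WithMeta{T}
    p::T
    meta::Dict{Symbol, Any}
end
WithMeta(x) = WithMeta(x, Dict{Symbol, Any}())

meta(w::WithMeta) = getfield(w, :meta)  # access the metadata dictionary

# forward field/column access to the wrapped object
Base.getproperty(w::WithMeta, s::Symbol) = getproperty(getfield(w, :p), s)
```

Usage: `w = WithMeta((a = [1, 2], b = [3, 4]))`, then `w.a` reaches through to the wrapped columns while `meta(w)[:source] = "file.csv"` records metadata alongside them.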

Comments are welcome!


This may be a basic question…

Suppose I define a new type of dataframe (a Panel dataframe) akin to your DataFrameMeta. I forward all functions in the DataFrames module to my Panel type using your @forward macro.

I am doing this in a module of my own, and would like to export all of the functions defined by @forward. This way, I can import a Panel module and get functions specific to panel dataframes.

I have no idea how to export the functions defined by @forward. I am not even sure if this is a good idea.


Would it be better to call it MetaDataFrame, since there is already a DataFramesMeta package?
