How to add metadata info to a DataFrame?

ExpandingMan · June 15, 2018, 6:34pm

I wasn’t arguing with that, I’m just wondering what the best way of going about it is (and not claiming to have the answer). Thanks for doing this, btw.

pdeffebach · June 15, 2018, 7:24pm

Thanks

Though I do think that metadata should be an essential part of DataFrames, you bring up a good point, in that it is currently impossible to implement a new type of AbstractDataFrame. join(::AbstractDataFrame, ::AbstractDataFrame), for example, actually calls the DataFrame constructor. I’m not sure it’s a priority right now to pin down the small set of functions that constitute the whole AbstractDataFrame API, but it should probably be on the radar.

nalimilan · June 15, 2018, 8:19pm

I think it makes sense to add this to DataFrame, given that the overhead should be negligible when there is no meta-data (and actually even when there is). A custom AbstractDataFrame type would have to duplicate a lot of code just to add this feature.

ExpandingMan · June 18, 2018, 1:31pm

One argument is that if you use a separate AbstractDataFrame, there’s no limit on what metadata-related features you might add to it, whereas with a DataFrame that may or may not be true. (I suppose that depends on what people are thinking of adding.) Remember that in some cases it’s appropriate to create these things in the millions (e.g. returns from by operations) so even constructors should be kept as simple as possible.

pdeffebach · June 18, 2018, 4:47pm

You should take a look at my PR, because constructors are still something to work out. So far all new constructors just have metadata as an empty dict.

Another alternative is to use sparse vectors to store metadata, if it turns out that they take up less space. I’d be interested to hear your comments on it on GitHub.

nalimilan · June 18, 2018, 7:51pm

I don’t think that’s a problem. If we set the metadata field to nothing by default, it will have a really low cost. And even if we set it to an empty Dict, it will be quite cheap. (BTW, for by, we should allow returning named tuples or arrays thereof, which will be much more efficient.)
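To make the cost argument concrete, here is a minimal sketch of a metadata field that defaults to nothing and only gets a Dict on first write. TinyTable, getmeta and setmeta! are hypothetical names made up for this illustration; this is not how DataFrames.jl or the PR implements it.

```julia
# Hypothetical stand-in for a table type; NOT the DataFrames.jl implementation.
mutable struct TinyTable
    columns::Vector{Any}
    names::Vector{Symbol}
    # `nothing` means "no metadata"; a Dict is only allocated on demand.
    metadata::Union{Nothing, Dict{Symbol, Any}}
end

# Default construction path: no metadata, so no Dict is allocated.
TinyTable(columns, names) = TinyTable(columns, names, nothing)

# Reading never allocates: a missing Dict simply means "no entry".
function getmeta(t::TinyTable, key::Symbol, default=nothing)
    return t.metadata === nothing ? default : get(t.metadata, key, default)
end

# Writing creates the Dict the first time it is needed.
function setmeta!(t::TinyTable, key::Symbol, value)
    if t.metadata === nothing
        t.metadata = Dict{Symbol, Any}()
    end
    t.metadata[key] = value
    return t
end

t = TinyTable([[1, 2, 3]], [:a])
```

With this layout, tables created in bulk (for example as the result of a by operation) only carry one extra field holding nothing.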

ExpandingMan · June 28, 2018, 5:55pm

I’ve recently been thinking about another reason why I think this should in some part depend on other AbstractDataFrame implementations rather than just adding metadata to DataFrame (perhaps some of both).

I often find it very useful to dispatch on particular “datasets”. Lately I’ve been developing a pattern of having dataset Designators. Functions then dispatch on those and provide some metadata for me, among other things. I was thinking that it would be really cool if it were trivially easy to implement AbstractDataFrames, another abstract type T <: AbstractDataFrame or a parametric type that I can then dispatch on. I’ve been finding patterns like this to be very useful for writing code with easy to follow “pipelines” for reformatting badly formatted data.

kevbonham · June 29, 2018, 10:33am

Can you explain the Designator idea a bit more? I’m intrigued, but don’t think I actually grok what you mean.

ExpandingMan · June 29, 2018, 2:14pm

In a previous life my data was almost always hierarchical, and sometimes I made them into C++ objects equipped with useful methods. Which methods were useful would depend on what the structure is like.

Now, I often have several different tables representing different data. I always have to do a significant amount of processing to get these into a useful form, and what I have to do depends on what the table represents. In the best case scenario, these tables ultimately become more pedestrian Julia objects with some numeric and Array fields and appropriate methods. Part of the reason I have found it so hard to get away from tabular formats is not just that that’s what I’m given, but that it’s usually so difficult to get a straight answer on what the data actually represents and what I can do with it, and the ability to do relational database operations can often partially mitigate the pain.

Anyway, just because the various different datasets aren’t exactly the same doesn’t mean that no code can be shared between them, but in Julia the fact that they are all DataFrames can make it more difficult than it normally would be to make code generic. For example, a typical pattern I often have is to just define lots of functions for loading and pre-processing that look eerily like they would in Python:

function load_dataset_1()
    # code here
end

function load_dataset_2()
    # other code here
end

function do_stuff_to_dataset_1(df::AbstractDataFrame)
    # more code here
end

function do_stuff_to_dataset_2(df::AbstractDataFrame)
    # some other things here
end

This pattern is terrible because it is very non-generic and comes with a very high risk of changes not propagating correctly. However, I’m increasingly replacing this pattern with something like

abstract type Designator end
struct DataSet1 <: Designator 
    # some metadata fields
end
struct DataSet2 <: Designator 
    # some other metadata fields
end

function load(des::Designator)
    df = loadraw(des)  # loads data from somewhere
    df = prep(des, df)  # initial 1-to-1 transformations
    df = aggregate(des, df)  # many-to-1 aggregations (extending DataFrames function)
    df = postprep(des, df)  # final 1-to-1 transformations
end

I find that this is usually much more conducive to writing generic code. In practice I’ve been noticing that I can typically share a great deal of code between loadraw, prep and postprep while aggregate is usually quite different for each dataset (other than always using by). (I’ve also left out joins here which can get a little more complicated.) I’ve also been finding this sort of pattern makes it much easier for me to keep track of what is happening to my data sets and to make fewer mistakes. Exactly what metadata the Designator holds depends on the application, but for example it may contain information needed to locate a file to load, some sort of date range or a database connection string.
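As a rough, self-contained illustration of the pattern described above, the sketch below uses vectors of named tuples instead of DataFrames so that it runs without any packages. The dataset types (Prices, Volumes), the sample rows, and the helper combine_rows are all invented for this example, and the many-to-1 step is called summarize here rather than aggregate to avoid implying anything about the DataFrames function.

```julia
abstract type Designator end

# Hypothetical datasets; the fields play the role of metadata.
struct Prices <: Designator
    currency::String
end
struct Volumes <: Designator end

# Stand-ins for reading a file or querying a database.
loadraw(::Prices)  = [(id = 1, value = "10.5"), (id = 1, value = "11.5"), (id = 2, value = "3.0")]
loadraw(::Volumes) = [(id = 1, value = "100"), (id = 2, value = "250"), (id = 2, value = "50")]

# Shared 1-to-1 step: parse the string column, whatever the dataset.
prep(::Designator, rows) = [(id = r.id, value = parse(Float64, r.value)) for r in rows]

# Helper for the many-to-1 step: reduce the values of each id with `f`.
function combine_rows(f, rows)
    ids = unique(r.id for r in rows)
    return [(id = i, value = f([r.value for r in rows if r.id == i])) for i in ids]
end

# Dataset-specific many-to-1 step: mean for prices, sum for volumes.
summarize(::Prices, rows)  = combine_rows(v -> sum(v) / length(v), rows)
summarize(::Volumes, rows) = combine_rows(sum, rows)

# Generic fallback, overridden only where a dataset needs something extra.
postprep(::Designator, rows) = rows
postprep(des::Prices, rows) = [(id = r.id, value = r.value, currency = des.currency) for r in rows]

# One generic pipeline for every dataset.
load(des::Designator) = postprep(des, summarize(des, prep(des, loadraw(des))))

prices = load(Prices("USD"))
volumes = load(Volumes())
```

Only summarize and one postprep method are dataset-specific; everything else is shared through dispatch on the abstract Designator type.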

If it were easy to create subtypes of AbstractDataFrame, it might be nice to combine this all into a single object. A rough example that I haven’t thought too carefully about:

abstract type ProjectDataFrame <: AbstractDataFrame end
struct DataSet1 <: ProjectDataFrame
    df::DataFrame
    # I don't have any great suggestions yet about how this should work,
    # but presumably this would hold a regular DataFrame among other things
end
struct DataSet2 <: ProjectDataFrame
    df::DataFrame
end

function load(df::ProjectDataFrame) # perhaps start with empty dataframe
    df = loadraw(df) # populate with actual data
    df = prep(df)
    df = aggregate(df)
    df = postprep(df)
end

# this isn't a real suggestion, just a demonstration of the pattern
colmetadata(df, :col1) # how metadata works can now depend on the dataset

This wouldn’t be the only way of doing it. Another possibility would be to have a parametric type like DataFrameWithMetadata{D<:Designator} or some combination thereof. I don’t have strong opinions about what the right way of doing it is as I’ve been developing this pattern relatively recently.
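A minimal sketch of the parametric variant might look like the following. It is only an illustration: TableWithMetadata, SalesData, WeatherData and the colmetadata methods are hypothetical, and the wrapped data is a named tuple of vectors rather than a DataFrame so that the block has no dependencies.

```julia
abstract type Designator end
struct SalesData <: Designator end     # hypothetical dataset designators
struct WeatherData <: Designator end

# `T` is whatever holds the data: a DataFrame in the setting of this thread,
# a NamedTuple of vectors here to keep the sketch dependency-free.
struct TableWithMetadata{D<:Designator, T}
    des::D
    data::T
end

# Hypothetical colmetadata: the generic fallback knows nothing about columns...
colmetadata(::TableWithMetadata, col::Symbol) = "no description for $col"

# ...while a method specialized on the designator type can say more.
colmetadata(::TableWithMetadata{SalesData}, col::Symbol) =
    col === :value ? "price per unit" : "no description for $col"

sales = TableWithMetadata(SalesData(), (value = [1.0, 2.0],))
weather = TableWithMetadata(WeatherData(), (value = [21.5],))
```

Here the designator is a type parameter, so methods can dispatch on TableWithMetadata{SalesData} without a separate struct per dataset.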

So the gist of what I’m saying is this: Yes, it’s nice to have the ability to keep metadata with DataFrames, but it’s even more important to define behavior for tables differently depending on their metadata. Simply keeping some Dicts with your columns is fine but it doesn’t address this issue and I fear may prove to be far too inflexible. As I tried to demonstrate a little bit here, yes, I do always have metadata that I have to keep track of, but it’s hard for me to say a priori how I will use it for any particular data set. I suspect this is part of the reason why including metadata within a data frame object of some kind isn’t a more common pattern.

Again, this approach also has the added benefit of keeping the DataFrame object itself as simple as possible, which has a lot of appeal for me as ultimately I work on numeric problems and I found the simple Julia DataFrames to be a joy to work with compared to the unwieldy pandas equivalents which held a whole bunch of numpy arrays which never seemed to be correctly typed, unhackable C methods and the inability to use generic stdlib tools because they are just too slow. (That’s not to say we’d suddenly have all these problems in Julia if we were to just add metadata, of course we wouldn’t.)

Liso · June 29, 2018, 3:56pm

ExpandingMan:

For example, a typical pattern I often have is to just define lots of functions for loading and pre-processing that look eerily like they would in Python
function load_dataset_1()
    # code here
end

function load_dataset_2()
    # other code here
end

function do_stuff_to_dataset_1(df::AbstractDataFrame)
    # more code here
end

function do_stuff_to_dataset_2(df::AbstractDataFrame)
    # some other things here
end
This pattern is terrible because it is very non-generic and comes with a very high risk of changes not propagating correctly.

Wow! Do you really think this is the Pythonic way to do what you want to do?

ExpandingMan · June 29, 2018, 4:03pm

It seems to be a common type of pattern that I see in Python code, I’m not sure if I’d call it “Pythonic”. It probably would have been better just to say that it is far less generic than Julia code often is. A better way to do things in Python would probably be to create classes that inherit from pandas dataframes, which I have occasionally seen done. I suppose with Python being an OO language inheritance would probably be more “Pythonic” than just writing some functions, whatever that means.

kevbonham · June 29, 2018, 5:36pm

This is a great explanation, thanks! Still trying to imagine the implications, but I think I might try to steal the idea for my own stuff. Really clever way to use Julia’s unique strengths…

ExpandingMan · June 29, 2018, 5:42pm

Yes, please do this. I haven’t really been doing things this way long enough to work out all the implications myself, so having other people doing it and comparing notes would be really good.

My main point was that (for me at least) metadata is as much about behavior as it is about being able to access it!

Liso · June 30, 2018, 8:46am

You’ll see the same in Julia once it becomes more popular and average coders start using it.

Yes. The solution you are proposing here is easy to implement with Python’s OO too. (I’m not saying that is true for other problems or solutions.)

gcalderone · January 14, 2019, 4:39pm

Dear all,
this topic has been inactive for a few months, but I just wanted to add that I implemented a package to solve the first problem posted, namely how to add metadata information to a DataFrame.

The trick is to use composition and the ReusePatterns package to automatically forward all method calls from one type to another. The relevant code is:

using DataFrames, ReusePatterns
struct DataFrameMeta <: AbstractDataFrame
    p::DataFrame
    meta::Dict{String, Any}
    # the Dict key type must match the declared field type
    DataFrameMeta(args...; kw...) = new(DataFrame(args...; kw...), Dict{String, Any}())
    DataFrameMeta(df::DataFrame) = new(df, Dict{String, Any}())
end
meta(d::DataFrameMeta) = getfield(d,:meta)  # <-- new functionality added to DataFrameMeta
@forward((DataFrameMeta, :p), DataFrame)    # <-- reuse all existing functionalities

while the whole example can be found here.

With the above code we can use an object of type DataFrameMeta as if it was a simple DataFrame, but taking advantage of the new meta method to gain access to the associated metadata dictionary.
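For readers who want to see the composition-and-forwarding idea without any packages, here is a hand-rolled sketch. It does not use ReusePatterns and does not reproduce what @forward generates; PlainTable, MetaTable, nrows and column are hypothetical stand-ins chosen for the illustration, and only two functions are forwarded.

```julia
# Hypothetical stand-in for DataFrame so the sketch needs no packages.
struct PlainTable
    columns::Dict{Symbol, Vector{Float64}}
end
nrows(t::PlainTable) = isempty(t.columns) ? 0 : length(first(values(t.columns)))
column(t::PlainTable, name::Symbol) = t.columns[name]

# Composition: the wrapper holds the table plus a metadata dictionary.
struct MetaTable
    p::PlainTable
    meta::Dict{Symbol, Any}
end
MetaTable(p::PlainTable) = MetaTable(p, Dict{Symbol, Any}())

# New functionality that only the wrapper has.
meta(t::MetaTable) = t.meta

# Forwarding: define one method per listed function that delegates to `t.p`.
for f in (:nrows, :column)
    @eval $f(t::MetaTable, args...) = $f(t.p, args...)
end

mt = MetaTable(PlainTable(Dict(:x => [1.0, 2.0, 3.0])))
meta(mt)[:source] = "sensor A"
```

The wrapper can then be passed to nrows and column as if it were the plain table, while meta gives access to the extra dictionary.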

Comments are welcome!

croberts · October 25, 2019, 5:14pm

This may be a basic question…

Suppose I define a new type of dataframe (a Panel dataframe) akin to your DataFrameMeta dataframe. I forward all functions in the DataFrames module to my Panel type using your @forward macro.

I am doing this in a module of my own, and would like to export all of the functions defined by @forward. This way, I can import a Panel module and functions specific for panel dataframes.

I have no idea how to export the functions defined by @forward. I am not even sure if this is a good idea.

xiaodai · October 26, 2019, 2:57am

Would it be better to call it MetaDataFrame, since there is already a DataFramesMeta package?
