How to add metadata info to a DataFrame?

dataframes

#81

I wasn’t arguing with that, I’m just wondering what the best way of going about it is (and not claiming to have the answer). Thanks for doing this, btw.


#82

Thanks

Though I do think that metadata should be an essential part of DataFrames, you bring up a good point: it is currently impossible to implement a new type of AbstractDataFrame. join(::AbstractDataFrame, ::AbstractDataFrame), for example, actually calls the DataFrame constructor. I’m not sure it’s a priority right now to pin down the small set of functions that constitutes the whole AbstractDataFrame API, but it should probably be on the radar.


#83

I think it makes sense to add this to DataFrame, given that the overhead should be negligible when there is no meta-data (and actually even when there is). A custom AbstractDataFrame type would have to duplicate a lot of code just to add this feature.


#84

One argument is that if you use a separate AbstractDataFrame, there’s no limit on what metadata-related features you might add to it, whereas with a DataFrame that may or may not be true. (I suppose that depends on what people are thinking of adding.) Remember that in some cases it’s appropriate to create these things in the millions (e.g. returns from by operations) so even constructors should be kept as simple as possible.


#85

You should take a look at my PR, because constructors are still something to work out. So far all new constructors just have metadata as an empty dict.

Another alternative is to use sparse vectors to store metadata, if it turns out that they take less space. I’d be interested to hear your comments on it on GitHub.


#86

I don’t think that’s a problem. If we set the metadata field to nothing by default, it will have a really low cost. And even if we set it to an empty Dict, it will be quite cheap. (BTW, for by, we should allow returning named tuples or arrays thereof, which will be much more efficient.)
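To make the cost argument concrete, here is a minimal sketch of a metadata field that defaults to nothing and only allocates a Dict on first write. MetaTable, setmeta! and getmeta are hypothetical names for illustration, not part of DataFrames, and the column storage is a plain Dict stand-in rather than a real DataFrame.

```julia
# Hypothetical stand-in for a table type; not the DataFrames implementation.
mutable struct MetaTable
    columns::Dict{Symbol, Vector}
    metadata::Union{Nothing, Dict{String, Any}}  # `nothing` until first used
end

# Default constructor: no metadata Dict is allocated.
MetaTable(columns) = MetaTable(columns, nothing)

# Allocate the Dict lazily, on the first write only.
function setmeta!(t::MetaTable, key::AbstractString, value)
    if t.metadata === nothing
        t.metadata = Dict{String, Any}()
    end
    t.metadata[key] = value
    return t
end

# Reads never allocate; a missing key or missing Dict yields `default`.
getmeta(t::MetaTable, key::AbstractString, default = nothing) =
    t.metadata === nothing ? default : get(t.metadata, key, default)
```

With this layout, tables created in bulk pay only for one extra field holding nothing; the Dict appears only for tables that actually carry metadata.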


#87

I’ve recently been thinking about another reason why I think this should in some part depend on other AbstractDataFrame implementations rather than just adding metadata to DataFrame (perhaps some of both).

I often find it very useful to dispatch on particular “datasets”. Lately I’ve been developing a pattern of having dataset Designators. Functions then dispatch on those and provide some metadata for me, among other things. I was thinking that it would be really cool if it were trivially easy to implement AbstractDataFrames, another abstract type T <: AbstractDataFrame or a parametric type that I can then dispatch on. I’ve been finding patterns like this to be very useful for writing code with easy to follow “pipelines” for reformatting badly formatted data.


#88

Can you explain the Designator idea a bit more? I’m intrigued, but I don’t think I actually grok what you mean.


#89

In a previous life my data was almost always hierarchical, and sometimes I made them into C++ objects equipped with useful methods. Which methods were useful would depend on what the structure is like.

Now, I often have several different tables representing different data. I always have to do a significant amount of processing to get these into a useful form, and what I have to do depends on what the table represents. In the best case, these tables ultimately become more pedestrian Julia objects with some numeric and Array fields and appropriate methods. But part of the reason I have found it so hard to get away from tabular formats is not just that that’s what I’m given. It’s usually so difficult to get a straight answer on what the data actually represents and what I can do with it, and the ability to do relational database operations can often partially mitigate the pain.

Anyway, just because the various datasets aren’t exactly the same doesn’t mean that no code can be shared between them, but in Julia the fact that they are all DataFrames can make it more difficult than it normally would be to make code generic. For example, a typical pattern I often fall into is to just define lots of functions for loading and pre-processing that look eerily like they would in Python:

function load_dataset_1()
    # code here
end

function load_dataset_2()
    # other code here
end

function do_stuff_to_dataset_1(df::AbstractDataFrame)
    # more code here
end

function do_stuff_to_dataset_2(df::AbstractDataFrame)
    # some other things here
end

This pattern is terrible because it is very non-generic and comes with a very high risk of changes not propagating correctly. However, I’m increasingly replacing this pattern with something like

abstract type Designator end
struct DataSet1 <: Designator 
    # some metadata fields
end
struct DataSet2 <: Designator 
    # some other metadata fields
end

function load(des::Designator)
    df = loadraw(des)  # loads data from somewhere
    df = prep(des, df)  # initial 1-to-1 transformations
    df = aggregate(des, df)  # many-to-1 aggregations (extending DataFrames function)
    return postprep(des, df)  # final 1-to-1 transformations
end

I find that this is usually much more conducive to writing generic code. In practice I’ve been noticing that I can typically share a great deal of code between loadraw, prep and postprep while aggregate is usually quite different for each dataset (other than always using by). (I’ve also left out joins here which can get a little more complicated.) I’ve also been finding this sort of pattern makes it much easier for me to keep track of what is happening to my data sets and to make fewer mistakes. Exactly what metadata the Designator holds depends on the application, but for example it may contain information needed to locate a file to load, some sort of date range or a database connection string.
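As a rough, self-contained illustration of how the sharing works out, here is a sketch that swaps DataFrames for plain vectors of named tuples so it runs without any packages. The Sales and Inventory designators, their fields, the inline data and the name aggregate_rows (used instead of aggregate to avoid clashing with DataFrames) are all made up for this example.

```julia
abstract type Designator end

# Hypothetical designators; the fields stand in for "some metadata".
struct Sales <: Designator
    region::String
end
struct Inventory <: Designator
    warehouse::String
end

# Stand-ins for loading from a file or database.
loadraw(::Sales) = [(item = "a", amount = 2.0), (item = "a", amount = 3.0),
                    (item = "b", amount = 1.0)]
loadraw(::Inventory) = [(item = "b", count = 0), (item = "a", count = 4)]

# Shared 1-to-1 steps: one method covers every Designator.
prep(::Designator, rows) = rows
postprep(::Designator, rows) = sort(rows; by = r -> r.item)

# Dataset-specific many-to-1 step: one method per Designator.
function aggregate_rows(::Sales, rows)
    totals = Dict{String, Float64}()
    for r in rows
        totals[r.item] = get(totals, r.item, 0.0) + r.amount
    end
    return [(item = k, amount = v) for (k, v) in totals]
end
aggregate_rows(::Inventory, rows) = filter(r -> r.count > 0, rows)

# The pipeline itself is written once.
function load(des::Designator)
    rows = loadraw(des)
    rows = prep(des, rows)
    rows = aggregate_rows(des, rows)
    return postprep(des, rows)
end
```

Calling load(Sales("north")) and load(Inventory("main")) runs the same pipeline, and only loadraw and aggregate_rows differ between the two datasets. Adding a third dataset means adding a struct and the methods that genuinely differ.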

If it were easy to create subtypes of AbstractDataFrame, it might be nice to combine this all into a single object. A rough example that I haven’t thought too carefully about:

abstract type ProjectDataFrame <: AbstractDataFrame end
struct DataSet1 <: ProjectDataFrame
    df::DataFrame
    # I don't have any great suggestions yet about how this should work,
    # but presumably this would hold a regular DataFrame among other things
end
struct DataSet2 <: ProjectDataFrame
    df::DataFrame
end

function load(df::ProjectDataFrame) # perhaps start with empty dataframe
    df = loadraw(df) # populate with actual data
    df = prep(df)
    df = aggregate(df)
    return postprep(df)
end

# this isn't a real suggestion, just a demonstration of the pattern
colmetadata(df, :col1) # how metadata works can now depend on the dataset

This wouldn’t be the only way of doing it. Another possibility would be to have a parametric type like DataFrameWithMetadata{D<:Designator} or some combination thereof. I don’t have strong opinions about what the right way of doing it is as I’ve been developing this pattern relatively recently.
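A minimal sketch of the parametric variant, assuming a wrapper that holds the designator next to the table. The second type parameter T, the field names and the colunits function are assumptions made for illustration; the table here is a named tuple of vectors rather than a real DataFrame so that the snippet needs no packages.

```julia
abstract type Designator end
struct DataSet1 <: Designator end
struct DataSet2 <: Designator end

# Wrapper parameterised on the designator type (and on the table type T,
# which would be DataFrame in practice).
struct DataFrameWithMetadata{D<:Designator, T}
    des::D
    table::T
end

# Behaviour, not just stored values, now depends on the dataset:
# dispatch picks the method from the designator type parameter.
colunits(::DataFrameWithMetadata{DataSet1}, col::Symbol) =
    col === :price ? "USD" : "unknown"
colunits(::DataFrameWithMetadata{DataSet2}, col::Symbol) =
    col === :price ? "EUR" : "unknown"
```

For example, colunits(DataFrameWithMetadata(DataSet1(), (price = [1.0, 2.0],)), :price) dispatches to the DataSet1 method, while the same call on a DataSet2 wrapper picks the other one.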

So the gist of what I’m saying is this: yes, it’s nice to have the ability to keep metadata with DataFrames, but it’s even more important to be able to define behavior for tables differently depending on their metadata. Simply keeping some Dicts with your columns is fine, but it doesn’t address this issue and I fear it may prove far too inflexible. As I tried to demonstrate a little bit here, I do always have metadata that I have to keep track of, but it’s hard for me to say a priori how I will use it for any particular data set. I suspect this is part of the reason why including metadata within a data frame object of some kind isn’t a more common pattern.

Again, this approach also has the added benefit of keeping the DataFrame object itself as simple as possible, which has a lot of appeal for me as ultimately I work on numeric problems and I found the simple Julia DataFrames to be a joy to work with compared to the unwieldy pandas equivalents which held a whole bunch of numpy arrays which never seemed to be correctly typed, unhackable C methods and the inability to use generic stdlib tools because they are just too slow. (That’s not to say we’d suddenly have all these problems in Julia if we were to just add metadata, of course we wouldn’t.)


#90

Wow! Do you really think this is the Pythonic way to do what you want to do?


#91

It seems to be a common type of pattern that I see in Python code, I’m not sure if I’d call it “Pythonic”. It probably would have been better just to say that it is far less generic than Julia code often is. A better way to do things in Python would probably be to create classes that inherit from pandas dataframes, which I have occasionally seen done. I suppose with Python being an OO language inheritance would probably be more “Pythonic” than just writing some functions, whatever that means.


#92

This is a great explanation, thanks! Still trying to imagine the implications, but I think I might try to steal the idea for my own stuff :laughing:. Really clever way to use Julia’s unique strengths…


#93

Yes, please do this. I haven’t really been doing things this way long enough to work out all the implications myself, so having other people doing it and comparing notes would be really good.

My main point was that (for me at least) metadata is as much about behavior as it is about being able to access it!


#94

You’ll see the same in Julia once it becomes more popular and average coders start using it. :stuck_out_tongue:

Yes. The solution you are proposing here is easily implementable with Python’s OO too. (I’m not saying that holds for other problems or solutions.)


#95

Dear all,
this topic has been inactive for a few months, but I just wanted to add that I implemented a package to solve the first problem posted, namely how to add metadata information to a DataFrame.

The trick is to use composition and the ReusePatterns package to automatically forward all method calls from one type to another. The relevant code is:

using DataFrames, ReusePatterns
struct DataFrameMeta <: AbstractDataFrame
    p::DataFrame
    meta::Dict{String, Any}
    DataFrameMeta(args...; kw...) = new(DataFrame(args...; kw...), Dict{String, Any}())
    DataFrameMeta(df::DataFrame) = new(df, Dict{String, Any}())
end
meta(d::DataFrameMeta) = getfield(d,:meta)  # <-- new functionality added to DataFrameMeta
@forward((DataFrameMeta, :p), DataFrame)    # <-- reuse all existing functionalities

while the whole example can be found here.

With the above code we can use an object of type DataFrameMeta as if it were a plain DataFrame, while using the new meta method to access the associated metadata dictionary.
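For readers who want to see the composition-plus-forwarding idea without installing anything, here is a hand-rolled sketch on a plain Vector. It is not how ReusePatterns implements @forward, which forwards methods automatically; VectorMeta and the short list of forwarded functions are assumptions chosen for illustration.

```julia
# Wrap a Vector and carry a metadata Dict alongside it.
struct VectorMeta{T} <: AbstractVector{T}
    p::Vector{T}
    meta::Dict{String, Any}
end
VectorMeta(v::Vector{T}) where {T} = VectorMeta{T}(v, Dict{String, Any}())

# New functionality added by the wrapper.
meta(v::VectorMeta) = getfield(v, :meta)

# Manually forward a few one-argument Base functions to the wrapped vector.
for f in (:size, :length, :sum)
    @eval Base.$f(v::VectorMeta) = Base.$f(getfield(v, :p))
end

# Indexing is forwarded too; with size and getindex defined, generic
# AbstractVector code (iteration, printing, maximum, ...) works as well.
Base.getindex(v::VectorMeta, i::Int) = getfield(v, :p)[i]
```

After v = VectorMeta([1, 2, 3]), the wrapper behaves like the underlying vector for the forwarded calls, and meta(v)["units"] = "m" attaches metadata without touching the data. The appeal of a macro like @forward is that it saves writing this list by hand for a type with as many methods as DataFrame.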

Comments are welcome!