I wasn’t arguing with that; I’m just wondering what the best way of going about it is (and not claiming to have the answer). Thanks for doing this, btw.
Thanks
Though I do think that metadata should be an essential part of DataFrames, you bring up a good point, in that it is currently impossible to implement a new type of `AbstractDataFrame`. `join(::AbstractDataFrame, ::AbstractDataFrame)`, for example, actually calls the `DataFrame` constructor. I’m not sure it’s a priority right now to clarify a small set of functions that constitute the whole `AbstractDataFrame` API, but it should probably be on the radar.
I think it makes sense to add this to `DataFrame`, given that the overhead should be negligible when there is no metadata (and actually even when there is). A custom `AbstractDataFrame` type would have to duplicate a lot of code just to add this feature.
One argument is that if you use a separate `AbstractDataFrame`, there’s no limit on what metadata-related features you might add to it, whereas with a `DataFrame` that may or may not be true. (I suppose that depends on what people are thinking of adding.) Remember that in some cases it’s appropriate to create these things in the millions (e.g. returns from `by` operations), so even constructors should be kept as simple as possible.
You should take a look at my PR, because constructors are still something to work out. So far all new constructors just have `metadata` as an empty dict.
Another alternative is to use sparse vectors to store metadata, if it turns out that they use less space. I’d be interested to hear your comments on it on GitHub.
I don’t think that’s a problem. If we set the `metadata` field to `nothing` by default, it will have a really low cost. And even if we set it to an empty `Dict`, it will be quite cheap. (BTW, for `by`, we should allow returning named tuples or arrays thereof, which will be much more efficient.)
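To make the cost trade-off concrete, here is a minimal sketch of what a lazily allocated metadata field could look like (hypothetical names, not code from the PR): an untouched table pays only for one `nothing` pointer, and the `Dict` is allocated on first write.

```julia
# Hypothetical sketch of a lazily allocated metadata field.
mutable struct MetaTable
    columns::Vector{Any}
    metadata::Union{Nothing, Dict{Symbol, Any}}
end

MetaTable(columns) = MetaTable(columns, nothing)  # default: no metadata at all

function setmeta!(t::MetaTable, key::Symbol, value)
    # Allocate the Dict only when metadata is first written.
    t.metadata === nothing && (t.metadata = Dict{Symbol, Any}())
    t.metadata[key] = value
end

getmeta(t::MetaTable, key::Symbol, default = nothing) =
    t.metadata === nothing ? default : get(t.metadata, key, default)
```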
I’ve recently been thinking about another reason why I think this should in some part depend on other `AbstractDataFrame` implementations rather than just adding metadata to `DataFrame` (perhaps some of both).
I often find it very useful to dispatch on particular “datasets”. Lately I’ve been developing a pattern of having dataset `Designator`s. Functions then dispatch on those and provide some metadata for me, among other things. I was thinking that it would be really cool if it were trivially easy to implement `AbstractDataFrame`s, another abstract type `T <: AbstractDataFrame`, or a parametric type that I can then dispatch on. I’ve been finding patterns like this to be very useful for writing code with easy-to-follow “pipelines” for reformatting badly formatted data.
Can you explain the `Designator` idea a bit more? I’m intrigued, but don’t think I actually grok what you mean.
In a previous life my data was almost always hierarchical, and sometimes I made it into C++ objects equipped with useful methods. Which methods were useful depended on what the structure was like.
Now, I often have several different tables representing different data. I always have to do a significant amount of processing to get these into a useful form, and what I have to do depends on what the table represents. In the best-case scenario, these tables ultimately become more pedestrian Julia objects with some numeric and `Array` fields and appropriate methods. But part of the reason I have found it so hard to get away from tabular formats is not just that that’s what I’m given: it’s usually so difficult to get a straight answer on what the data actually represents and what I can do with it that the ability to do relational database operations can often partially mitigate the pain.
Anyway, just because the various datasets aren’t exactly the same doesn’t mean that no code can be shared between them, but in Julia the fact that they are all `DataFrame`s can make it more difficult than it normally would be to make code generic. For example, a typical pattern I often have is to just define lots of functions for loading and pre-processing that look eerily like they would in Python:
```julia
function load_dataset_1()
    # code here
end

function load_dataset_2()
    # other code here
end

function do_stuff_to_dataset_1(df::AbstractDataFrame)
    # more code here
end

function do_stuff_to_dataset_2(df::AbstractDataFrame)
    # some other things here
end
```
This pattern is terrible because it is very non-generic and comes with a very high risk of changes not propagating correctly. However, I’m increasingly replacing this pattern with something like:

```julia
abstract type Designator end

struct DataSet1 <: Designator
    # some metadata fields
end

struct DataSet2 <: Designator
    # some other metadata fields
end

function load(des::Designator)
    df = loadraw(des)        # loads data from somewhere
    df = prep(des, df)       # initial 1-to-1 transformations
    df = aggregate(des, df)  # many-to-1 aggregations (extending the DataFrames function)
    df = postprep(des, df)   # final 1-to-1 transformations
end
```
I find that this is usually much more conducive to writing generic code. In practice I’ve been noticing that I can typically share a great deal of code between `loadraw`, `prep`, and `postprep`, while `aggregate` is usually quite different for each dataset (other than always using `by`). (I’ve also left out `join`s here, which can get a little more complicated.) I’ve also been finding this sort of pattern makes it much easier for me to keep track of what is happening to my data sets and to make fewer mistakes. Exactly what metadata the `Designator` holds depends on the application, but for example it may contain information needed to locate a file to load, some sort of date range, or a database connection string.
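To illustrate the sharing, here is a hedged sketch (the column names and the `by` calls are placeholder assumptions, not part of the pattern itself): shared steps get a single fallback method on the abstract type, and only the step that genuinely differs gets a per-dataset method.

```julia
using DataFrames

# Shared behavior: one generic fallback defined on the abstract type.
prep(::Designator, df::AbstractDataFrame) = dropmissing(df)

# Dataset-specific behavior: only `aggregate` is specialized per dataset.
aggregate(::DataSet1, df::AbstractDataFrame) = by(df, :id, total = :value => sum)
aggregate(::DataSet2, df::AbstractDataFrame) = by(df, :group, n = :value => length)
```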
If it were easy to create subtypes of `AbstractDataFrame`, it might be nice to combine this all into a single object. A rough example that I haven’t thought too carefully about:
```julia
abstract type ProjectDataFrame <: AbstractDataFrame end

struct DataSet1 <: ProjectDataFrame
    df::DataFrame
    # I don't have any great suggestions yet about how this should work,
    # but presumably this would hold a regular DataFrame among other things
end

struct DataSet2 <: ProjectDataFrame
    df::DataFrame
end

function load(df::ProjectDataFrame) # perhaps start with an empty dataframe
    df = loadraw(df)   # populate with actual data
    df = prep(df)
    df = aggregate(df)
    df = postprep(df)
end

# this isn't a real suggestion, just a demonstration of the pattern
colmetadata(df, :col1) # how metadata works can now depend on the dataset
```
This wouldn’t be the only way of doing it. Another possibility would be to have a parametric type like `DataFrameWithMetadata{D<:Designator}` or some combination thereof. I don’t have strong opinions about what the right way of doing it is, as I’ve been developing this pattern relatively recently.
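Roughly, the parametric variant might look like this (a sketch only; `DataFrameWithMetadata` and `describe_source` are hypothetical names used to show the dispatch):

```julia
# The designator travels in the type parameter, so methods can
# dispatch on it without threading a separate argument around.
struct DataFrameWithMetadata{D<:Designator}
    des::D
    df::DataFrame
end

# Behavior can now differ per dataset via the type parameter:
describe_source(::DataFrameWithMetadata{DataSet1}) = "loaded from flat files"
describe_source(::DataFrameWithMetadata{DataSet2}) = "loaded from a database"
```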
So the gist of what I’m saying is this: yes, it’s nice to have the ability to keep metadata with `DataFrame`s, but it’s even more important to be able to define behavior for tables differently depending on their metadata. Simply keeping some `Dict`s with your columns is fine, but it doesn’t address this issue, and I fear it may prove to be far too inflexible. As I tried to demonstrate a little bit here, yes, I do always have metadata that I have to keep track of, but it’s hard for me to say a priori how I will use it for any particular data set. I suspect this is part of the reason why including metadata within a data frame object of some kind isn’t a more common pattern.
Again, this approach has the added benefit of keeping the `DataFrame` object itself as simple as possible, which has a lot of appeal for me. Ultimately I work on numeric problems, and I found the simple Julia `DataFrame`s a joy to work with compared to the unwieldy pandas equivalents, which held a whole bunch of numpy arrays that never seemed to be correctly typed, had unhackable C methods, and made it impossible to use generic stdlib tools because they are just too slow. (That’s not to say we’d suddenly have all these problems in Julia if we were to just add metadata; of course we wouldn’t.)
Wow! Do you really think this is the Pythonic way to do what you want to do?
It seems to be a common type of pattern that I see in Python code; I’m not sure if I’d call it “Pythonic”. It probably would have been better just to say that it is far less generic than Julia code often is. A better way to do things in Python would probably be to create classes that inherit from pandas dataframes, which I have occasionally seen done. I suppose that, with Python being an OO language, inheritance would probably be more “Pythonic” than just writing some functions, whatever that means.
This is a great explanation, thanks! Still trying to imagine the implications, but I think I might try to steal the idea for my own stuff. Really clever way to use Julia’s unique strengths…
Yes, please do this. I haven’t really been doing things this way long enough to work out all the implications myself, so having other people do it and comparing notes would be really good.
My main point was that (for me at least) metadata is as much about behavior as it is about being able to access it!
You’ll really see the same in Julia when it becomes more popular and average coders use it.
Yes. The solution you are proposing here is easily implementable in Python’s OO too. (I don’t say that this holds for other problems or solutions.)
Dear all,
this topic has been inactive for a few months, but I just wanted to add that I implemented a package to solve the first problem posted, namely how to attach metadata to a DataFrame.
The trick is to use composition and the ReusePatterns package to automatically forward all method calls from one type to another. The relevant code is:
```julia
using DataFrames, ReusePatterns

struct DataFrameMeta <: AbstractDataFrame
    p::DataFrame
    meta::Dict{String, Any}
    # Note: the constructors must create a Dict{String, Any}
    # to match the field type declared above.
    DataFrameMeta(args...; kw...) = new(DataFrame(args...; kw...), Dict{String, Any}())
    DataFrameMeta(df::DataFrame) = new(df, Dict{String, Any}())
end

meta(d::DataFrameMeta) = getfield(d, :meta)  # <-- new functionality added to DataFrameMeta

@forward((DataFrameMeta, :p), DataFrame)     # <-- reuse all existing functionality
```
while the whole example can be found here.
With the above code we can use an object of type `DataFrameMeta` as if it were a simple `DataFrame`, while taking advantage of the new `meta` method to gain access to the associated metadata dictionary.
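A quick usage sketch, assuming the definitions above (exactly which `DataFrame` methods are forwarded depends on what `@forward` generates):

```julia
dfm = DataFrameMeta(a = 1:3, b = ["x", "y", "z"])

# Metadata travels with the table:
meta(dfm)["source"] = "sensor A"

# Forwarded methods behave as they would on a plain DataFrame:
nrow(dfm)            # => 3
first(dfm, 2)        # first two rows

meta(dfm)["source"]  # => "sensor A"
```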
Comments are welcome!
This may be a basic question…
Suppose I define a new type of dataframe (a Panel dataframe) akin to your DataFrameMeta. I forward all functions in the DataFrames module to my Panel type using your @forward macro.
I am doing this in a module of my own, and I would like to export all of the functions defined by @forward. This way, I can import a Panel module and get functions specific to panel dataframes.
I have no idea how to export the functions defined by @forward. I am not even sure if this is a good idea.
Would it be better to call it MetaDataFrame, given that there is already a DataFramesMeta package?