Window like functions in DataFrames

question

#1

I wonder, what is the best way to apply window functions to DataFrames? In simplest case, window functions is a sort of aggregating functions that don’t change number of rows of original dataframe and only adds some new columns. I was able to do what I want with the following code:

df = DataFrame(a = [1, 1, 2, 2], b = [1, 3, 8, 10])
by(df, :a) do d
    DataFrame(b = d[:b], bmax = maximum(d[:b]), bmin = minimum(d[:b]))
end

but it feels very ugly. First of all creation of new DataFrames looks excessive, and secondly creation of column b is very repetitive. If I have one column it’s more or less fine, but in case of 10 or 50 columns retyping them again and again is way too much. I have a feeling that I am missing something obvious here.


#2

I don’t think we have a good solution for this right now, except maybe using DataFramesMeta or Query.

We should be able to get rid of the creation of a DataFrame by using NamedTuple, which will be in Julia 0.7. I guess we could also add an argument to by to keep all columns in the data frame.

See also


#3

Thank you, it’s not a problem to wait until Julia 0.7. Meanwhile DataFramesMeta solution is fine for my current purposes. I’ll leave it here in case anyone is interested.

using DataFramesMeta

df = DataFrame(a = [1, 1, 2, 2], b = [1, 3, 8, 10])
@transform(groupby(mdf, :a), bmax = maximum(:b), bmin = minimum(:b))

#4

Actually, usability should be better with DataFramesMeta and Query now and in the long term, since their macros allow doing things which are not possible with the standard Julia evaluation rules (like referring to variables with just a symbol). There’s nothing wrong with using them instead of the basic DataFrames API, which is likely to remain relatively limited anyway for this reason.


#5

I only just now realized that DataFramesMeta is being worked on again. I’m very excited about this, it’s a very nice package! Thanks to everyone who was involved in getting it to work with DataFrames 0.11.

I think it would be nice if DataFramesMeta figured even more prominently in the DataFrames documentation.


#6

I’m afraid there is also no good solution for this in Query.jl. I’ve been thinking about it a bit and it is on my radar (https://github.com/davidanthoff/Query.jl/issues/126) but I’m just not sure yet how to fit it in, so I wouldn’t expect anything anytime soon.


#7

I’m a little unclear on why using by as suggested in the original post is inadequate. Is there some performance issue with this? My impression was that DataFrames are sufficiently lightweight that returning a large number of them from a groupby operation is fairly benign (certainly my own experience would indicate this). Using by also has the advantage of being completely obvious, whereas adding dedicated functions for this would add some complexity to the API.

Perhaps the issue is having to explicitly write existing columns in the output DataFrames? A simple solution to this would be to have some DataFrame constructors that take SubDataFrame as an argument.