Window like functions in DataFrames

I wonder, what is the best way to apply window functions to DataFrames? In simplest case, window functions is a sort of aggregating functions that don’t change number of rows of original dataframe and only adds some new columns. I was able to do what I want with the following code:

df = DataFrame(a = [1, 1, 2, 2], b = [1, 3, 8, 10])
by(df, :a) do d
    DataFrame(b = d[:b], bmax = maximum(d[:b]), bmin = minimum(d[:b]))
end

but it feels very ugly. First of all creation of new DataFrames looks excessive, and secondly creation of column b is very repetitive. If I have one column it’s more or less fine, but in case of 10 or 50 columns retyping them again and again is way too much. I have a feeling that I am missing something obvious here.

I don’t think we have a good solution for this right now, except maybe using DataFramesMeta or Query.

We should be able to get rid of the creation of a DataFrame by using NamedTuple, which will be in Julia 0.7. I guess we could also add an argument to by to keep all columns in the data frame.

See also
https://github.com/JuliaData/DataFrames.jl/issues/1256

1 Like

Thank you, it’s not a problem to wait until Julia 0.7. Meanwhile DataFramesMeta solution is fine for my current purposes. I’ll leave it here in case anyone is interested.

using DataFramesMeta

df = DataFrame(a = [1, 1, 2, 2], b = [1, 3, 8, 10])
@transform(groupby(mdf, :a), bmax = maximum(:b), bmin = minimum(:b))
1 Like

Actually, usability should be better with DataFramesMeta and Query now and in the long term, since their macros allow doing things which are not possible with the standard Julia evaluation rules (like referring to variables with just a symbol). There’s nothing wrong with using them instead of the basic DataFrames API, which is likely to remain relatively limited anyway for this reason.

I only just now realized that DataFramesMeta is being worked on again. I’m very excited about this, it’s a very nice package! Thanks to everyone who was involved in getting it to work with DataFrames 0.11.

I think it would be nice if DataFramesMeta figured even more prominently in the DataFrames documentation.

I’m afraid there is also no good solution for this in Query.jl. I’ve been thinking about it a bit and it is on my radar (Investigate windows/rolling functions · Issue #126 · queryverse/Query.jl · GitHub) but I’m just not sure yet how to fit it in, so I wouldn’t expect anything anytime soon.

I’m a little unclear on why using by as suggested in the original post is inadequate. Is there some performance issue with this? My impression was that DataFrames are sufficiently lightweight that returning a large number of them from a groupby operation is fairly benign (certainly my own experience would indicate this). Using by also has the advantage of being completely obvious, whereas adding dedicated functions for this would add some complexity to the API.

Perhaps the issue is having to explicitly write existing columns in the output DataFrames? A simple solution to this would be to have some DataFrame constructors that take SubDataFrame as an argument.

Is there any updated way to do window functions in DataFrames?

To adapt Skoffer’s example I basically want to do something like this:

function min_to_date_twolag(x)
    xx = x[max(1,length(x)-1):length(x)]
    return minimum(xx)
end
using DataFramesMeta
df = DataFrame(a = [1, 1, 1, 2, 2, 2], date = [1,2,3,1,2,3], b = b = [0,4,8, 4,8,2])
desired_result = @transform(groupby(df, :a), :min_recently = min_to_date_twolag(:b))

Here I am doing a window function min_to_date_twolag which takes the minimum of the current value and the previous value for each particular group. I want the desired result to come out something like the below:

desired_result = DataFrame(a = [1, 1, 1, 2, 2, 2], date = [1,2,3,1,2,3], b = [0,4,8, 4,8,2],
                min_recently = [0,0,4, 4,4,2])

However the above code doesn’t work as it takes the entire array (in the b column for each group) rather than the values in a window around each row.

Like this?

julia> using RollingFunctions

julia> transform!(groupby(df, :a), :b => (x -> running(minimum, x, 2)) => :min_recently)
6×4 DataFrame
 Row │ a      date   b      min_recently 
     │ Int64  Int64  Int64  Float64      
─────┼───────────────────────────────────
   1 │     1      1      0           0.0
   2 │     1      2      4           0.0
   3 │     1      3      8           4.0
   4 │     2      1      4           4.0
   5 │     2      2      8           4.0
   6 │     2      3      2           2.0
3 Likes

And a package-less alternative:

min_to_date_lag(x, lag) = [minimum(view(x,max(1,i-lag+1):i)) for i in axes(x,1)]

transform!(groupby(df, :a), :b => (x -> min_to_date_lag(x,2)) => :min_recently)
2 Likes

A variation of @rafael.guerra’s answer:

min_cur_prev(x) = min.(x, [x[begin]; x[begin:end-1]])

transform(groupby(df, :a), :b => min_cur_prev => :min_recently)
2 Likes