Window like functions in DataFrames

Skoffer · December 21, 2017, 11:27am

I wonder, what is the best way to apply window functions to DataFrames? In simplest case, window functions is a sort of aggregating functions that don’t change number of rows of original dataframe and only adds some new columns. I was able to do what I want with the following code:

df = DataFrame(a = [1, 1, 2, 2], b = [1, 3, 8, 10])
by(df, :a) do d
    DataFrame(b = d[:b], bmax = maximum(d[:b]), bmin = minimum(d[:b]))
end

but it feels very ugly. First of all creation of new DataFrames looks excessive, and secondly creation of column b is very repetitive. If I have one column it’s more or less fine, but in case of 10 or 50 columns retyping them again and again is way too much. I have a feeling that I am missing something obvious here.

nalimilan · December 21, 2017, 11:44am

I don’t think we have a good solution for this right now, except maybe using DataFramesMeta or Query.

We should be able to get rid of the creation of a DataFrame by using NamedTuple, which will be in Julia 0.7. I guess we could also add an argument to by to keep all columns in the data frame.

See also
https://github.com/JuliaData/DataFrames.jl/issues/1256

Skoffer · December 21, 2017, 11:53am

Thank you, it’s not a problem to wait until Julia 0.7. Meanwhile DataFramesMeta solution is fine for my current purposes. I’ll leave it here in case anyone is interested.

using DataFramesMeta

df = DataFrame(a = [1, 1, 2, 2], b = [1, 3, 8, 10])
@transform(groupby(mdf, :a), bmax = maximum(:b), bmin = minimum(:b))

nalimilan · December 21, 2017, 12:03pm

Actually, usability should be better with DataFramesMeta and Query now and in the long term, since their macros allow doing things which are not possible with the standard Julia evaluation rules (like referring to variables with just a symbol). There’s nothing wrong with using them instead of the basic DataFrames API, which is likely to remain relatively limited anyway for this reason.

ExpandingMan · December 21, 2017, 2:31pm

I only just now realized that DataFramesMeta is being worked on again. I’m very excited about this, it’s a very nice package! Thanks to everyone who was involved in getting it to work with DataFrames 0.11.

I think it would be nice if DataFramesMeta figured even more prominently in the DataFrames documentation.

davidanthoff · December 21, 2017, 6:23pm

I’m afraid there is also no good solution for this in Query.jl. I’ve been thinking about it a bit and it is on my radar (Investigate windows/rolling functions · Issue #126 · queryverse/Query.jl · GitHub) but I’m just not sure yet how to fit it in, so I wouldn’t expect anything anytime soon.

ExpandingMan · December 21, 2017, 9:18pm

I’m a little unclear on why using by as suggested in the original post is inadequate. Is there some performance issue with this? My impression was that DataFrames are sufficiently lightweight that returning a large number of them from a groupby operation is fairly benign (certainly my own experience would indicate this). Using by also has the advantage of being completely obvious, whereas adding dedicated functions for this would add some complexity to the API.

Perhaps the issue is having to explicitly write existing columns in the output DataFrames? A simple solution to this would be to have some DataFrame constructors that take SubDataFrame as an argument.

s-baumann · January 27, 2022, 11:33am

Is there any updated way to do window functions in DataFrames?

To adapt Skoffer’s example I basically want to do something like this:

function min_to_date_twolag(x)
    xx = x[max(1,length(x)-1):length(x)]
    return minimum(xx)
end
using DataFramesMeta
df = DataFrame(a = [1, 1, 1, 2, 2, 2], date = [1,2,3,1,2,3], b = b = [0,4,8, 4,8,2])
desired_result = @transform(groupby(df, :a), :min_recently = min_to_date_twolag(:b))

Here I am doing a window function min_to_date_twolag which takes the minimum of the current value and the previous value for each particular group. I want the desired result to come out something like the below:

desired_result = DataFrame(a = [1, 1, 1, 2, 2, 2], date = [1,2,3,1,2,3], b = [0,4,8, 4,8,2],
                min_recently = [0,0,4, 4,4,2])

However the above code doesn’t work as it takes the entire array (in the b column for each group) rather than the values in a window around each row.

nilshg · January 27, 2022, 12:04pm

Like this?

julia> using RollingFunctions

julia> transform!(groupby(df, :a), :b => (x -> running(minimum, x, 2)) => :min_recently)
6×4 DataFrame
 Row │ a      date   b      min_recently 
     │ Int64  Int64  Int64  Float64      
─────┼───────────────────────────────────
   1 │     1      1      0           0.0
   2 │     1      2      4           0.0
   3 │     1      3      8           4.0
   4 │     2      1      4           4.0
   5 │     2      2      8           4.0
   6 │     2      3      2           2.0

rafael.guerra · January 27, 2022, 3:07pm

And a package-less alternative:

min_to_date_lag(x, lag) = [minimum(view(x,max(1,i-lag+1):i)) for i in axes(x,1)]

transform!(groupby(df, :a), :b => (x -> min_to_date_lag(x,2)) => :min_recently)

sijo · January 27, 2022, 3:50pm

A variation of @rafael.guerra’s answer:

min_cur_prev(x) = min.(x, [x[begin]; x[begin:end-1]])

transform(groupby(df, :a), :b => min_cur_prev => :min_recently)

Topic		Replies	Views
How DataFramesMeta @by works if you want to groupby two columns? New to Julia dataframes , dataframesmeta	10	772	October 31, 2022
Help with performance tuning this dataframe aggregation Performance	10	740	September 23, 2018
Adding calculated columns to dataframe New to Julia plotting , dataframes	18	1950	October 30, 2021
Transform in DataFrames General Usage dataframes	13	432	January 21, 2024
Group DataFrames by a function of a column Data package	4	1204	December 11, 2019

Window like functions in DataFrames

Related topics