Query.jl: Group dataframe along two variables

Hello everybody,

I am trying to learn how to use Query.jl more efficiently, because it seems like a great tool once you get the hang of it.

I have the following structure: I have many firms with daily observations. I would like to aggregate a variable (mean and standard deviation) along firm-months. Therefore before applying the mean and the standard deviation, I need to split the dataset by firm-months. One easy possibility is to create a variable firm-month then do something like this:

x = @from i in df begin
    @group i by i.firmmonth into g
    @select {group=key(g), mean_return = mean(g.Ret), std_return = std(g.Ret)}
    @collect DataFrame
end

and afterwards I split the variable “firmmonth” again into two variables. Is there also a way to do the same thing inside one Query.jl block?

Thank you very much!

I see David is already cooking up an answer but just wanted to say that from the next DataFrames release (0.21) select will allow to do this without any need for macros: https://juliadata.github.io/DataFrames.jl/dev/lib/functions/#DataFrames.select

1 Like

Yes, you can do something like this:

@group i by {i.colA, i.ColB} into g

So you are constructing a new named tuple for each row, and then group by that. key(g) will then also return a named tuple, so you would do something like @select {key(g)..., mean_return = mean(g.Ret}). Note how I’m splatting the return value of key(g) here with ....

While the LINQ style syntax that you are using will never go away, most of the new features in Query.jl are coming for the standalone syntax. Rewriting your whole query in that looks like this:

x = df |>
@groupby({_.colA, _.colB}) |>
@map({key(_)..., mean_return = mean(_.Ret)}) |>
DataFrame

I left out the computation of std.

2 Likes

Thanks a lot for the replies. In the end I am doing this:

df_agg_test = @from i in daily_test begin
                     @group i by {i.PERMNO, i.yearmonth} into g
                     @select {group = key(g), monthly_return = 100 .*(prod(skipmissing(g.Ret./100 .+ 1)) .- 1), std_return = std(skipmissing(g.Ret))}
                     @collect DataFrame
end

However, I get the following error:

ERROR: MethodError: no method matching length(::DataValues.DataValue{Float64})
Closest candidates are:
  length(::Core.SimpleVector) at essentials.jl:596
  length(::Base.MethodList) at reflection.jl:852
  length(::Core.MethodTable) at reflection.jl:938
  ...
Stacktrace:
 [1] _similar_for(::UnitRange{Int64}, ::Type{Float64}, ::DataValues.DataValue{Float64}, ::Base.HasLength) at .\array.jl:576
 [2] _collect(::UnitRange{Int64}, ::DataValues.DataValue{Float64}, ::Base.HasEltype, ::Base.HasLength) at .\array.jl:609
 [3] collect(::DataValues.DataValue{Float64}) at .\array.jl:603
 [4] broadcastable(::DataValues.DataValue{Float64}) at .\broadcast.jl:665
 [5] broadcasted at .\broadcast.jl:1235 [inlined]
 [6] (::var"#24#26")(::Grouping{NamedTuple{(:PERMNO, :yearmonth),Tuple{Int64,Date}},NamedTuple{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Date,DataValues.DataValue{Float64},Int64,Int64,Int64,DataValues.DataValue{Float64},DataValues.DataValue{Int64},Date}}}) at .\none:0
 [7] iterate at C:\Users\Ilja\.julia\packages\QueryOperators\g4G21\src\enumerable\enumerable_map.jl:25 [inlined]
 [8] iterate at C:\Users\Ilja\.julia\packages\Tables\okt7x\src\tofromdatavalues.jl:45 [inlined]
 [9] buildcolumns at C:\Users\Ilja\.julia\packages\Tables\okt7x\src\fallbacks.jl:185 [inlined]
 [10] columns at C:\Users\Ilja\.julia\packages\Tables\okt7x\src\fallbacks.jl:237 [inlined]
 [11] DataFrame(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableGroupBySimple{Grouping{NamedTuple{(:PERMNO, :yearmonth),Tuple{Int64,Date}},NamedTuple{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Date,DataValues.DataValue{Float64},Int64,Int64,Int64,DataValues.DataValue{Float64},DataValues.DataValue{Int64},Date}}},NamedTuple{(:PERMNO, :yearmonth),Tuple{Int64,Date}},NamedTuple{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Date,DataValues.DataValue{Float64},Int64,Int64,Int64,DataValues.DataValue{Float64},DataValues.DataValue{Int64},Date}},QueryOperators.EnumerableIterable{NamedTuple{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Date,DataValues.DataValue{Float64},Int64,Int64,Int64,DataValues.DataValue{Float64},DataValues.DataValue{Int64},Date}},Tables.DataValueRowIterator{NamedTuple{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Date,DataValues.DataValue{Float64},Int64,Int64,Int64,DataValues.DataValue{Float64},DataValues.DataValue{Int64},Date}},Tables.Schema{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Date,Union{Missing, Float64},Int64,Int64,Int64,Union{Missing, Float64},Union{Missing, Int64},Date}},Tables.RowIterator{NamedTuple{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Array{Date,1},Array{Union{Missing, Float64},1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Union{Missing, Float64},1},Array{Union{Missing, Int64},1},Array{Date,1}}}}}},var"#23#25"},var"#24#26"}; copycols::Bool) at C:\Users\Ilja\.julia\packages\DataFrames\S3ZFo\src\other\tables.jl:40
 [12] DataFrame(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableGroupBySimple{Grouping{NamedTuple{(:PERMNO, :yearmonth),Tuple{Int64,Date}},NamedTuple{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Date,DataValues.DataValue{Float64},Int64,Int64,Int64,DataValues.DataValue{Float64},DataValues.DataValue{Int64},Date}}},NamedTuple{(:PERMNO, :yearmonth),Tuple{Int64,Date}},NamedTuple{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Date,DataValues.DataValue{Float64},Int64,Int64,Int64,DataValues.DataValue{Float64},DataValues.DataValue{Int64},Date}},QueryOperators.EnumerableIterable{NamedTuple{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Date,DataValues.DataValue{Float64},Int64,Int64,Int64,DataValues.DataValue{Float64},DataValues.DataValue{Int64},Date}},Tables.DataValueRowIterator{NamedTuple{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Date,DataValues.DataValue{Float64},Int64,Int64,Int64,DataValues.DataValue{Float64},DataValues.DataValue{Int64},Date}},Tables.Schema{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Date,Union{Missing, Float64},Int64,Int64,Int64,Union{Missing, Float64},Union{Missing, Int64},Date}},Tables.RowIterator{NamedTuple{(:date, :Ret, :PERMNO, :EX, :SIC, :Prc, :Numtrd, :yearmonth),Tuple{Array{Date,1},Array{Union{Missing, Float64},1},Array{Int64,1},Array{Int64,1},Array{Int64,1},Array{Union{Missing, Float64},1},Array{Union{Missing, Int64},1},Array{Date,1}}}}}},var"#23#25"},var"#24#26"}) at C:\Users\Ilja\.julia\packages\DataFrames\S3ZFo\src\other\tables.jl:31
 [13] top-level scope at REPL[22]:1
 [14] eval(::Module, ::Any) at .\boot.jl:331
 [15] eval_user_input(::Any, ::REPL.REPLBackend) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\REPL\src\REPL.jl:86
 [16] run_backend(::REPL.REPLBackend) at C:\Users\Ilja\.julia\packages\Revise\WkyNB\src\Revise.jl:1023
 [17] top-level scope at none:0
 [18] eval(::Module, ::Any) at .\boot.jl:331
 [19] eval_user_input(::Any, ::REPL.REPLBackend) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\REPL\src\REPL.jl:86
 [20] run_backend(::REPL.REPLBackend) at C:\Users\Ilja\.julia\packages\Revise\WkyNB\src\Revise.jl:1023
 [21] top-level scope at none:0

I got this after I substituted some of the entries in my column :Ret with missing. Can I do something about it or should I just drop the rows with missing?

Query.jl uses DataValue for missing values instead of the missing story from base (the missing design lacks some important properties for something like Query.jl). So inside a Query.jl query you will not see missing values, but instead NA values from DataValues.jl. When you materialize the query result into a DataFrame, you’ll get missing values again. The general philosophy here is that inside of queries we use DataValue, but if you start and end with a data structure that uses missing, we accept that.

The way to fix your query is just to use dropna instead of skipmissing :slight_smile: