dplyr do equivalent in Query.jl standalone syntax?

tlnagy · May 29, 2019, 8:23pm

I love Query.jl’s standalone syntax, but I’m having trouble finding something comparable to dplyr’s do syntax. Essentially, a lot of what I need to do is compute an average/median/whatever of a group, normalize all values in the group to that value, and then return the original dataframe with the normalized-by-group values.

I do something like this currently:

julia> using Query, DataFrames, StatsBase

julia> ex = DataFrame(:a=>[1,2,3,4,5,6,7,8], :b=>repeat([:a, :b], inner=(4)))
8×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ Symbol │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ a      │
│ 3   │ 3     │ a      │
│ 4   │ 4     │ a      │
│ 5   │ 5     │ b      │
│ 6   │ 6     │ b      │
│ 7   │ 7     │ b      │
│ 8   │ 8     │ b      │

julia> result = ex |> 
           @groupby(_.b) |>
           @map({group=key(_), avg=mean(_.a)}) |>
           DataFrame
2×2 DataFrame
│ Row │ group  │ avg     │
│     │ Symbol │ Float64 │
├─────┼────────┼─────────┤
│ 1   │ a      │ 2.5     │
│ 2   │ b      │ 6.5     │

julia> out = join(ex, result, on=[:b=>:group]);

julia> out[:normed] = out[:a] ./ out[:avg];

julia> out
8×4 DataFrame
│ Row │ a     │ b      │ avg     │ normed   │
│     │ Int64 │ Symbol │ Float64 │ Float64  │
├─────┼───────┼────────┼─────────┼──────────┤
│ 1   │ 1     │ a      │ 2.5     │ 0.4      │
│ 2   │ 2     │ a      │ 2.5     │ 0.8      │
│ 3   │ 3     │ a      │ 2.5     │ 1.2      │
│ 4   │ 4     │ a      │ 2.5     │ 1.6      │
│ 5   │ 5     │ b      │ 6.5     │ 0.769231 │
│ 6   │ 6     │ b      │ 6.5     │ 0.923077 │
│ 7   │ 7     │ b      │ 6.5     │ 1.07692  │
│ 8   │ 8     │ b      │ 6.5     │ 1.23077  │

Is there a cleaner way of doing this using Query.jl’s standalone operators?

Zach_Christensen · May 30, 2019, 12:26am

I haven’t played with this much yet, but you could probably use @let multiple times throughout a query, as described here:
https://www.queryverse.org/Query.jl/stable/linqquerycommands/#Range-variables-1

davidanthoff · May 30, 2019, 7:24pm

How about this:

julia> ex |>
       @groupby(_.b) |>
       @map({rows=_, avg=mean(_.a)}) |>
       @mapmany(_.rows, {__..., _.avg, normed = __.a/_.avg})

8x4 query result
a │ b  │ avg │ normed  
──┼────┼─────┼─────────
1 │ :a │ 2.5 │ 0.4
2 │ :a │ 2.5 │ 0.8
3 │ :a │ 2.5 │ 1.2
4 │ :a │ 2.5 │ 1.6     
5 │ :b │ 6.5 │ 0.769231
6 │ :b │ 6.5 │ 0.923077
7 │ :b │ 6.5 │ 1.07692
8 │ :b │ 6.5 │ 1.23077

Let me know if this is clear or whether I should elaborate a bit on what is actually going on there.

tlnagy · May 31, 2019, 2:22am

Whoa, cool. So what does the __ refer to on the @mapmany line? It looks like it’s the original dataframe somehow, but how does it know that that’s what you’re referring to?

davidanthoff · May 31, 2019, 6:19am

If we look at the output just before the @mapmany call, we get this:

julia> ex |>
       @groupby(_.b) |>
       @map({rows=_, avg=mean(_.a)}) 

2x2 query result
rows                                                                                                         │ avg
─────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────
NamedTuple{(:a, :b),Tuple{Int64,Symbol}}[(a = 1, b = :a), (a = 2, b = :a), (a = 3, b = :a), (a = 4, b = :a)] │ 2.5
NamedTuple{(:a, :b),Tuple{Int64,Symbol}}[(a = 5, b = :b), (a = 6, b = :b), (a = 7, b = :b), (a = 8, b = :b)] │ 6.5

So this is a table with two columns: the second column is the average per group, and the first column has a list of all the rows that belong to each group. So we have a sort of list of lists here. The next step is to unpack this list again; you can think of that as some kind of ungroup operation. That is what the @mapmany call does.

I’m using the _ and __ shorthand syntax for anonymous functions here, so the line

@mapmany(_.rows, {__..., _.avg, normed = __.a/_.avg})

could also be written as

@mapmany(i -> i.rows, (i,j) -> {j..., i.avg, normed = j.a/i.avg})

in standard julia notation.

What does this @mapmany call do? The first argument is an anonymous function that will be called for each group and needs to return a collection for each group. In this case it returns the rows that are in a single group. @mapmany will then call the second anonymous function for each item in the collection that was returned by the first anonymous function. The call to this second anonymous function will take the group as argument i, and the individual row as argument j. We then construct a new named tuple, where we first splat all the columns from the original row in as the first set of columns (using the j... syntax), and then we add two more columns.
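To make the two-function structure concrete, here is a minimal plain-Julia sketch of the same idea. The `mapmany` helper below is hypothetical and written only for illustration; it is not Query.jl's implementation, and the grouping is done by hand rather than with @groupby.

```julia
using Statistics  # standard library, provides mean

# Hypothetical helper (not part of Query.jl): a plain-Julia stand-in for @mapmany.
# collection_selector maps each outer item to a collection;
# result_selector combines the outer item with each inner item.
function mapmany(collection_selector, result_selector, items)
    out = []
    for i in items
        for j in collection_selector(i)
            push!(out, result_selector(i, j))
        end
    end
    return out
end

# The example table as a vector of named tuples.
rows = [(a = n, b = n <= 4 ? :a : :b) for n in 1:8]

# Group by hand: one (rows, avg) named tuple per distinct value of b.
groups = map(unique(r.b for r in rows)) do key
    members = filter(r -> r.b == key, rows)
    (rows = members, avg = mean(r.a for r in members))
end

# Same shape as @mapmany(_.rows, {__..., _.avg, normed = __.a/_.avg}):
# i is the group, j is one row of that group.
result = mapmany(i -> i.rows,
                 (i, j) -> merge(j, (avg = i.avg, normed = j.a / i.avg)),
                 groups)
```

Each element of `result` is a named tuple with the fields a, b, avg and normed, one per original row.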

tlnagy · May 31, 2019, 10:45pm

Thanks David! That makes a lot of sense. I read the docs for @mapmany and felt like it might be what I needed, but the examples didn’t give me a good intuition for what was going on. I think an example like this in the docs for @mapmany would be very helpful to have!

affans · May 31, 2019, 11:20pm

Hi @davidanthoff, what is the purpose of the curly brackets in the @map? I can’t find it in the documentation.

tlnagy · June 2, 2019, 7:40pm

Another fun thing I was doing is filtering rows based on the properties of their groups and I ran into the following error:

julia> ex = DataFrame(:a=>[1,2,3,4,5,6,7,8], :b=>repeat([:a, :b, :c, :d], inner=(2)))
8×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ Symbol │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ a      │
│ 3   │ 3     │ b      │
│ 4   │ 4     │ b      │
│ 5   │ 5     │ c      │
│ 6   │ 6     │ c      │
│ 7   │ 7     │ d      │
│ 8   │ 8     │ d      │

julia> ex |>
       @groupby(_.b) |>
       @map({rows=_, avg=mean(_.a)}) |>
       @filter(_.avg > 2) |>
       @mapmany(_.rows, {__...}) |>
       DataFrame
ERROR: ArgumentError: unable to construct DataFrame from QueryOperators.EnumerableMapMany{Tuple{Int64,Vararg{Union{Int64, Symbol},N} where N},QueryOperators.EnumerableIterable{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableFilter{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableIterable{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableMap{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableIterable{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},QueryOperators.EnumerableGroupBy{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,Symbol}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,Symbol}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Symbol,1}}}}}},getfield(Main, Symbol("##52#62")),getfield(Main, Symbol("##53#63"))}},getfield(Main, Symbol("##55#65"))}},getfield(Main, Symbol("##57#67"))}},getfield(Main, Symbol("##59#69")),getfield(Main, Symbol("##60#70"))}
Stacktrace:
 [1] DataFrame(::QueryOperators.EnumerableMapMany{Tuple{Int64,Vararg{Union{Int64, Symbol},N} where N},QueryOperators.EnumerableIterable{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableFilter{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableIterable{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableMap{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableIterable{Grouping{Symbol,NamedTup

This only seems to happen when I splat and don’t include any new columns because if I include a new column (like in your example) it works great:

julia> ex |>
       @groupby(_.b) |>
       @map({rows=_, avg=mean(_.a)}) |>
       @filter(_.avg > 2) |>
       @mapmany(_.rows, {__..., _.avg}) |>
       DataFrame
6×3 DataFrame
│ Row │ a     │ b      │ avg     │
│     │ Int64 │ Symbol │ Float64 │
├─────┼───────┼────────┼─────────┤
│ 1   │ 3     │ b      │ 3.5     │
│ 2   │ 4     │ b      │ 3.5     │
│ 3   │ 5     │ c      │ 5.5     │
│ 4   │ 6     │ c      │ 5.5     │
│ 5   │ 7     │ d      │ 7.5     │
│ 6   │ 8     │ d      │ 7.5     │

davidanthoff · June 2, 2019, 8:25pm

There is a brief mention of it here, and I just created an item to create proper docs for it here.

The curly brackets are an alternative syntax to create a named tuple. Relative to the normal syntax, it provides a couple of enhancements, mainly that it can auto-name columns. For example {_.a, _.b} is equivalent to (a=_.a, b=_.b).
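As a rough analogue in base Julia (this is not Query.jl's {} syntax, only a comparison), named tuples can also infer field names from property accesses since Julia 1.5. The `row` value below is a made-up example.

```julia
# A made-up row, standing in for the `_` of a query.
row = (a = 1, b = :x, c = 2.0)

# Names written out by hand.
explicit = (a = row.a, b = row.b)

# Names inferred from the field accesses (base Julia 1.5 and later).
implicit = (; row.a, row.b)
```

Both forms produce a named tuple with the fields a and b, which is the same auto-naming idea that {_.a, _.b} provides inside a query.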

davidanthoff · June 2, 2019, 8:29pm

That strikes me as a bug, I’m tracking it here for now.

There is a simple workaround, though. There is no need to splat just one named tuple into a new one, you can just write for that one line

@mapmany(_.rows, __)

That is equivalent to writing

@mapmany(i->i.rows, (i,j)->j)

And because the j here is already a named tuple, you can just return that directly and it should all work.
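In plain-Julia terms, the workaround amounts to flattening the kept groups without rebuilding each row. The sketch below assumes the table is held as a vector of named tuples and does the grouping by hand; it illustrates the logic and does not use Query.jl itself.

```julia
using Statistics  # standard library, provides mean

# The second example table as a vector of named tuples.
rows = [(a = i, b = s) for (i, s) in zip(1:8, repeat([:a, :b, :c, :d], inner = 2))]

# Group by hand: one vector of rows per distinct value of b.
groups = [filter(r -> r.b == key, rows) for key in unique(r.b for r in rows)]

# Keep only groups whose mean of a exceeds 2 (the @filter step).
kept_groups = filter(g -> mean(r.a for r in g) > 2, groups)

# Flatten back to rows, returning each row unchanged (the (i,j)->j step).
kept = reduce(vcat, kept_groups)
```

Here `kept` holds the six rows of groups b, c and d; group a is dropped because its mean is 1.5.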

Thibaut_Lamadon · March 31, 2021, 11:34pm

I have to do things like this all the time. Here is my proposal:

ex |> 
    @groupby(_.b) |>
    @map( DataFrame(
                  a=_.a, b=_.b, 
                  avg = mean(_.a), 
                  normed = _.a ./ mean(_.a) )) |> 
    (x -> reduce(vcat,x))

Inside the map I create a DataFrame so that I can store the entire vectors and repeat single values as needed using the default constructor. I then vcat the list of DataFrames using reduce. There might be a more elegant way to combine a list of tuples of vectors into a DataFrame. Would be happy to see it!

For info, here is the same using DataFrames functions:

transform( 
   groupby(ex,:b), 
   :a => (x -> mean(x)) => :avg, 
   :a => (x -> x ./ mean(x)) => :normed
)

or with DataFramesMeta.jl:

using DataFramesMeta
@linq ex |> 
    groupby(:b) |> 
    transform( avg = mean(:a), normed =  :a ./ mean(:a))
