Dpylr do equivalent in query.jl standalone syntax?

I love Query.jl’s standalone syntax, but I’m having trouble with finding something comparable to dplyr’s do syntax. Essentially a lot of what I need to do is compute an average/median/whatever of a group and then normalize all values in the group to that value and then return the original dataframe with the normalized-by-group values.

I do something like this currently:

julia> using Query, DataFrames, StatsBase

julia> ex = DataFrame(:a=>[1,2,3,4,5,6,7,8], :b=>repeat([:a, :b], inner=(4)))
8×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ Symbol │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ a      │
│ 3   │ 3     │ a      │
│ 4   │ 4     │ a      │
│ 5   │ 5     │ b      │
│ 6   │ 6     │ b      │
│ 7   │ 7     │ b      │
│ 8   │ 8     │ b      │

julia> result = ex |> 
           @groupby(_.b) |>
           @map({group=key(_), avg=mean(_.a)}) |>
           DataFrame
2×2 DataFrame
│ Row │ group  │ avg     │
│     │ Symbol │ Float64 │
├─────┼────────┼─────────┤
│ 1   │ a      │ 2.5     │
│ 2   │ b      │ 6.5     │

julia> out = join(ex, result, on=[:b=>:group]);

julia> out[:normed] = out[:a] ./ out[:avg];

julia> out
8×4 DataFrame
│ Row │ a     │ b      │ avg     │ normed   │
│     │ Int64 │ Symbol │ Float64 │ Float64  │
├─────┼───────┼────────┼─────────┼──────────┤
│ 1   │ 1     │ a      │ 2.5     │ 0.4      │
│ 2   │ 2     │ a      │ 2.5     │ 0.8      │
│ 3   │ 3     │ a      │ 2.5     │ 1.2      │
│ 4   │ 4     │ a      │ 2.5     │ 1.6      │
│ 5   │ 5     │ b      │ 6.5     │ 0.769231 │
│ 6   │ 6     │ b      │ 6.5     │ 0.923077 │
│ 7   │ 7     │ b      │ 6.5     │ 1.07692  │
│ 8   │ 8     │ b      │ 6.5     │ 1.23077  │

Is there a cleaner way of doing this using Query.jl’s standalone operators?

1 Like

I haven’t played with this much yet but you could probably use the @let multiple times throughout a query such as here
https://www.queryverse.org/Query.jl/stable/linqquerycommands/#Range-variables-1

How about this:

julia> ex |>
       @groupby(_.b) |>
       @map({rows=_, avg=mean(_.a)}) |>
       @mapmany(_.rows, {__..., _.avg, normed = __.a/_.avg})

8x4 query result
a │ b  │ avg │ normed  
──┼────┼─────┼─────────
1 │ :a │ 2.5 │ 0.4
2 │ :a │ 2.5 │ 0.8
3 │ :a │ 2.5 │ 1.2
4 │ :a │ 2.5 │ 1.6     
5 │ :b │ 6.5 │ 0.769231
6 │ :b │ 6.5 │ 0.923077
7 │ :b │ 6.5 │ 1.07692
8 │ :b │ 6.5 │ 1.23077

Let me know if this is clear or whether I should elaborate a bit what is actually going on there :slight_smile:

5 Likes

Whoa cool. So what does the _ _ refer to on the @mapmany line? It looks like it’s the original dataframe somehow, but how does it know that that’s what you’re referring to?

If we look at the output just before the @mapmany call, we get this:

julia> ex |>
       @groupby(_.b) |>
       @map({rows=_, avg=mean(_.a)}) 

2x2 query result
rows                                                                                                         │ avg
─────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────
NamedTuple{(:a, :b),Tuple{Int64,Symbol}}[(a = 1, b = :a), (a = 2, b = :a), (a = 3, b = :a), (a = 4, b = :a)] │ 2.5
NamedTuple{(:a, :b),Tuple{Int64,Symbol}}[(a = 5, b = :b), (a = 6, b = :b), (a = 7, b = :b), (a = 8, b = :b)] │ 6.5

So this is a table with two columns, the second column is the average per group, and the first column has a list of all the rows that belong into each group. So we have a sort of list of lists here. The next step then is to unpack this list again, you can think of that as some kind of ungroup operation. That is what the @mapmany call does.

I’m using this syntax for anonymous functions here, so the line

@mapmany(_.rows, {__..., _.avg, normed = __.a/_.avg})

could also be written as

@mapmany(i -> i.rows, (i,j) -> {j..., i.avg, normed = j.a/i.avg})

in standard julia notation.

What does this @mapmany call do? The first argument is an anonymous function that will be called for each group, and then needs to return a collection for each group. In this case it returns each the rows that are in a single group. @mapmany will then call the second anonymous function for each item in the collection that was returned by this first anonymous function. The call to this second anonymous function will take the group as argument i, and the individual row as argument j. We then construct a new named tuple, where first splat all the columns from the original row in as the first set of columns (using the j... syntax), and then we add two more columns.

6 Likes

Thanks David! That makes a lot of sense. I read the docs for @mapmany and felt like it might be what I needed, but the examples didn’t give me a good intuition for what was going on. I think an example like this in the docs for @mapmany would be very helpful to have!

2 Likes

Hi @davidanthoff, what is the purpose of the curly brakcets in the @map? I can’t find in the documentation.

Another fun thing I was doing is filtering rows based on the properties of their groups and I ran into the following error:

julia> ex = DataFrame(:a=>[1,2,3,4,5,6,7,8], :b=>repeat([:a, :b, :c, :d], inner=(2)))
8×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ Symbol │
├─────┼───────┼────────┤
│ 1   │ 1     │ a      │
│ 2   │ 2     │ a      │
│ 3   │ 3     │ b      │
│ 4   │ 4     │ b      │
│ 5   │ 5     │ c      │
│ 6   │ 6     │ c      │
│ 7   │ 7     │ d      │
│ 8   │ 8     │ d      │

julia> ex |>
              @groupby(_.b) |>
              @map({rows=_, avg=mean(_.a)})|>
               @filter(_.avg > 2) |>
               @mapmany(_.rows, {__...}) |>
               DataFrame
ERROR: ArgumentError: unable to construct DataFrame from QueryOperators.EnumerableMapMany{Tuple{Int64,Vararg{Union{Int64, Symbol},N} where N},QueryOperators.EnumerableIterable{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableFilter{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableIterable{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableMap{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableIterable{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},QueryOperators.EnumerableGroupBy{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}},QueryOperators.EnumerableIterable{NamedTuple{(:a, :b),Tuple{Int64,Symbol}},Tables.DataValueRowIterator{NamedTuple{(:a, :b),Tuple{Int64,Symbol}},Tables.RowIterator{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Symbol,1}}}}}},getfield(Main, Symbol("##52#62")),getfield(Main, Symbol("##53#63"))}},getfield(Main, Symbol("##55#65"))}},getfield(Main, Symbol("##57#67"))}},getfield(Main, Symbol("##59#69")),getfield(Main, Symbol("##60#70"))}
Stacktrace:
 [1] DataFrame(::QueryOperators.EnumerableMapMany{Tuple{Int64,Vararg{Union{Int64, Symbol},N} where N},QueryOperators.EnumerableIterable{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableFilter{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableIterable{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableMap{NamedTuple{(:rows, :avg),Tuple{Grouping{Symbol,NamedTuple{(:a, :b),Tuple{Int64,Symbol}}},Float64}},QueryOperators.EnumerableIterable{Grouping{Symbol,NamedTup

This only seems to happen when I splat and don’t include any new columns because if I include a new column (like in your example) it works great:

julia> ex |>
              @groupby(_.b) |>
              @map({rows=_, avg=mean(_.a)})|>
               @filter(_.avg > 2) |>
               @mapmany(_.rows, {__..., _.avg}) |>
               DataFrame
6×3 DataFrame
│ Row │ a     │ b      │ avg     │
│     │ Int64 │ Symbol │ Float64 │
├─────┼───────┼────────┼─────────┤
│ 1   │ 3     │ b      │ 3.5     │
│ 2   │ 4     │ b      │ 3.5     │
│ 3   │ 5     │ c      │ 5.5     │
│ 4   │ 6     │ c      │ 5.5     │
│ 5   │ 7     │ d      │ 7.5     │
│ 6   │ 8     │ d      │ 7.5     │


There is a brief mention of if here, and I just created an item to create proper docs for it here.

The curly brackets are an alternative syntax to create a named tuple. Relative to the normal syntax, it provides a couple of enhancements, mainly that it can auto-name columns. For example {_.a, _.b} is equivalent to (a=_.a, b=_.b).

1 Like

That strikes me as a bug, I’m tracking it here for now.

There is a simple workaround, though. There is no need to splat just one named tuple into a new one, you can just write for that one line

@mapmany(_.rows, __)

That is equivalent to writing

@mapmany(i->i.rows, (i,j)->j)

And because the j here is already a named tuple, you can just return that directly and it should all work.

1 Like

I have to do things like this all the time. Here is my proposal:

ex |> 
    @groupby(_.b) |>
    @map( DataFrame(
                  a=_.a, b=_.b, 
                  avg = mean(_.a), 
                  normed = _.a ./ mean(_.a) )) |> 
    (x -> reduce(vcat,x))

inside the map I create a DataFrame so that I can store the entire vectors and repeat single values as needed using the default constructor. I then vcat the list of DataFrame using reduce. There might be a more elegant way to combine a list of tuple of vectors to a DataFrame. Would be happy to see it!

For info, here is the same using DataFrames functions:

transform( 
   groupby(ex,:b), 
   :a => (x -> mean(x)) => :avg, 
   :a => (x -> x ./ mean(x)) => :normed
) 

or with DataFramesMeta.jl

using DataFramesMeta
@linq ex |> 
    groupby(:b) |> 
    transform( avg = mean(:a), normed =  :a ./ mean(:a))