Simplifying this dataframe aggregation operation

Sometimes I struggle to think of a compact way of doing things. Well, most times.

I have a dataframe of election results for a given election year:

year  state   office           district     party          votecount    name
2012   AL    representative        1       Republican        99000    Bob Smith

It’s got every state, office, district, party. I have these going back to 1970. There is one district for president or senator, but potentially many districts for representative. I keep year so that when I recombine or use the results elsewhere, the year is there. Soon, I’ll get the presidential votes broken down by district, too.

I want an output dataframe that has just one row per office per district: the row for the top vote-getter in that office/district. I couldn't think of how to do it with the DataFramesMeta package.

So, I wrote a function:

function maxvoteperoffice(df)
    result = DataFrame([Int, String, Int, String, String, Int],
            [:Year, :State, :CD, :Office, :Party, :Votecount], 0)
    # Sort descending on :Votecount within each State/Office/CD group,
    # so the first row of each group is the winner.
    sort!(df, cols = (:State, :Office, :CD, :Votecount), rev = (false, false, false, true))
    prevstate = ""
    prevoffice = ""
    prevcd = 0
    for r in eachrow(df)
        if prevstate == r.State && prevoffice == r.Office && prevcd == r.CD
            continue   # still inside the same group: skip the losers
        else
            # First row of a new group: record the winner.
            push!(result, [r.Year, r.State, r.CD, r.Office, r.Party, r.Votecount])
            prevstate = r.State
            prevoffice = r.Office
            prevcd = r.CD
        end
    end
    return result
end

Not being very clever, I sorted the thing (fast: we're talking 1,500 to 2,000 rows for the input df). Then I grab the votecount in the first row for each state/office/district, and keep skipping until I reach a new state/office/district.

It's not horribly slow at 0.02 seconds. Sorting isn't a real cost either: any grouping strategy relies on sorting. But it's a bit clumsy and not at all general.

Is there a clever way using the database-oriented functions/macros for dataframes? You don't have to rewrite this; just point me in the right direction with the crucial one-liner or approach. I thought a pipeline of @linq operations would do it, but I couldn't figure out how to get the maximum of a group.

I won’t die if this goes a-begging. The function works.


I think you want something like a transform on a grouped dataframe.

df2 = @linq df |>
    groupby([:State, :Office]) |>
    transform(m = maximum(:votecount)) |>
    where(:votecount .== :m)
1 Like

That's awesome. I can't say I really get the semantics of the macros; not much like SQL at all. Actually, staring at it a bit, it gets clearer: just a funny order for the clauses.

But, with a few minor tweaks it worked the first time:

@time df = @linq res2002 |>
       groupby([:State,:Office,:CD]) |>
       transform(m=maximum(:Votecount)) |>
       where(:Votecount .== :m);

It's 5x slower than the function, which is interesting but doesn't really matter given its compactness and how easily it handles various other aggregations.

I see you're using @time. Be aware that this includes compile time; if you want a more accurate measurement, use BenchmarkTools' @btime, which makes sure not to include compilation time.

I would be surprised if, after using BenchmarkTools to time in a way that accounts for compilation, the function version you gave above were much faster. The push! statements in your original function are likely very allocation-heavy, requiring Julia to resize arrays as the result grows.
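To make the distinction concrete, here is a minimal sketch of the two timing approaches (assuming BenchmarkTools is installed; `f` and `v` are just placeholder names, not from the thread):

```julia
using BenchmarkTools

f(x) = sum(abs2, x)   # a stand-in for whatever you want to time
v = rand(1000)

@time f(v)    # first call: includes JIT compilation of f
@time f(v)    # second call: closer to the steady-state cost
@btime f($v)  # runs many samples, reports the minimum, excludes compilation
```

Note the `$v` interpolation in `@btime`: it keeps the benchmark from measuring global-variable access on top of the function itself.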

The grouped transform operation in DataFramesMeta is also likely to get faster in future releases, so its performance will continue to improve.

Thanks for the correction on using @btime. Sometimes it is interesting to include compile time. With functions you can run twice and use @time on the second call, though that doesn't work for top-level statements in the REPL. @btime takes longer to do its job, since it runs multiple samples.

You are indeed correct about timing. I was comparing a compiled function to a top-level statement.

Here are the results:

@btime df = @linq res2012 |>
              groupby([:State,:Office,:CD]) |>
              transform(m=maximum(:Votecount)) |>
              where(:Votecount .== :m);
  2.307 ms (27962 allocations: 1.14 MiB)

julia> @btime df = maxvoteperoffice(res2012);
  16.226 ms (121143 allocations: 3.45 MiB)

So the function with push! has zillions of allocations and takes about 7x longer. Now I have a succinct, sensible expression of the aggregation query, and it's faster too.

Thanks for the great advice!

+1 vote for DataFramesMeta solutions. I use them a ton and they tend to work very quickly.

Here is how one could do this with Query.jl:

res2012 |>
    @groupby((_.State, _.Office, _.CD)) |>
    @map(_[argmax(_.Votecount)]) |>
    DataFrame

I’d be quite interested how performance compares, could you maybe run that query and report back what timing you get for that?

When it needs to reallocate it doubles the size of the underlying buffer.
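To see what that geometric growth costs, here is a small stdlib-only sketch: `sizehint!` reserves capacity up front, so the repeated `push!` calls below rarely need to reallocate (the sizes are arbitrary):

```julia
# Without a hint, push! grows the buffer geometrically, copying on each resize.
# sizehint! reserves capacity once, so these pushes rarely reallocate.
v = Int[]
sizehint!(v, 10_000)
for i in 1:10_000
    push!(v, i)
end
```

A DataFrame's columns are ordinary vectors underneath, so the same principle applies to the `result` built up in the original function.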


Here you go:

Query.jl:


@btime max2012 = res2012 |>
           @groupby((_.State, _.Office, _.CD)) |>
           @map(_[argmax(_.Votecount)]) |>
           DataFrame;

  1.477 ms (27491 allocations: 1.19 MiB)

DataFramesMeta.jl:


@btime max2012prime = @linq res2012 |>
                  groupby([:State,:Office,:CD]) |>
                  transform(m=maximum(:Votecount)) |>
                  where(:Votecount .== :m);

  2.314 ms (27961 allocations: 1.13 MiB)

Query is 36% faster, though the absolute time difference is small. The input df is 1928 x 7 and the result is 518 x 7, so the problem size is not particularly big. Note that the DataFramesMeta approach creates an extra column of Float64, which probably accounts for some of the extra time and certainly the allocations. It's not clear whether that can be avoided by finishing with an intermediate result and applying argmax.

I had looked briefly at Query, and it was not clear it was up to 0.7, but there were no problems adding, precompiling, compiling, or running. Still, I would never have been able to figure this out myself. What's with the _.Column syntax? And the tuple argument to @groupby? And argmax not quite matching ordinary Julia argmax on arrays, though perhaps that is in my head: I suppose it returns, within each group, the index of that group's maximum. The usage of @map is documented as @map(source, element_selector), but your usage seems to be @map(element_selector). I see in the experimental features that the underscore means an anonymous function, so why does column access require an anonymous function? When you give me the code, it is beautiful. I could never have gotten there myself.

Frankly, both approaches are pretty opaque, and Query much more so. There is a reason more people prefer Python or R for data-wrangling despite Julia's better performance. If we expect non-CS graduates to use Julia (I mean intelligent people trying to solve problems in their respective domains), then we have A LOT of work to do. Functional composition is wonderfully versatile and inherently difficult.

Too much novel syntax and way too many ultra-specialized types. I had never seen a GroupedDataFrame or a SubDataFrame before. Sure, it's much more efficient not to instantiate a whole new dataframe using a bunch of memory for essentially duplicate data, and that's amazing for scalability. But it is a LOT of mental overhead, and a lot of work for package developers writing functions that either recognize the type or use the appropriate supertype to generalize. (My least favorite thing in Julia is SubString: by default I should just get an array of String back from split. If I am splitting a Shakespeare concordance, then I can supply the non-default argument to create the lazy, and brilliant, iterator.)

I saw no documentation in the DataFrames docs about GroupedDataFrame, except an unexplained mention in the types section. It's very cool, but how do you reference a subgroup? This works: mmm[1] gets the first subgroup by ordinal position. But that's not useful, since the subgroups are defined by column values. What's the (as always, undocumented) syntax? I tried mmm[:Office .== "president" .& :State .== "AL" .& :CD .== 99], but this didn't match the method getindex(::GroupedDataFrame, ::AbstractArray{Bool,N} where N).

Sorry for being a bit grumpy after you've all been very helpful. I just don't know how anyone who didn't write the code is expected to understand and use it. There is a big gap in successful evangelism: we need LOTS more CLEAR documentation and books. The developers will have to help authors understand the code and develop examples. It's not the same work of genius as creating the code in the first place, but it is the only way to make the code approachable and expand the user base. Expanding the user base is a new phase.


Maybe you could start by contributing these findings to the documentation of both packages.

I don’t even know a fifth of what is required. I would need to interview the developers and get them to do a bunch of quick examples.

I don’t understand this claim. The dplyr version of this would look almost identical:

ans <- res2012 %>%
    group_by(State, Office, CD) %>%
    mutate(m = max(Votecount)) %>%
    filter(Votecount == m)

You are correct that the docs on grouping operations should be improved. But the end user doesn't really need to think about what a grouped dataframe is. As in R, they only need to know the behavior when they see a groupby in a chaining block.

In JuliaDBMeta this would be (unfortunately I have to use the conditional because it doesn't work yet on Julia 1.0):

@groupby t (:State, :Office, :CD) _[argmax(:Votecount)]

I have never used R, but I wouldn't say Julia is particularly opaque as far as data manipulation is concerned: there are many options, which can be a bit overwhelming at first, but once you settle on one and get to know it, it works quite well.

Concerning the documentation, I find that the (potentially inexperienced) user is in some cases in a better position to write it than the developer, since he or she knows the pain points and obscure parts that need a more careful explanation. After spending months or years on a code base, it's really hard to know which parts are intuitive and which are not.

@lewis, I agree with absolutely everything you say about documentation etc. We have a lot of work ahead of us and are nowhere close to a really easy solution at this point! All of this is on my roadmap, but no promises about when I’ll actually get to it myself. Help in the form of PRs (or even feedback like your post here!) is always welcome!

Alright, now let's try to unpack the Query.jl example.

First, the use of the underscore. We can rewrite the whole query without it in the following way:

res2012 |>
    @groupby(i -> (i.State, i.Office, i.CD)) |>
    @map(i -> i[argmax(i.Votecount)]) |>
    DataFrame

So both @map and @groupby actually need an anonymous function as their argument, and the _ syntax is just a shortcut for writing an anonymous function that saves a few characters. Essentially, if you use _ inside any of the Query.jl macros, you don't have to write the i -> part of the anonymous function, and instead of i the argument is named _.

In general, the anonymous functions you pass to these various query operators get applied to each element from the input sequence one by one. So if you start with a table and do a @groupby, then the i -> (i.State, i.Office, i.CD) anonymous function will be applied to each row of the input table. So the i (or the _ if you use the shorter syntax) represents one row from the input table. i.Foo then access the value from the Foo column from that row.

In the case of this @groupby example, you essentially return a tuple with three elements for each row of the input table. All rows that return the same tuple, will get put into the same group.

Alright, next step. What do we get out of the @groupby step? We get a sequence of groups. Each group is essentially an array of those elements that make up a group. So if you start with a table, you get a sequence of arrays where the elements in each array are rows. Another way to say it is: this is sequence of sequences of rows. So, the @map query operator gets a sequence of groups/arrays. That means that the i (or _ if you use the shortcut syntax) now stands for a group. And that means it is actually just an array of rows, in our case.

Next, let's look at the i.Votecount part. If you have a group of rows, which you can think of as an array of rows (or an array of named tuples), then that is really just a table. And if you use the .Foo syntax on a table (not on a row, like in the @groupby case), then it will return the whole column. So the i.Votecount (or _.Votecount) syntax will extract all the Votecount values for a given group as a vector. So we are just calling argmax on a vector of numbers here. argmax returns the index of the element with the largest value. And now we are indexing into i (or _) with that index value. Remember that i here is an array of rows, so we are actually extracting one specific row from each group. What is happening overall is that the @map returns one row from the original table for each of the groups that we created.

This whole part is not ideal, in my mind. If this gets added to Base, it could be written as @map(maximum(_, by = j -> j.Votecount)), which I find a bit cleaner.
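For anyone who wants to see the group-then-argmax logic stripped of any query machinery, here is a plain-Julia sketch over rows as named tuples (the data and names are made up for illustration):

```julia
# Toy rows standing in for the election dataframe.
rows = [
    (State = "AL", Office = "representative", CD = 1, Votecount = 99_000, Name = "Bob Smith"),
    (State = "AL", Office = "representative", CD = 1, Votecount = 87_000, Name = "Jane Doe"),
    (State = "AL", Office = "representative", CD = 2, Votecount = 50_000, Name = "Al Jones"),
]

# Group rows by the (State, Office, CD) key, as @groupby does.
groups = Dict{Tuple{String,String,Int},Vector{eltype(rows)}}()
for r in rows
    push!(get!(groups, (r.State, r.Office, r.CD), eltype(rows)[]), r)
end

# Within each group, keep the row with the largest Votecount, as the @map does.
winners = [g[argmax([r.Votecount for r in g])] for g in values(groups)]
```

Each group here is literally a `Vector` of rows, which is why indexing it with the result of `argmax` picks out one winning row per group.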


Also note that I’ve posted an issue at DataFramesMeta so that grouped operations and filtering work like this as well.


You guys are great. I’ll see what I can do to help. David, thanks for the patient breakdown of Query.jl’s API and genauguy for also explaining his approach.

I spent some time understanding this better. I tried using map and combine, but wow, that is sloooowww. I used a comprehension to pull out the maximum index for each group, which was not too awful. Then I had to apply that index within each group and "smash" (technical term) all the subgroups back together. The smashing together is wicked slow because of allocations. The "obvious" solution was not to smash anything back together, but rather to convert the per-group index into an index that works across the original dataframe. That was not too hard.

Here is a more general function whose time is in between the two packages (Query.jl < mvp() < DataFramesMeta), though with fewer allocations than both:

function mvp(df::DataFrame, grpcols::Array{Symbol,1}, compcol::Symbol)
    df[index_bygrouptoall(groupby(df, grpcols), compcol), :]
end

function index_bygrouptoall(grpdf::GroupedDataFrame, compcol::Symbol)
    hi = size(grpdf.ends, 1)
    ret = zeros(Int, hi)
    # Index of the maximum within each subgroup...
    index_grp = [argmax(grpdf[i][compcol]) for (i, _) in enumerate(grpdf)]
    # ...offset by each subgroup's start to index into the original dataframe.
    ret[2:end] = index_grp[2:end] .+ grpdf.starts[2:end] .- 1
    ret[1] = index_grp[1]
    return ret
end

@btime mvp(res2012, [:State, :Office, :CD], :Votecount);
  2.046 ms (23765 allocations: 903.22 KiB)

I am not at all saying that this functional approach is better than either of the query macros for dataframes; it just decomposes what is going on. In index_bygrouptoall there might be something better than the comprehension for getting all of the by-group indexes. map doesn't work here because it returns a

GroupApplied{DataFrame}(GroupedDataFrame  518 groups with keys: Symbol[:State, :Office, :CD]

object, and I think perhaps combine is the only function that has a method for it. I tried vcat; it won't work. combine does the right thing, but the whole operation runs in a bit less than a second: 400 times slower!

It's a shame the aggregate function doesn't work here. I think maximum throws it for a loop, because we don't want all the rows of each subgroup returned: we want only ONE row per subgroup. Or, more likely, I used aggregate incorrectly.

I guess a final step would be to do this with a pipe, either with a macro or in a do block.

It seems the one slight functional advantage (ignoring syntax entirely) of the Query.jl approach is that ordinary Julia functions can be used within the steps, avoiding the creation of the extra column for the max.

Making this more accessible is some combination of explaining more, simplifying syntax, and perhaps adding a function or two that handle cross-row operations. In old-fashioned "business intelligence" or "data mining," the distinguishing feature of the specialized products, such as Microsoft's SQLBI, over straight SQL queries is the ability to provide inter-row operations. That's exactly what we need in my narrow example: the ability to reduce the number of rows based on the result of some (probably logical) operation.

Not quite sure where to go from here; possibly doing some examples for both Query.jl and DataFramesMeta.jl, documenting them, and providing PRs to the docs of both.

Maybe joining forces would help? Without debating the relative merits of either, both address needed functionality for data-wrangling in Julia. With so much more to do (giving credit for already amazing progress), sharing the work on similar goals makes sense and frees up some energy for any number of other cool, useful things.

Well, gosh. We really don’t need the outer function call after all:

res = @btime res2012[index_bygrouptoall(groupby(res2012,[:State,:Office,:CD]),:Votecount),:];
  1.997 ms (23765 allocations: 903.22 KiB)

I just need to add a function argument to index_bygrouptoall. Better look at the source for your query macros… you were already there!

Could you clarify more what you were expecting? Is there a way to do the operation you wanted in pandas or sql, for example, that you find particularly useful?

Fair enough. I pointed out in my original post that the DataFramesMeta approach was clearer than Query.jl. R's benefit is broader usage more than anything else. Look, everyone has been really helpful, and Julia and many of its packages have made wonderful progress. My comment was that Python and R are more likely to be chosen by "domain experts" than Julia. Everyone gets to choose from a wealth of wonderful tools, and Julia could certainly earn those choices more often with a bit more focus on user evangelism. I hope no contributor feels underappreciated; that was not my intent. All the suggestions I received were significantly better than my own first effort.
