DataFrame `by` function error


#1

My understanding is that function f will get an AbstractDataFrame and I can do whatever with it. However, it fails when I just return the input. Why? What am I missing?

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8));

by(df, :a, d -> d)

WARNING: Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically.
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] #add_names#18(::Bool, ::Function, ::DataFrames.Index, ::DataFrames.Index) at /Users/tomkwong/.julia/v0.6/DataFrames/src/other/index.jl:190
 [3] (::DataFrames.#kw##add_names)(::Array{Any,1}, ::DataFrames.#add_names, ::DataFrames.Index, ::DataFrames.Index) at ./<missing>:0
 [4] #hcat!#71(::Bool, ::Function, ::DataFrames.DataFrame, ::DataFrames.DataFrame) at /Users/tomkwong/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:840
 [5] (::DataFrames.#kw##hcat!)(::Array{Any,1}, ::DataFrames.#hcat!, ::DataFrames.DataFrame, ::DataFrames.DataFrame) at ./<missing>:0
 [6] combine(::DataFrames.GroupApplied{DataFrames.SubDataFrame{Array{Int64,1}}}) at /Users/tomkwong/.julia/v0.6/DataFrames/src/groupeddataframe/grouping.jl:202
 [7] by(::DataFrames.DataFrame, ::Symbol, ::Function) at /Users/tomkwong/.julia/v0.6/DataFrames/src/groupeddataframe/grouping.jl:293

#2

How does it “fail”? For me, it just gives the warning above, but returns

8×4 DataFrames.DataFrame
│ Row │ a │ a_1 │ b │ c         │
├─────┼───┼─────┼───┼───────────┤
│ 1   │ 1 │ 1   │ 2 │ 0.140573  │
│ 2   │ 1 │ 1   │ 2 │ -1.87573  │
│ 3   │ 2 │ 2   │ 1 │ -0.793972 │
│ 4   │ 2 │ 2   │ 1 │ -0.339131 │
│ 5   │ 3 │ 3   │ 2 │ -1.68218  │
│ 6   │ 3 │ 3   │ 2 │ 0.169798  │
│ 7   │ 4 │ 4   │ 1 │ -0.111596 │
│ 8   │ 4 │ 4   │ 1 │ 0.353701  │

This is probably because by splits the data by the given columns, applies the transformation to each group, and then combines the results with the grouping columns. From the docstring,

For a DataFrame, cols are combined along columns with the resulting DataFrame.

so the no-op would be

by(df, [:a], d -> d[:, [:b, :c]])

#3

In JuliaDB the default is to only pass the “non grouping” columns to the anonymous function, but I’m not sure whether there are strong reasons in favor of one option or the other. Otherwise, having some syntactic sugar to select everything except the grouping columns may help.
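
Until such sugar exists, a small helper can stand in for it. The following is a minimal sketch, assuming the DataFrames version used in this thread (Julia 0.6, where by is available and names returns symbols); nongrouping is a hypothetical name, not part of DataFrames:

```julia
using DataFrames

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))

# Hypothetical helper (not part of DataFrames): keep every column of d
# except the grouping columns listed in groupcols.
nongrouping(d, groupcols) = d[:, setdiff(names(d), groupcols)]

# by adds the grouping column :a itself, so returning only the
# remaining columns avoids the duplicate-name warning.
res = by(df, :a, d -> nongrouping(d, [:a]))
```

This generalizes the d[:, [:b, :c]] workaround so the non-grouping columns do not have to be listed by hand.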


#4

It was a surprise to me because I thought the semantics were to summarize each group into a value that fits in a single cell. In the above example, I was expecting only 4 rows to be returned, with the new column containing individual data frames as its elements.

I like the fact that the new column can contain objects, e.g. passing x -> size(x) would return a tuple in that column. The current behavior seems inconsistent when the function returns a DataFrame.
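
The two behaviors can be compared side by side. This is a minimal sketch, assuming the same DataFrames version as in the original post; it uses size(d, 1) (an integer row count) rather than size(d) so that the per-group result is a plain scalar:

```julia
using DataFrames

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))

# Function returns a scalar: the result has one row per group.
counts = by(df, :a, d -> size(d, 1))

# Function returns a DataFrame: its rows are stacked, one block per group.
stacked = by(df, :a, d -> d[:, [:b, :c]])

println(size(counts, 1))
println(size(stacked, 1))
```

With four distinct values of a and eight input rows, the first call should give 4 rows and the second 8, which is the asymmetry described above.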


#5

I think you’re only seeing this awkward thing with the column names. The problem is that by always tries to construct a dataframe that includes the columns that were grouped by, so if you return a DataFrame for each group, it will wind up trying to duplicate those columns. There really should be a keyword argument that tells it not to include the grouped by columns; perhaps it’s worth thinking about more carefully and opening an issue.


#6

Just for reference, the JuliaDB implementation uses the select keyword for this. groupby(df, by, select = Not(by)) is the default and groupby(df, by, select = All()) would select all columns. To use grouping columns in the inner function the recommended approach is (keep in mind that groupby in JuliaDB is the same as by in DataFrames):

groupby(df, by, usekey=true) do key, dd
....
end

#7

I think this should be the default.


#8

That makes sense. Feel free to file an issue/PR. It shouldn’t be hard to do, but we’d better have a quick look at what dplyr and Pandas do too.


#9

I’ve thought about this a bit more. I think that what we should aim for is for

by(identity, df, cols)

to return the original DataFrame except possibly for ordering. This would also agree with @Tamas_Papp’s suggestion that the default behavior is not to try to insert the grouped by columns.

I’ll try to make a PR, but I’m not sure when I’ll get to it.