DataFrame `by` function error

tk3369 · May 31, 2018, 5:39am

My understanding is that the function passed to by receives an AbstractDataFrame for each group, and that I can do whatever I want with it. However, it fails when I just return the input. Why? What am I missing?

df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8));

by(df, :a, d -> d)

WARNING: Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically.
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] #add_names#18(::Bool, ::Function, ::DataFrames.Index, ::DataFrames.Index) at /Users/tomkwong/.julia/v0.6/DataFrames/src/other/index.jl:190
 [3] (::DataFrames.#kw##add_names)(::Array{Any,1}, ::DataFrames.#add_names, ::DataFrames.Index, ::DataFrames.Index) at ./<missing>:0
 [4] #hcat!#71(::Bool, ::Function, ::DataFrames.DataFrame, ::DataFrames.DataFrame) at /Users/tomkwong/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:840
 [5] (::DataFrames.#kw##hcat!)(::Array{Any,1}, ::DataFrames.#hcat!, ::DataFrames.DataFrame, ::DataFrames.DataFrame) at ./<missing>:0
 [6] combine(::DataFrames.GroupApplied{DataFrames.SubDataFrame{Array{Int64,1}}}) at /Users/tomkwong/.julia/v0.6/DataFrames/src/groupeddataframe/grouping.jl:202
 [7] by(::DataFrames.DataFrame, ::Symbol, ::Function) at /Users/tomkwong/.julia/v0.6/DataFrames/src/groupeddataframe/grouping.jl:293

Tamas_Papp · May 31, 2018, 8:39am

How does it “fail”? For me, it just gives the warning above, but returns

8×4 DataFrames.DataFrame
│ Row │ a │ a_1 │ b │ c         │
├─────┼───┼─────┼───┼───────────┤
│ 1   │ 1 │ 1   │ 2 │ 0.140573  │
│ 2   │ 1 │ 1   │ 2 │ -1.87573  │
│ 3   │ 2 │ 2   │ 1 │ -0.793972 │
│ 4   │ 2 │ 2   │ 1 │ -0.339131 │
│ 5   │ 3 │ 3   │ 2 │ -1.68218  │
│ 6   │ 3 │ 3   │ 2 │ 0.169798  │
│ 7   │ 4 │ 4   │ 1 │ -0.111596 │
│ 8   │ 4 │ 4   │ 1 │ 0.353701  │

This is probably because by splits on the given columns, applies the transformation to each group, and then combines the results with the grouping columns. From the docstring,

For a DataFrame, cols are combined along columns with the resulting DataFrame.

so the no-op would be

by(df, [:a], d -> d[:, [:b, :c]])
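
To make the name collision concrete, here is a toy sketch in plain Julia (no DataFrames). It only models column names: the assumption, based on the hcat! and add_names frames in the stack trace above, is that the combine step puts the grouping column names in front of whatever names the function returns. The variables below are made up for illustration and are not part of the DataFrames API.

```julia
# Toy model only: plain name vectors stand in for DataFrame columns.
key_names = [:a]                       # grouping columns prepended by the combine step
group_row = (a = 1, b = 2, c = 0.14)   # one row of a group, as seen by `d -> d`

# Returning the whole group keeps :a, so the combined names collide.
returned_names = collect(keys(group_row))
combined = vcat(key_names, returned_names)
has_duplicates = !allunique(combined)

# Dropping the grouping column first, as in d[:, [:b, :c]], avoids the collision.
trimmed_names = filter(n -> !(n in key_names), returned_names)
combined_trimmed = vcat(key_names, trimmed_names)
```

In this model the first case yields a repeated :a, which matches the a and a_1 columns in the output above, while the trimmed case yields each name once.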

piever · May 31, 2018, 10:11am

In JuliaDB the default is to pass only the “non-grouping” columns to the anonymous function, but I’m not sure whether there are strong reasons in favor of one option or the other. Otherwise, having some syntactic sugar to select everything except the grouping columns may help.

tk3369 · May 31, 2018, 4:32pm

It was a surprise to me because I thought the semantics were to summarize each group into a value that fits in a single cell. In the above example, I was expecting only 4 rows to be returned, with the new column containing the individual data frames as its elements.

I like the fact that the new column can contain objects, e.g. passing x -> size(x) would return a tuple in that column. The current behavior seems inconsistent when the function returns a DataFrame.

ExpandingMan · May 31, 2018, 4:50pm

I think the only awkward thing you’re seeing here is the column names. The problem is that by always tries to construct a DataFrame that includes the columns that were grouped by, so if you return a DataFrame for each group, it ends up duplicating those columns. There really should be a keyword argument that tells it not to include the grouped-by columns. Perhaps it’s worth thinking about this more carefully and opening an issue.

piever · May 31, 2018, 5:31pm

Just for reference, the JuliaDB implementation uses the select keyword for this. groupby(df, by, select = Not(by)) is the default and groupby(df, by, select = All()) would select all columns. To use grouping columns in the inner function the recommended approach is (keep in mind that groupby in JuliaDB is the same as by in DataFrames):

groupby(df, by, usekey=true) do key, dd
....
end

Tamas_Papp · June 1, 2018, 9:17am

I think this should be the default.

nalimilan · June 1, 2018, 10:01am

That makes sense. Feel free to file an issue/PR. It shouldn’t be hard to do, but we’d better have a quick look at what dplyr and Pandas do too.

ExpandingMan · June 1, 2018, 2:00pm

I’ve thought about this a bit more. I think that what we should aim for is for

by(identity, df, cols)

to return the original DataFrame, except possibly for row ordering. This would also agree with @Tamas_Papp’s suggestion that the default behavior should be not to insert the grouped-by columns.
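
As a rough illustration of that goal, here is a toy version in plain Julia, where a vector of NamedTuples stands in for a DataFrame. toy_by is a hypothetical helper written for this post, not the DataFrames by; it assumes the combine step simply concatenates whatever the function returns and does not prepend the grouping columns.

```julia
# A vector of NamedTuple rows stands in for a DataFrame.
rows = [(a = 1, b = 2, c = 0.5), (a = 2, b = 1, c = -1.0),
        (a = 1, b = 2, c = 1.5), (a = 2, b = 1, c = 0.25)]

# Hypothetical helper, not the DataFrames API: split on `key`, apply `f`
# to each group, and concatenate the results without re-adding the key.
function toy_by(f, rows, key::Symbol)
    group_keys = unique(r[key] for r in rows)
    groups = [filter(r -> r[key] == k, rows) for k in group_keys]
    return reduce(vcat, [f(g) for g in groups])
end

result = toy_by(identity, rows, :a)

# Same rows as the input, possibly in a different order.
same_rows = sort(result, by = Tuple) == sort(rows, by = Tuple)
```

With identity as the function, the toy result has the same rows as the input, grouped by key, which is the invariant being proposed.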

I’ll try to make a PR, but I’m not sure when I’ll get to it.
