My understanding is that function f will get an AbstractDataFrame and I can do whatever with it. However, it fails when I just return the input. Why? What am I missing?
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8));
by(df, :a, d -> d)
WARNING: Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically.
Stacktrace:
[1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
[2] #add_names#18(::Bool, ::Function, ::DataFrames.Index, ::DataFrames.Index) at /Users/tomkwong/.julia/v0.6/DataFrames/src/other/index.jl:190
[3] (::DataFrames.#kw##add_names)(::Array{Any,1}, ::DataFrames.#add_names, ::DataFrames.Index, ::DataFrames.Index) at ./<missing>:0
[4] #hcat!#71(::Bool, ::Function, ::DataFrames.DataFrame, ::DataFrames.DataFrame) at /Users/tomkwong/.julia/v0.6/DataFrames/src/dataframe/dataframe.jl:840
[5] (::DataFrames.#kw##hcat!)(::Array{Any,1}, ::DataFrames.#hcat!, ::DataFrames.DataFrame, ::DataFrames.DataFrame) at ./<missing>:0
[6] combine(::DataFrames.GroupApplied{DataFrames.SubDataFrame{Array{Int64,1}}}) at /Users/tomkwong/.julia/v0.6/DataFrames/src/groupeddataframe/grouping.jl:202
[7] by(::DataFrames.DataFrame, ::Symbol, ::Function) at /Users/tomkwong/.julia/v0.6/DataFrames/src/groupeddataframe/grouping.jl:293
In JuliaDB the default is to pass only the "non grouping" columns to the anonymous function, but I'm not sure whether there are strong reasons in favor of one option or the other. Otherwise, having some syntactic sugar to select everything except the grouping columns may help.
It was a surprise to me because I thought the semantics were to summarize each group into a value that fits in a single cell. In the above example, I was expecting only 4 rows to be returned, with the new column containing elements that are individual data frames.
I like the fact that the new column can contain objects, e.g. passing x -> size(x) would return a tuple in that column. The current behavior seems inconsistent when the function returns a DataFrame.
I think you're only seeing this awkward behavior because of the column names. The problem is that by always tries to construct a data frame that starts with the columns that were grouped by, so if you return a DataFrame for each group that itself contains those columns, it will wind up trying to duplicate them. There really should be a keyword argument that tells it not to include the grouped-by columns; perhaps it's worth thinking about this more carefully and opening an issue.
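As a workaround in the meantime, here is a minimal sketch, assuming the DataFrames version from the stack trace above (Julia v0.6, where by is available). The idea is to have the function return only the non-grouping columns, so that by has nothing to duplicate when it prepends :a. The column names b and c come from the example df in the question; res is just an illustrative name.

```julia
using DataFrames

# Same shape as the data frame in the question.
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = randn(8))

# Return only the non-grouping columns. `by` prepends the grouping
# column :a itself, so no duplicate column name is produced.
res = by(df, :a, d -> DataFrame(b = d[:b], c = d[:c]))
```

Under these assumptions, res should have the columns a, b, c and the same 8 rows as df, possibly in a different order.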
Just for reference, the JuliaDB implementation uses the select keyword for this. groupby(df, by, select = Not(by)) is the default and groupby(df, by, select = All()) would select all columns. To use grouping columns in the inner function the recommended approach is (keep in mind that groupby in JuliaDB is the same as by in DataFrames):
I've thought about this a bit more. I think that what we should aim for is for
by(identity, df, cols)
to return the original DataFrame except possibly for ordering. This would also agree with @Tamas_Papp's suggestion that the default behavior should be not to insert the grouped-by columns.
I'll try to make a PR, but I'm not sure when I'll get to it.