A spreadsheet-like pivot function

proposal

#1

(crossposting from S.O.: http://stackoverflow.com/questions/42815117/a-generic-sreadsheet-like-pivot-function-for-julia-dataframes)

I implemented a generic pivot function for Julia DataFrames, with some of the nice bells and wishes you have in LibreOffice/Excel pivot tables.
In particular the function:

  1. allows multiple column fields;
  2. is able to filter column and/or row fields;
  3. allows to specify the operations to perform on data (sum, average…);
  4. provides basic sorting

I am however stock with allowing multiple column indices, and I wonder if that is possible at all with the current Julia DataFrames where, differently from Pandas, levels seems not to be implemented.

I understand that each of that functionalities (in particular filtering and sorting) can be better achieved with multiple passages, but I think it would be nice for users to have a single function that provides basic pivot functionality (while unpivot() at the other extreme doesn’t allow multiple row columns nor to specify how to aggregate) similar to what Excel/LibreOffice does.

Do you think would be worth to propose a push of it in the DataFrames package ?

EDITED 19/3/2017: accepts rowFields, ops and sort both as arrays or as scalars
EDITED 17/3/2017: added multiple operations and removed eval(symbol)

"""
Pivot and optionally filter and sort in a single function

```julia
pivot(df::AbstractDataFrame, rowFields, colField, valuesField; <keyword arguments>)

Arguments

  • df::AbstractDataFrame: the original dataframe, in stacked version (dim1,dim2,dim3… value)
  • rowFields: the field(s) to be used as row categories (also known as IDs or keys)
  • colField::Symbol: the field containing the values to be used as column headers
  • valuesField::Symbol: the column containing the values to reshape
  • ops=sum: the operation(s) to perform on the data, default on summing them (see notes)
  • filter::Dict: an optional filter, in the form of a dictionary of column_to_filter => [list of ammissible values]
  • sort: optional row field(s) to sort (see notes)

Notes

  • ops can be any supported Julia operation over a single array, for example: sum, mean, length, countnz, maximum, minimum, var, std, prod.
    Multiple operations can be specified using an array, and in such case an additional column is created to index them.
  • filters are optional. Only in filter is supported.
  • sort is possible only for row fields. You can specify reverse ordering for any column passing a touple (:colname, true) instead of just :colname.
    You can pass multiple columns to be sorted in an array, e.g. [(:col1,true),:col2].

Examples

julia> df = DataFrame(region   = ["US","US","US","US","EU","EU","EU","EU","US","US","US","US","EU","EU","EU","EU"],
                      product  = ["apple","apple","banana","banana","apple","apple","banana","banana","apple","apple","banana","banana","apple","apple","banana","banana"],
                      year     = [2010,2011,2010,2011,2010,2011,2010,2011,2010,2011,2010,2011,2010,2011,2010,2011],
                      produced = [3.3,3.2,2.3,2.1,2.7,2.8,1.5,1.3,  4.3,4.2,3.3,2.3,3.7,3.8,2.0,3.3],
                      consumed = [4.3,7.4,2.5,9.8,3.2,4.3,6.5,3.0,  5.3,7.4,3.5,9.8,4.2,6.3,8.5,4.0],
                      category = ['A','A','A','A','A','A','A','A', 'B','B','B','B','B','B','B','B',])
julia> longDf = stack(df,[:produced,:consumed])
julia> pivDf  = pivot(longDf, [:product, :region,], :year, :value,
                      ops    = [mean, var],
                      filter = Dict(:variable => [:consumed]),
                      sort   = [:product, (:region, true)]
                     )
 8Γ—5 DataFrames.DataFrame
 β”‚ Row β”‚ product  β”‚ region β”‚ op     β”‚ 2010 β”‚ 2011 β”‚
 β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€
 β”‚ 1   β”‚ "apple"  β”‚ "US"   β”‚ "mean" β”‚ 4.8  β”‚ 7.4  β”‚
 β”‚ 2   β”‚ "apple"  β”‚ "US"   β”‚ "var"  β”‚ 0.5  β”‚ 0.0  β”‚
 β”‚ 3   β”‚ "apple"  β”‚ "EU"   β”‚ "mean" β”‚ 3.7  β”‚ 5.3  β”‚
 β”‚ 4   β”‚ "apple"  β”‚ "EU"   β”‚ "var"  β”‚ 0.5  β”‚ 2.0  β”‚
 β”‚ 5   β”‚ "banana" β”‚ "US"   β”‚ "mean" β”‚ 3.0  β”‚ 9.8  β”‚
 β”‚ 6   β”‚ "banana" β”‚ "US"   β”‚ "var"  β”‚ 0.5  β”‚ 0.0  β”‚
 β”‚ 7   β”‚ "banana" β”‚ "EU"   β”‚ "mean" β”‚ 7.5  β”‚ 3.5  β”‚
 β”‚ 8   β”‚ "banana" β”‚ "EU"   β”‚ "var"  β”‚ 2.0  β”‚ 0.5  β”‚

β€œβ€"
function pivot(df::AbstractDataFrame, rowFields, colField::Symbol, valuesField::Symbol; ops=sum, filter::Dict=Dict(), sort=[])

for (k,v) in filter
  df = df[ [i in v for i in df[k]], :]
end

sortv = []
sortOptions = []
if(isa(sort, Array))
    sortv = sort
else
    push!(sortv,sort)
end
for i in sortv
    if(isa(i, Tuple))
      push!(sortOptions, order(i[1], rev = i[2]))
    else
      push!(sortOptions, order(i))
    end
end

catFields::AbstractVector{Symbol} = cat(1,rowFields, colField)
dfs  = DataFrame[]
opsv =[]
if(isa(ops, Array))
    opsv = ops
else
    push!(opsv,ops)
end

for op in opsv
    dft = by(df, catFields) do df2
        a = DataFrame()
        a[valuesField] = op(df2[valuesField])
        if(length(opsv)>1)
            a[:op] = string(op)
        end
        a
    end
    push!(dfs,dft)
end

df = vcat(dfs)
df = unstack(df,colField,valuesField)
sort!(df, cols = sortOptions)
return df

end


#2

Interesting. DataFrames and DataTables do not support multiple column indices on purpose. This feature isn’t generally supported by database-like structures (AFAIK Pandas is an exception). If you need these, maybe you shouldn’t store the result in a DataFrame but in a different structure like a NamedArray an AxisArray or an IndexedTable. With a DataFrame, you would need one row for each combination of indices, with one column for each index giving the level (this is the principle of tidy data).

Regarding the implementation, you could probably use aggregate. Also, you don’t need to call eval(op): just pass the function instead of its name, i.e. sum instead of :sum.


#3

Hi @nalimilan … thanks for your comments.

I am using by/do instead of aggregate because otherwise I would have first to remove all the columns that are not either categories nor the values on which I want to operate.

From the example I am using:

df = DataFrame(region   = ["US","US","US","US","EU","EU","EU","EU"],
                      product  = ["apple","apple","banana","banana","apple","apple","banana","banana"],
                      year     = [2010,2011,2010,2011,2010,2011,2010,2011],
                      produced = [3.3,3.2,2.3,2.1,2.7,2.8,1.5,1.3],
                      consumed = [4.3,7.4,2.5,9.8,3.2,4.3,6.5,3.0])
longDf = stack(df,[:produced,:consumed])
# longDf: :variable  :value  :region  :product  :year
a = aggregate(longDf,[:variable,:year],sum)

This will return in an error because aggregate tries to apply the aggregation function toward all the non-ids columns.
With by/do I can instead specify which columns are ids, which column contains the values over to perform the aggregation, whenever there are extra cols or not in the db.
But something I could do is allow multiple operations, and if there is more than one, add a new column with the name of the operation.
Concerning that, which type declaration should I use in the parameters? E.g. if I expect a :sum or a :prod I can just use Symbol, but if I expect sum or pro what can I use? typeof() return me Base.#sum and Base.#pro… have them a common supertype I can use in the parameter declaration of the function for pro, sum , etc… ?


#4

OK, I see.

Just use ::Any, since any type may be callable in Julia. You could use ::Function to accept only functions, but that would unnecessarily restrict the accepted operations.


#5

Have you thought perhaps this function would go well in a β€œSpreadsheet” package for Julia that contains a collection of functions found on a spreadsheet. This would be aimed at users like me who are moving over from spreadsheets into Julia and would appreciate a conceptual bridge to what you can conceptually do on a spreadsheet but in Julia.
I am sure this would go down well.