(crossposting from S.O.: http://stackoverflow.com/questions/42815117/a-generic-sreadsheet-like-pivot-function-for-julia-dataframes)
I implemented a generic pivot function for Julia DataFrames, with some of the nice bells and wishes you have in LibreOffice/Excel pivot tables.
In particular the function:
- allows multiple column fields;
- is able to filter column and/or row fields;
- allows to specify the operations to perform on data (sum, averageβ¦);
- provides basic sorting
I am however stock with allowing multiple column indices, and I wonder if that is possible at all with the current Julia DataFrames where, differently from Pandas, levels seems not to be implemented.
I understand that each of that functionalities (in particular filtering and sorting) can be better achieved with multiple passages, but I think it would be nice for users to have a single function that provides basic pivot functionality (while unpivot() at the other extreme doesnβt allow multiple row columns nor to specify how to aggregate) similar to what Excel/LibreOffice does.
Do you think would be worth to propose a push of it in the DataFrames package ?
EDITED 19/3/2017: accepts rowFields, ops and sort both as arrays or as scalars
EDITED 17/3/2017: added multiple operations and removed eval(symbol)
"""
Pivot and optionally filter and sort in a single function
```julia
pivot(df::AbstractDataFrame, rowFields, colField, valuesField; <keyword arguments>)
Arguments
- 
df::AbstractDataFrame: the original dataframe, in stacked version (dim1,dim2,dim3β¦ value)
- 
rowFields: the field(s) to be used as row categories (also known as IDs or keys)
- 
colField::Symbol: the field containing the values to be used as column headers
- 
valuesField::Symbol: the column containing the values to reshape
- 
ops=sum: the operation(s) to perform on the data, default on summing them (see notes)
- 
filter::Dict: an optional filter, in the form of a dictionary of column_to_filter => [list of ammissible values]
- 
sort: optional row field(s) to sort (see notes)
Notes
- ops can be any supported Julia operation over a single array, for example: sum,mean,length,countnz,maximum,minimum,var,std,prod.
 Multiple operations can be specified using an array, and in such case an additional column is created to index them.
- filters are optional. Only infilter is supported.
- sort is possible only for row fields. You can specify reverse ordering for any column passing a touple (:colname, true) instead of just :colname.
 You can pass multiple columns to be sorted in an array, e.g. [(:col1,true),:col2].
Examples
julia> df = DataFrame(region   = ["US","US","US","US","EU","EU","EU","EU","US","US","US","US","EU","EU","EU","EU"],
                      product  = ["apple","apple","banana","banana","apple","apple","banana","banana","apple","apple","banana","banana","apple","apple","banana","banana"],
                      year     = [2010,2011,2010,2011,2010,2011,2010,2011,2010,2011,2010,2011,2010,2011,2010,2011],
                      produced = [3.3,3.2,2.3,2.1,2.7,2.8,1.5,1.3,  4.3,4.2,3.3,2.3,3.7,3.8,2.0,3.3],
                      consumed = [4.3,7.4,2.5,9.8,3.2,4.3,6.5,3.0,  5.3,7.4,3.5,9.8,4.2,6.3,8.5,4.0],
                      category = ['A','A','A','A','A','A','A','A', 'B','B','B','B','B','B','B','B',])
julia> longDf = stack(df,[:produced,:consumed])
julia> pivDf  = pivot(longDf, [:product, :region,], :year, :value,
                      ops    = [mean, var],
                      filter = Dict(:variable => [:consumed]),
                      sort   = [:product, (:region, true)]
                     )
 8Γ5 DataFrames.DataFrame
 β Row β product  β region β op     β 2010 β 2011 β
 βββββββΌβββββββββββΌβββββββββΌβββββββββΌβββββββΌβββββββ€
 β 1   β "apple"  β "US"   β "mean" β 4.8  β 7.4  β
 β 2   β "apple"  β "US"   β "var"  β 0.5  β 0.0  β
 β 3   β "apple"  β "EU"   β "mean" β 3.7  β 5.3  β
 β 4   β "apple"  β "EU"   β "var"  β 0.5  β 2.0  β
 β 5   β "banana" β "US"   β "mean" β 3.0  β 9.8  β
 β 6   β "banana" β "US"   β "var"  β 0.5  β 0.0  β
 β 7   β "banana" β "EU"   β "mean" β 7.5  β 3.5  β
 β 8   β "banana" β "EU"   β "var"  β 2.0  β 0.5  β
ββ"
function pivot(df::AbstractDataFrame, rowFields, colField::Symbol, valuesField::Symbol; ops=sum, filter::Dict=Dict(), sort=)
for (k,v) in filter
  df = df[ [i in v for i in df[k]], :]
end
sortv = []
sortOptions = []
if(isa(sort, Array))
    sortv = sort
else
    push!(sortv,sort)
end
for i in sortv
    if(isa(i, Tuple))
      push!(sortOptions, order(i[1], rev = i[2]))
    else
      push!(sortOptions, order(i))
    end
end
catFields::AbstractVector{Symbol} = cat(1,rowFields, colField)
dfs  = DataFrame[]
opsv =[]
if(isa(ops, Array))
    opsv = ops
else
    push!(opsv,ops)
end
for op in opsv
    dft = by(df, catFields) do df2
        a = DataFrame()
        a[valuesField] = op(df2[valuesField])
        if(length(opsv)>1)
            a[:op] = string(op)
        end
        a
    end
    push!(dfs,dft)
end
df = vcat(dfs)
df = unstack(df,colField,valuesField)
sort!(df, cols = sortOptions)
return df
end