(crossposting from S.O.: http://stackoverflow.com/questions/42815117/a-generic-sreadsheet-like-pivot-function-for-julia-dataframes)
I implemented a generic pivot function for Julia DataFrames, with some of the nice bells and wishes you have in LibreOffice/Excel pivot tables.
In particular the function:
- allows multiple column fields;
- is able to filter column and/or row fields;
- allows to specify the operations to perform on data (sum, averageβ¦);
- provides basic sorting
I am however stock with allowing multiple column indices, and I wonder if that is possible at all with the current Julia DataFrames where, differently from Pandas, levels seems not to be implemented.
I understand that each of that functionalities (in particular filtering and sorting) can be better achieved with multiple passages, but I think it would be nice for users to have a single function that provides basic pivot functionality (while unpivot() at the other extreme doesnβt allow multiple row columns nor to specify how to aggregate) similar to what Excel/LibreOffice does.
Do you think would be worth to propose a push of it in the DataFrames package ?
EDITED 19/3/2017: accepts rowFields, ops and sort both as arrays or as scalars
EDITED 17/3/2017: added multiple operations and removed eval(symbol)
"""
Pivot and optionally filter and sort in a single function
```julia
pivot(df::AbstractDataFrame, rowFields, colField, valuesField; <keyword arguments>)
Arguments
-
df::AbstractDataFrame
: the original dataframe, in stacked version (dim1,dim2,dim3β¦ value) -
rowFields
: the field(s) to be used as row categories (also known as IDs or keys) -
colField::Symbol
: the field containing the values to be used as column headers -
valuesField::Symbol
: the column containing the values to reshape -
ops=sum
: the operation(s) to perform on the data, default on summing them (see notes) -
filter::Dict
: an optional filter, in the form of a dictionary of column_to_filter => [list of ammissible values] -
sort
: optional row field(s) to sort (see notes)
Notes
- ops can be any supported Julia operation over a single array, for example:
sum
,mean
,length
,countnz
,maximum
,minimum
,var
,std
,prod
.
Multiple operations can be specified using an array, and in such case an additional column is created to index them. - filters are optional. Only
in
filter is supported. - sort is possible only for row fields. You can specify reverse ordering for any column passing a touple (:colname, true) instead of just :colname.
You can pass multiple columns to be sorted in an array, e.g. [(:col1,true),:col2].
Examples
julia> df = DataFrame(region = ["US","US","US","US","EU","EU","EU","EU","US","US","US","US","EU","EU","EU","EU"],
product = ["apple","apple","banana","banana","apple","apple","banana","banana","apple","apple","banana","banana","apple","apple","banana","banana"],
year = [2010,2011,2010,2011,2010,2011,2010,2011,2010,2011,2010,2011,2010,2011,2010,2011],
produced = [3.3,3.2,2.3,2.1,2.7,2.8,1.5,1.3, 4.3,4.2,3.3,2.3,3.7,3.8,2.0,3.3],
consumed = [4.3,7.4,2.5,9.8,3.2,4.3,6.5,3.0, 5.3,7.4,3.5,9.8,4.2,6.3,8.5,4.0],
category = ['A','A','A','A','A','A','A','A', 'B','B','B','B','B','B','B','B',])
julia> longDf = stack(df,[:produced,:consumed])
julia> pivDf = pivot(longDf, [:product, :region,], :year, :value,
ops = [mean, var],
filter = Dict(:variable => [:consumed]),
sort = [:product, (:region, true)]
)
8Γ5 DataFrames.DataFrame
β Row β product β region β op β 2010 β 2011 β
βββββββΌβββββββββββΌβββββββββΌβββββββββΌβββββββΌβββββββ€
β 1 β "apple" β "US" β "mean" β 4.8 β 7.4 β
β 2 β "apple" β "US" β "var" β 0.5 β 0.0 β
β 3 β "apple" β "EU" β "mean" β 3.7 β 5.3 β
β 4 β "apple" β "EU" β "var" β 0.5 β 2.0 β
β 5 β "banana" β "US" β "mean" β 3.0 β 9.8 β
β 6 β "banana" β "US" β "var" β 0.5 β 0.0 β
β 7 β "banana" β "EU" β "mean" β 7.5 β 3.5 β
β 8 β "banana" β "EU" β "var" β 2.0 β 0.5 β
ββ"
function pivot(df::AbstractDataFrame, rowFields, colField::Symbol, valuesField::Symbol; ops=sum, filter::Dict=Dict(), sort=)
for (k,v) in filter
df = df[ [i in v for i in df[k]], :]
end
sortv = []
sortOptions = []
if(isa(sort, Array))
sortv = sort
else
push!(sortv,sort)
end
for i in sortv
if(isa(i, Tuple))
push!(sortOptions, order(i[1], rev = i[2]))
else
push!(sortOptions, order(i))
end
end
catFields::AbstractVector{Symbol} = cat(1,rowFields, colField)
dfs = DataFrame[]
opsv =[]
if(isa(ops, Array))
opsv = ops
else
push!(opsv,ops)
end
for op in opsv
dft = by(df, catFields) do df2
a = DataFrame()
a[valuesField] = op(df2[valuesField])
if(length(opsv)>1)
a[:op] = string(op)
end
a
end
push!(dfs,dft)
end
df = vcat(dfs)
df = unstack(df,colField,valuesField)
sort!(df, cols = sortOptions)
return df
end