GroupedErrors.jl is a package meant to simplify the analysis and visualization of grouped data. Grouped data is particularly common in several fields, experimental psychology and behavioural neuroscience being the ones I’m most familiar with. For example, one could have data from an experiment with several subjects and many measurements from each subject. Two types of plots are particularly common in this scenario:
1. Subject by subject plot (generally a scatter plot)
Some simple summary statistics are computed for each experimental subject (for example the mean and s.e.m. of some quantity) and then plotted against other summary statistics, potentially splitting by some categorical experimental variable.
2. Population plot (generally a ribbon plot in continuous case, or bar plot in discrete case)
Some statistical analysis is computed at the single subject level (for example the density/hazard/cumulative of some variable, or the expected value of a variable given another) and the result is then summarized across subjects (for example by taking the mean and s.e.m.), potentially splitting by some categorical experimental variable.
In both cases, the categorical experimental variables used to split the data can be represented in the plot as plot attributes (solid vs dashed line, color, markershape).
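To make the first kind of plot concrete, here is a minimal sketch in plain Julia of the per-subject summary it relies on. It is only an illustration, not the package's implementation: the toy data and the names sem, group_by_subject and subject_summary are hypothetical, and the code assumes a recent Julia (1.x), where mean and std live in the Statistics standard library.

```julia
using Statistics  # mean and std (standard library in Julia 1.x)

# Hypothetical toy data: one row per measurement, stored as NamedTuples.
rows = [
    (subject = "a", x = 1.0, y = 2.0),
    (subject = "a", x = 3.0, y = 4.0),
    (subject = "b", x = 2.0, y = 1.0),
    (subject = "b", x = 4.0, y = 3.0),
    (subject = "b", x = 6.0, y = 5.0),
]

# Standard error of the mean: sample standard deviation over sqrt(n).
sem(v) = std(v) / sqrt(length(v))

# Collect the measurements that belong to each subject.
function group_by_subject(rows)
    groups = Dict{String, Vector{eltype(rows)}}()
    for r in rows
        push!(get!(groups, r.subject, eltype(rows)[]), r)
    end
    return groups
end

# One summary point per subject: mean and s.e.m. of x and of y.
function subject_summary(rows)
    out = Dict{String, NamedTuple}()
    for (subject, rs) in group_by_subject(rows)
        xs = [r.x for r in rs]
        ys = [r.y for r in rs]
        out[subject] = (x = mean(xs), xerr = sem(xs), y = mean(ys), yerr = sem(ys))
    end
    return out
end

points = subject_summary(rows)
```

Each entry of points would then become one marker in the scatter plot, with xerr and yerr as its error bars.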
The GroupedErrors package provides macros that create, with a simple command, a statistical object (ProcessedTable) that corresponds to these types of analysis and can be easily plotted using the @plot macro. Plotting is implemented via the Plots.jl package, and all the Plots attributes can be used in combination with this analysis, also as a function of the categorical variables used to split the data.
Here are a couple of examples of plots of type 1 and 2. @splitby chooses which variable(s) will be used to split the data into different plot traces. @across specifies the “population” variable, @x and @y select what goes on the x and y axes, and @set_attr adds a plot attribute that is a function of the splitting variables. All of these macros take as input an anonymous function (expressed with the _ syntax), so you can also use a quantity that is not a column of your data (e.g. @x _.MAch - _.SSS would also be acceptable).
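One way to read the _ syntax, offered as an assumption about its meaning rather than a description of how the macros work internally, is that _ stands for one row of the table. Under that reading, _.MAch - _.SSS behaves like an ordinary anonymous function of a row. The sketch below uses plain Julia with hypothetical rows and made-up values:

```julia
# Hypothetical rows standing in for a table with MAch and SSS columns.
rows = [
    (MAch = 6.0,  SSS = -1.5),
    (MAch = 20.0, SSS = -0.5),
    (MAch = 10.0, SSS = 1.0),
]

# Selecting an existing column: the analogue of _.MAch
select_mach = row -> row.MAch

# A derived quantity that is not a column: the analogue of _.MAch - _.SSS
mach_minus_sss = row -> row.MAch - row.SSS

xs = map(select_mach, rows)
derived = map(mach_minus_sss, rows)
```

Both functions are applied row by row, so a derived quantity costs nothing extra compared with picking a column.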
using GroupedErrors
using DataFrames, RDatasets, Plots
school = RDatasets.dataset("mlmRev","Hsb82")
# Type 1: subject-by-subject scatter plot, one point per school
@> school begin
    @splitby _.Sx
    @across _.School
    @x _.MAch
    @y _.SSS
    @plot scatter(legend = :topleft)
end
# Type 2: population plot, density of CSES summarized across schools
@> school begin
    @splitby (_.Minrty, _.Sx)
    @across _.School
    @set_attr :linestyle _[1] == "Yes" ? :solid : :dash
    @set_attr :color _[2] == "Male" ? :black : :blue
    @x _.CSES
    @y :density bandwidth = 0.2
    @plot
end
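For intuition about the second kind of plot, the following sketch does a similar computation by hand in plain Julia: a density curve is estimated for each subject on a common grid, and the curves are then summarized across subjects by their mean and s.e.m. The Gaussian kernel, the grid, the toy data and the function name gaussian_kde are all assumptions made for illustration; nothing here is taken from the package's source. It assumes Julia 1.x for the Statistics standard library.

```julia
using Statistics  # mean and std (standard library in Julia 1.x)

# Gaussian kernel density estimate of `data`, evaluated at each point of `grid`.
function gaussian_kde(data, grid, bandwidth)
    norm_const = 1 / (length(data) * bandwidth * sqrt(2π))
    return [norm_const * sum(exp(-0.5 * ((g - d) / bandwidth)^2) for d in data)
            for g in grid]
end

# Hypothetical measurements for three subjects (for example three schools).
subjects = Dict(
    "s1" => [-0.4, -0.1, 0.0, 0.3],
    "s2" => [-0.2, 0.1, 0.2, 0.5],
    "s3" => [-0.6, -0.3, 0.1, 0.2],
)

grid = -1.5:0.05:1.5   # common x axis shared by all subjects
bandwidth = 0.2

# One density curve per subject, stored as the columns of a matrix.
curves = hcat([gaussian_kde(v, grid, bandwidth) for v in values(subjects)]...)

# Summarize across subjects at every grid point.
center = vec(mean(curves, dims = 2))
ribbon = vec(std(curves, dims = 2)) ./ sqrt(size(curves, 2))
```

With Plots.jl, a call such as plot(grid, center, ribbon = ribbon) would draw the mean curve with a shaded s.e.m. band, which is the kind of ribbon plot described above.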
The package accepts any IterableTable as input (DataFrame, IndexedTable, CSV source etc.) and is compatible with the standalone Query macros for data selection and preprocessing.
Missing data is supported: all the rows that have missing data in a column that is relevant for the analysis will be excluded.
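The rule for missing data can be pictured with a small plain-Julia sketch. The rows, the Note column and the helper is_complete are hypothetical and only illustrate the stated behaviour (drop a row when a column relevant to the analysis is missing); they are not the package's code.

```julia
# Hypothetical rows; `missing` marks an absent measurement.
rows = [
    (School = "1224", MAch = 5.9,     SSS = -1.5,    Note = missing),
    (School = "1224", MAch = missing, SSS = -0.6,    Note = "ok"),
    (School = "1288", MAch = 20.3,    SSS = missing, Note = "ok"),
    (School = "1288", MAch = 9.0,     SSS = 0.4,     Note = missing),
]

# Columns that the analysis actually uses.
relevant = (:School, :MAch, :SSS)

# Keep a row only if none of the relevant columns is missing.
is_complete(row) = !any(col -> ismissing(row[col]), relevant)

kept = filter(is_complete, rows)
```

The first and last rows survive even though Note is missing there, because Note is not among the columns used by the analysis.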
It is possible that the package is missing some very common analysis for population data, in which case don’t hesitate to open a “feature request” issue and I’ll look into it.
The package is not released yet, but I opened a PR against METADATA a few days ago, so I believe it should be released shortly. Until that happens, you can install it by typing:
Pkg.clone("https://github.com/piever/GroupedErrors.jl.git")
in the Julia REPL.
For more details please refer to the README.
A GUI based on QML and the GR Plots.jl backend is in the works to simplify the use of this package even further by allowing the analysis to be chosen just by clicking on some widgets. It is recommended for:
- users not very comfortable with coding
- purely exploratory data analysis on a dataset with a large number of columns, where doing all plots by hand would be too time-consuming
The GUI is not quite finished yet, so I’ll post a separate announcement when it is.