Graphical Exploratory Data Analysis in Julia with VegaLite/Gadfly/Plots

I use R quite a bit for exploratory data analysis, data munging, visualization, and model fitting. I’m trying to simultaneously teach myself how to use Julia in this context, and write some tutorials.

Here are some typical plots I might try to make in R:

  1. A point scatterplot of y vs x with a smoothed trend line over the top. In R you can produce this easily with ggplot(data,aes(x,y)) + geom_point() + geom_smooth()

  2. A histogram with a smoothed Kernel Density estimate superimposed: ggplot(data,aes(x))+geom_histogram()+geom_density()

  3. A 2D heatmap: ggplot(data,aes(x,y))+geom_tile()

  4. A 2D point scatter plot of a large number of points (thousands) using alpha to represent density: ggplot(data,aes(x,y))+geom_point(alpha=0.1)

  5. Counts of how many observations fall in each category of a categorical variable x (a “factor” in R): ggplot(data,aes(x))+geom_bar()

  6. Point + interval plots as summaries of posterior samples from a Bayesian model…

I’d like to figure out how to reproduce something like those examples in VegaLite. I chose VegaLite because it seems to be oriented around the idea of composability, and it has fairly beautiful output.

Would anyone with VegaLite experience be interested in helping me figure out how to do this, or have some resources to point me at?


It seems like the VegaLite package offers more or less a direct translation of the JSON spec for vega-lite, but does not offer higher-level constructors. Is that more or less correct? It seems like there is room for some sort of type hierarchy that specifies various layers like histograms, curves, KDEs, polygons, points, bars, etc., plus a VegaLite.VLSpec constructor that takes a sequence of these types and builds the corresponding specification. Does that make some kind of sense? Is that something that’s already planned, @davidanthoff?

You may want to take a look at Gadfly.jl. AFAIK, it has the closest syntax to ggplot’s grammar of graphics.

Ggplot(ly) is actually the last remaining reason I keep R installed; I haven’t yet found a perfect replacement for ggplot’s syntax and plotly’s interactivity. A recent project has had me struggling with ggplot’s poor performance, so I’m going to try to make the switch to PlotlyJS.jl. I may end up using a combination of that and Gadfly.

I will try out Gadfly. It’s been on my radar, and it’s pretty cool that it’s 100% Julia with rendering to SVG. On the other hand, VegaLite has a big vis group doing development behind it, so it would be nice to take advantage of that if there were a sufficiently high-level interface to produce plots. It’s hard enough to figure out what’s going on in your data or models… also having to hand-hold the vis library through a JSON spec is not going to be friendly to data analysts.

But it seems like it wouldn’t be too hard to have a kind of compiler that takes a bunch of special structs and grinds out the JSON.

There’s also the Plots.jl ecosystem, which leverages Julia’s type system to define “recipes” for plotting all sorts of data structures, and its extension StatsPlots, which I think offers all of the things on your list. Look at the readme here: https://github.com/JuliaPlots/StatsPlots.jl

It’s not clear to me how composable Plots/StatsPlots are. For example, is it possible to do a heatmap, then overlay a specific subset of points, and then overlay 10 posterior samples from a smoother?

And I don’t mean, is it possible to create a type called “HeatMapPlusPointsAndSmoother” and then create a recipe for it… I mean, can I call for a heat map, and then take the output of that and add to it a points plot and add to that a set of curves?

Composability of a graphics system is key in my mind for it to make sense in a data analysis setting.

VegaLite has the explicit notion of a layer built into the underlying library, which facilitates composability quite a bit. Gadfly also seems to have stacking and layers built in as primitives. I guess I’ll do some trials with Gadfly.

I don’t think that’s a problem (although I’m not a VegaLite user, mainly because I’m a Matlab convert so never felt comfortable with Grammar of Graphics).

I’ll admit that the following is uniquely ugly, but it should give an idea of the kind of composability you describe - basically just use plotting functions with a ! and they’ll add a layer to the existing plot:

julia> using Plots

julia> heatmap(["1","2","3"],["1","2","3"],[1 2 3;2 3 1;3 1 1])

julia> scatter!(rand(0:0.01:3, 10), rand(0:0.01:3, 10))

julia> histogram!(rand(0:0.01:3, 1_000), alpha = 0.5, normalize = true)

[image: heatmap with scatter points and a semi-transparent normalized histogram layered on top]

Also just to add, I think your HeatMapPlusPointsAndSmoother isn’t quite what the recipes system is trying to do - you don’t create artificial types to then design recipes for them, but rather you create recipes for existing types that have relevance outside a plotting context (e.g. a type that describes a stochastic differential equation and its solution, or a type that describes a probability distribution). The recipes then provide information to Plots.jl (or indeed other plotting packages that implement recipes) about how an instance of this type should be plotted.
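To make that concrete, here is a minimal sketch of a recipe written with RecipesBase.jl. The IntervalSummary type and its field names are hypothetical, invented only for this illustration (loosely modelled on the point + interval plots in item 6 of the original list); the attribute names seriestype, yerror and legend are ones Plots.jl understands.

```julia
using RecipesBase

# Hypothetical type: point estimates with lower/upper interval bounds,
# e.g. summaries of posterior samples. It is useful without any plotting.
struct IntervalSummary
    labels::Vector{String}
    estimate::Vector{Float64}
    lower::Vector{Float64}
    upper::Vector{Float64}
end

# The recipe tells a plotting package how to draw an IntervalSummary:
# points at the estimates, with asymmetric error bars out to the bounds.
@recipe function f(s::IntervalSummary)
    seriestype --> :scatter   # default, can be overridden by the caller
    yerror --> (s.estimate .- s.lower, s.upper .- s.estimate)
    legend --> false
    s.labels, s.estimate      # the x and y data handed on to the backend
end

s = IntervalSummary(["alpha", "beta"], [0.5, 1.2], [0.1, 0.9], [0.8, 1.6])
```

With Plots loaded, plot(s) should then draw the points with their intervals, and plot!(s) should add them as a layer on an existing figure, with no composite type needed.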

Here’s a quick Gadfly example of the plots in the OP:

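using Gadfly
# Assumes a DataFrame df with columns x, y, y2 and group, n = the number of rows in df,
# and a vector x with at least 100 values.
# p1: scatterplot with a smoothed trend line layered on top
p1 = plot(df, x=:x, y=:y, alpha=[0.5], color=["Point"],
    layer(x=:x, y=:y, Geom.smooth(smoothing=0.3), color=fill("Smooth", n)),
    Geom.point)
# p2: kernel density estimates at two bandwidths
p2 = plot(df, x=:x, Geom.density(bandwidth=0.1), color=["bw=0.1"],
    layer(x=:x, Geom.density(bandwidth=0.25), color=["bw=0.25"]))
# p3: heatmap of a 10x10 matrix
p3 = spy(reshape(x[1:100],10,10), Scale.x_discrete, Scale.y_discrete)
# p4: scatterplot of many points, using alpha to show density
p4 = plot(df, x=:x, y=:y2, alpha=[0.1], Geom.point, Theme(highlight_width=0mm))
# p5: counts per category
p5 = plot(df, x=:group, color=:group, Geom.histogram)
# p6: point + interval plot, left empty (see the note on p6 after this code)
p6 = plot()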


There are obvious improvements to be made, e.g. allowing color=["Smooth"] in plot p1. For p3, there is also Heatmap.jl, which is a work in progress. For p6, we don’t specifically have a stat to do point intervals (there is Geom.errorbar), but you can write custom statistics and guides (and Heatmap.jl is an example of that). See NEWS.md for what’s in the next release (imminent).


Thanks very much for the Gadfly examples! I think I will focus on Gadfly for my data analysis tutorials.

Also thanks @nilshg for winning both the “helpful example” award and the “most ugly plot” award :slight_smile:

also re:

Yes, and that makes good sense. I think your example shows that you don’t want to represent composite plots by types (it’s a combinatorial nightmare!), you want plots for specific meaningful types and then composition ability of those plots. Composition is a really essential component of graphics because you often want to show how several things work together. For example, you might have some points, and then two ways of smoothing, so you want to be able to plot the points, plot smoother type 1 and then plot smoother type 2 and show how they behave differently/similarly. Or you might have a heatmap from data, plus an analytical surface that you want contour plotted onto the heatmap… or whatever. If you had to create a type for each composition it’d be a nightmare. Thanks for your example, since it shows you don’t need special “composite types”, you can just plot! over the top.


Hey @Mattriks, thanks again for those examples. I’m trying to do some plots and sometimes you just want to control colors yourself. Is it possible to control the color of say a point, or the fill underneath a density curve by specifying a color, rather than specifying it as an aesthetic mapped to a data value?

like plot(…,color = “red”) and not have it look for a column named “red” in the dataset but instead just use the color red?

Also is it possible to specify a computed value without computing it and adding it to the dataset, like suppose for example I have a column called “value” and I want to plot value^2?

plot(mydata,x=:time,y=:value^2)

doesn’t work of course, because you can’t square the symbol value. I’m down with that, but can I supply a function like y = z -> z.value^2 or something similar?

Also, there appears to be something about Gadfly that holds back Turing so recent versions can’t be installed. Is that likely to change soonish at the next release you mentioned?

Re functions, see Plotting in the manual, and in the Gallery Stat.func.

Re color, there’s been an update recently, so to make that work now you need to do ]add Gadfly#master. As hinted above, the next release is imminent. Also see issue #1430.

That’s great about plotting functions, but what I meant was plotting transformed versions of the data without having to compute the transform into the data set… so for example:


d=DataFrame(x=[i for i in 1:10],y=[i^2 for i in 1:10])

plot(d,x=:x,y=:y) will plot a quadratic… but I want to show that if you take the sqrt(y) you’ll get a line…

plot(d,x=:x,y=...., Geom.point)

for some expression in place of the dots, which would give the same result as:

d.sqrty = sqrt.(d.y); plot(d,x=:x,y=:sqrty,Geom.point)

So plotting a set of points that are the result of applying some function to the values in each row…

Obviously, I can always do d |> @map(....) |> plot(...) to create the data mappings. I just wondered if there was some built in plotting that could describe a more arbitrary mapping than 1-1 from an axis to a single column of the dataset.

See the Tutorial e.g. Scale.y_sqrt.

So that sounds like a few common rescalings, but suppose I want something like dividing everything by sqrt(pi*n), or calculating my own special function f like y = f(a,b) for two columns a and b?

If you’re familiar with ggplot2, I’m thinking of something like aes(x,(a^2+b^2)/sqrt(2*pi*n)) where a, b, n are all fields in the dataset.

NOTE: It’s fine if this kind of thing isn’t possible. Hey, Query makes it easy, right? I’m just wondering what is possible.

In many ways it’s better if the answer is just “precompute the values” because then there aren’t too many ways to skin a cat so to speak

Not possible yet in Gadfly, so “precompute the values” :grinning:
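For the aes(x,(a^2+b^2)/sqrt(2*pi*n)) case, the precompute route might look like the minimal sketch below. The vectors a, b and n and their values are made up for illustration; they stand in for columns of the dataset.

```julia
# Hypothetical columns a, b and n, standing in for fields of the dataset.
a = [3.0, 1.0, 0.0]
b = [4.0, 1.0, 2.0]
n = [2, 8, 50]

# Row-wise transform computed once up front. The @. macro turns every
# call and operator in the expression into its broadcast (dotted) form.
z = @. (a^2 + b^2) / sqrt(2 * pi * n)

# With Gadfly loaded, the derived vector is plotted like any other data:
# plot(x=n, y=z, Geom.point)
```

The same broadcast expression works on DataFrame columns, e.g. assigning the result to a new column and then referring to it by symbol in plot.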

That is roughly correct. The general idea is that vega-lite is already a higher level layer on top of vega itself. We do have a number of syntactic shortcuts over the pure JSON case, described here that often make the VegaLite.jl version of a plot significantly more concise than the corresponding vega-lite JSON. But in general we do want to be close to the underlying vega-lite story.

So this is painful right now :slight_smile: The point part is simple, i.e. data |> @vlplot(:point, :x, :y), but doing a regression or trend line is very verbose, see here.

There is an open issue on the vega-lite repo to add a macro for this kind of situation here. My expectation is that once that is implemented we’ll be able to just do something like data |> @vlplot(x=:x, y=:y) + @vlplot(:point) + @vlplot(:smooth).

This is also a pretty good example why I’m hesitant to add special high-level constructs to VegaLite.jl for this kind of situation: I think the proper way to add this functionality is in the underlying vega-lite library, and then it automatically surfaces for the Julia wrapper.

Doing a histogram by itself is easy:

dataset("movies") |>
@vlplot(:bar, :IMDB_Rating, "count()")

Doing a density plot is relatively easy:

dataset("movies") |>
@vlplot(transform=[{density=:IMDB_Rating}], mark=:line, x="value:q", y="density:q")

Putting them together is a pain because there is no easy way to generate a normalized histogram that I could find, so the y axes of the two individual plots don’t match up nicely. I’ll try to raise this with the vega-lite folks; this seems like a pretty common kind of plot that shouldn’t be so complicated.

I think there might also be room for a vega-lite macro here that would allow one to do the density without specifying a manual transform.

Something like this:

dataset("movies") |>
@vlplot(
    :rect,
    x={:IMDB_Rating, bin={maxbins=60}},
    y={:Rotten_Tomatoes_Rating, bin={maxbins=40}},
    color="count()"
)

Agreed that this is still a bit too verbose. It would be nice if it did something reasonable without the need to specify the maxbins.

dataset("movies") |>
@vlplot({:circle, opacity=0.1},:IMDB_Rating,:Rotten_Tomatoes_Rating)

I think this is what you want:

dataset("movies") |>
@vlplot(:bar, x=:Major_Genre, y="count()")

right?

Not exactly sure what you are after, but maybe some of the examples in https://www.queryverse.org/VegaLite.jl/stable/examples/examples_error_bars_bands/ are what you are looking for?

In VegaLite.jl you can use the calculate transform to do pretty arbitrary computations, see here. Of course, I’m always in favor of using Query.jl, though :wink: I think computations inside the vega-lite spec make most sense if you have an interactive plot and the computation needs to update in response to the user selections. That scenario is difficult to handle with precomputed transformations, but should work with native vega-lite transforms.

One other thing to point out about VegaLite.jl is that there we have DataVoyager.jl as an interactive UI for data exploration that works hand-in-hand with VegaLite.jl (and we have a couple more similar interactive tools in the pipeline, they are just not ready yet). Not sure how useful that would be for your particular scenario, though.

Wow, so Vega/Vegalite is doing its own regression/loess/etc calculations. That’s interesting.

Help me understand what this does. Clearly it says you want a :bar plot of the :IMDB_Rating column, but then you pass a string “count()”, which I assume means you’re asking VegaLite to use a function called count() in JavaScript. I think it’s the “I don’t know what to do in JavaScript vs what to do in Julia” aspect of this interface that confuses me the most. I assume we could precompute the histogram in Julia in a normalized form, and then pass the data for a bar plot. Then this wouldn’t be so convoluted. But it also wouldn’t be a direct JavaScript translation anymore either…
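A minimal sketch of what precomputing a normalized histogram in plain Julia could look like. The function name normalized_histogram, the equal-width binning, and the sample ratings are all assumptions made for illustration, not part of any of the packages discussed here.

```julia
# Density-normalized histogram with equal-width bins:
# bar height = count / (total * bin width), so the bar areas sum to 1.
function normalized_histogram(values::AbstractVector{<:Real}, nbins::Integer)
    lo, hi = extrema(values)
    hi > lo || throw(ArgumentError("values must not all be equal"))
    width = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for v in values
        # Index of the bin containing v; the maximum falls into the last bin.
        i = min(floor(Int, (v - lo) / width) + 1, nbins)
        counts[i] += 1
    end
    bin_start = [lo + (i - 1) * width for i in 1:nbins]
    density = counts ./ (length(values) * width)
    return (bin_start = bin_start, bin_end = bin_start .+ width, density = density)
end

# Made-up ratings, just to exercise the function.
ratings = [1.0, 2.0, 2.5, 3.0, 3.5, 3.5, 4.0, 4.5, 5.0, 5.0]
h = normalized_histogram(ratings, 4)
```

Each (bin_start, bin_end, density) triple is then just data for a bar plot, and the density column is on the same scale as a kernel density estimate, so the two could share a y axis.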

How about composing plots from Julia? Is there some composition type operator, like + which would let you take two different vegalite plots and combine them together to get a layered plot? Or do you always have to do one plot with two layers?

See here, this is syntax to add an aggregation to an encoding definition. It is not embedded JavaScript. I think, in fact, that the only place where one would ever use JavaScript when using VegaLite.jl, is in the specification of calculated fields in transforms, nowhere else is JavaScript exposed to the Julia user. Then I also use positional arguments here, i.e. the first argument is the mark, the second the x and the third the y encoding channel. I could have written the same thing out in full as:

@vlplot(mark={type=:bar}, x={field=:IMDB_Rating}, y={aggregate=:count})

This is an example of a histogram that is precomputed, but I don’t think that makes things more complicated.

Yes, you can compose plots, but I haven’t written that documentation chapter yet, mostly because I’m still not 100% satisfied with what I’ve come up with so far. But the short of it is this:

  • You can concatenate with brackets: [p1 p2] and [p1; p2] for example
  • You layer with +, but the first plot will not form a layer but instead have properties that affect the whole plot. Say p1 + p2 + p3 will create a layered plot with two layers (p2 and p3), and p1 will have properties that affect the whole plot.
  • If you do p1 + p2 and p1 is of a type that requires a sub-spec, then p2 will become that sub spec.
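As a rough sketch of the first two rules, assuming VegaLite.jl and VegaDatasets.jl are installed and using the "cars" example dataset (the particular fields and marks are picked only for illustration):

```julia
using VegaLite, VegaDatasets

cars = dataset("cars")

# Layering with +: the first spec holds the shared encodings and applies
# to the whole plot; the two mark-only specs after it become the layers.
layered = cars |>
    @vlplot(x="Horsepower:q", y="Miles_per_Gallon:q") +
    @vlplot(:point) +
    @vlplot(:line)

# Two independent plots to concatenate.
left = cars |> @vlplot(:point, x="Horsepower:q", y="Miles_per_Gallon:q")
right = cars |> @vlplot(:bar, x="Origin:n", y="count()")

side_by_side = [left right]   # horizontal concatenation
stacked = [left; right]       # vertical concatenation
```

The field types are spelled out with the "field:q" / "field:n" string shorthand here so that nothing depends on type inference from the piped data.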

Thanks @davidanthoff for the explanations and examples. Somehow I had missed your gallery of advanced examples. While I can see the appeal of keeping VegaLite.jl close to the metal of the vega-lite library so as to continuously reap the improvements to that library… I am torn over what I think of as a rather verbose syntax for relatively common things. Perhaps what is needed is an EDA toolkit of opinionated macros that “compile” to vega-lite specs… as a secondary package on top of VegaLite.jl… I’m thinking it’d be nice to say something like

@ggvl(histogram(:height, fill=:lightblue)+ density(:height,fill=:green,alpha=0.25)+background(color=:grey)+title("Histogram and Density Plot of Human Heights"))

I think the best solution would be if vega-lite (the JavaScript library) added macros for histograms and densities, like they did for other common things previously. If we had that, then the existing Julia wrapper would allow us to just write your example as:

@vlplot(x=:height, background=:grey, title="Bla Bla") +
  @vlplot({:histogram, color=:lightblue}) +
  @vlplot({:density, color=:green, opacity=0.25})

Or something similar to that. I feel that is similarly concise to your suggestion? The big benefit would be that it wouldn’t require any additional “ideas” on the Julia side of things. The logic here would be that the first @vlplot call defines properties for the whole plot, and the next two @vlplot calls each become one layer in a layered plot.

Having said that, I’ve also started a new higher level Julia package, like what you suggested, see here for a discussion, and the package is QuickVega.jl (but be warned, the package is empty at this point). But there is a crucial difference to your suggestion: in my mind QuickVega.jl will not have a grammar of graphics style API, instead it will just be a “one function -> one figure” type API. But it will be possible to combine the plots that you get via that route with “normal” VegaLite.jl plots via composition.