Failing to plot correlograms in Julia: Makie vs. AoG vs StatsPlots

@rafael.guerra Thanks for prompting me to try Plotly!

GOOD: It worked fine and it is very fast. Best result so far.

NOT so GOOD: this is a general but minor problem I find with Plotly: the corresponding labextension seems to get out of sync with JupyterLab, which at times causes issues. I used VS Code for this plot.

This is my plotting function (it needs a bit of refactoring as there are too many duplications in the parameter settings):

using Plots  # provides histogram, scatter, font and plot; select a backend first, e.g. plotly()

function correlogram(df)
    rows = cols = size(df, 2)
    dcols = names(df)
    plots = []
    # Cell (row, col) shows column `col` on the x-axis and column `row` on the y-axis,
    # so that the labels on the left and bottom edges match the data.
    for row = 1:rows, col = 1:cols
        if col == 1 && row == 1
            # top-left corner: histogram with a y label
            push!(
                plots,
                histogram(df[:, row], bins = 10, xtickfont = font(5), ytickfont = font(5), legend = false, ylabel = dcols[row]))
        elseif row == col && row < cols && row > 1
            # inner diagonal: histogram without labels
            push!(
                plots,
                histogram(df[:, row], bins = 10, xtickfont = font(5), ytickfont = font(5), legend = false))
        elseif row < cols && col == 1
            # first column: scatter with a y label
            push!(
                plots,
                scatter(df[:, col], df[:, row], xtickfont = font(5), ytickfont = font(5), legend = false, markersize = 1, alpha = 0.3, smooth = true, ylabel = dcols[row],
                linewidth = 3, linecolor = :red))
        elseif row < cols && col > 1
            # inner cells: scatter without labels
            push!(
                plots,
                scatter(df[:, col], df[:, row], xtickfont = font(5), ytickfont = font(5), legend = false, markersize = 1, alpha = 0.3, smooth = true,
                linewidth = 3, linecolor = :red))
        elseif col == 1 && row == cols
            # bottom-left corner: scatter with both labels
            push!(
                plots,
                scatter(df[:, col], df[:, row], xtickfont = font(5), ytickfont = font(5), legend = false, markersize = 1, alpha = 0.3, smooth = true, ylabel = dcols[row], xlabel = dcols[col],
                linewidth = 3, linecolor = :red))
        elseif row > col && row == cols
            # last row: scatter with an x label
            push!(
                plots,
                scatter(df[:, col], df[:, row], xtickfont = font(5), ytickfont = font(5), legend = false, markersize = 1, alpha = 0.3, smooth = true, xlabel = dcols[col],
                linewidth = 3, linecolor = :red))
        else
            # bottom-right corner: histogram with an x label
            push!(
                plots,
                histogram(df[:, row], bins = 10, xtickfont = font(5), ytickfont = font(5), legend = false, xlabel = dcols[col]))
        end
    end
    plot(plots..., size = (1200, 1000), layout = (rows, cols))
end

This is the plot (not sure why, but the column names disappear when pasting the plot here).

Hi there! I’m the author of PairPlots.jl. Thanks for giving it a try.

A couple of things:
First, you don’t need to combine hexbin with scatter to make a scatter plot. The following should work just fine:

using CairoMakie, PairPlots

N = 10000
α = [2randn(N÷2) .+ 6; randn(N÷2)]
β = [3randn(N÷2); 2randn(N÷2)]
γ = randn(N)
δ = β .+ 0.6randn(N)

table = (; α, β, γ, δ)

pairplot(
    table => (
        PairPlots.Scatter(),
        PairPlots.MarginHist(),
    ),
    fullgrid=true
)

Second, I would be happy to add a trend line feature. Should be able to merge it by the end of the day.

Let me know if there’s anything else you might need from the package!

I just added support to PairPlots.jl for plotting a linear trend line:

pairplot(
    table => (
        # choose what kind of series you want in body and along diagonal
        PairPlots.Scatter(),
        PairPlots.MarginHist(),
        # Add trend line
        PairPlots.TrendLine(color=:red),
    ),
    fullgrid=true
)

It just fits a linear trend line for now, but I can imagine extending this in future. We should probably add a package extension to support GLM directly.

You can test it out by running ] add PairPlots#main. I’ll tag a new release once the docs are built.

Let me know if there’s anything else I can add.

You can use the ablines! plotting function once you have the parameters (intercept and slope).
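For example, here is a minimal sketch of that idea, independent of PairPlots: fit a straight line by ordinary least squares and draw it with Makie's ablines!. The data and variable names are made up for illustration.

```julia
using CairoMakie

# Made-up example data with a roughly linear relationship
x = randn(200)
y = x .+ 0.6 .* randn(200)

# Ordinary least-squares fit of y ≈ intercept + slope * x
intercept, slope = [ones(length(x)) x] \ y

fig = Figure()
ax = Axis(fig[1, 1])
scatter!(ax, x, y; markersize = 4)
# ablines! takes the intercept(s) first, then the slope(s)
ablines!(ax, intercept, slope; color = :red)
fig
```

The same two numbers could equally come from a GLM fit; ablines! only needs the intercept and the slope.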

One final question @enzomar: when you say histogram binning is a complex affair, could you be more specific about what you would like?

As a start you can pass bins=42 to change the number of bins along each axis.

For ultimate control, you can pass your own prepare_hist function to a PairPlots.Hist series that does whatever you want. It must just accept a vector of x-values, a vector of y-values, and an nbins argument (which it can ignore), and return a vector of x bin centres, y bin centres, and a matrix of weights.

You can use StatsBase.Histogram to calculate all of these, or any other package.
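As a rough sketch of what such a function could look like with StatsBase (my_prepare_hist is a hypothetical name, and the orientation of the weights matrix is an assumption worth checking against the PairPlots docs, as is the exact way of handing the function to PairPlots.Hist):

```julia
using StatsBase

# Hypothetical helper following the prepare_hist(xs, ys, nbins) contract described above.
function my_prepare_hist(xs, ys, nbins)
    # 2D histogram; StatsBase treats nbins as a hint and may round to "nice" edges
    h = fit(Histogram, (xs, ys); nbins = nbins)
    xedges, yedges = h.edges
    # Bin centres are the midpoints of consecutive edges
    xcentres = collect((xedges[1:end-1] .+ xedges[2:end]) ./ 2)
    ycentres = collect((yedges[1:end-1] .+ yedges[2:end]) ./ 2)
    # h.weights has one row per x bin and one column per y bin
    return xcentres, ycentres, h.weights
end

xc, yc, w = my_prepare_hist(randn(1000), randn(1000), 10)
```

Here w has size (length(xc), length(yc)), and its entries add up to the number of points.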

Putting these together for your Boston housing data example:

using CairoMakie, PairPlots
using DataFrames
using MLDatasets: BostonHousing

dataset = BostonHousing()  # dataset.features is a DataFrame with 13 columns

pairplot(
    dataset.features => (
        PairPlots.Scatter(
            color=:transparent,
            strokecolor=:blue,
            strokewidth=1,
            marker=:diamond,
            markersize=8,
        ),
        PairPlots.MarginHist(color=:darkblue),
        PairPlots.MarginConfidenceLimits(),
        PairPlots.TrendLine(color=:red),
    ),
)

@sefffal Many thanks: fantastic support!!!

I will definitely try this tomorrow (late at night here).

I like the new trendline as well as the type of scatterplot you show here (the HexBin isn’t looking as good for some reason?).

A clarification on the histogram binning. My understanding of the docs is that I can have the type of binning I want, but:

You can optionally pass a function to override how the histogram is calculated. It should have the signature: prepare_hist(xs, ys, nbins) and return a vector of horizontal bin centers, vertical bin centers, and a matrix of weights.

That’s great, but a correlogram (or an extended pairplot, if you prefer) is meant to give a first, global view of the variables in the dataset and some idea of their behaviour, so using too many bins or preparing an ad hoc vector is not appropriate.

Ideally, I would just like to specify, say, bins=10 for all variables, as Plots and Plotly allow me to do. (Of course, if at some point I need to study some anomaly in more detail, the possibility to create more complex custom binnings is great!)

Ah okay understood @enzomar. Just pass PairPlots.Hist(bins=10) or PairPlots.MarginHist(bins=10) before PairPlots.Scatter.
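Something along these lines should do it (a small sketch with made-up data; the column names a, b and c are just placeholders):

```julia
using CairoMakie, PairPlots

# Placeholder table; any Tables.jl-compatible table or DataFrame should work the same way
table = (; a = randn(1000), b = randn(1000), c = randn(1000))

fig = pairplot(
    table => (
        # same number of bins for every variable on the diagonal
        PairPlots.MarginHist(bins = 10),
        PairPlots.Scatter(),
    ),
)
```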

A note for future readers: you can accomplish a similar effect with the smoothed Contour, Contourf, and MarginDensity series by adjusting the bandwidth parameter (see the docs for details).

@sefffal it works great!

Just a minor question: how do I change size e.g. in jupyterlab?

I tried

fig = Figure(size=(1200,1000))
pairplot(fig[13,13], dataset.features  => (... ))     # dataset.features has 13 columns
fig

But it didn’t work.

Great, glad to hear it!
The size argument for Makie does change the resolution but I bet JupyterLab is scaling the resulting image to some preset width.
Maybe you can just save it and open it in another tab?

save("corner.png", fig, px_per_unit=4)

If that’s not convenient, then maybe someone who knows Jupyter better can chime in.
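One more thing that might be worth trying: fig[13,13] refers to row 13, column 13 of the figure's layout, so the whole pair plot ends up in a single far-away cell. Here is a sketch that puts it in fig[1, 1] instead, using a made-up table in place of dataset.features. Whether this also fixes the display size inside JupyterLab is an assumption I have not verified.

```julia
using CairoMakie, PairPlots

# Placeholder data standing in for dataset.features
table = (; a = randn(500), b = randn(500), c = randn(500))

fig = Figure(size = (1200, 1000))
# Put the whole pair plot into the first (and only) layout cell of the figure
pairplot(fig[1, 1], table => (PairPlots.Scatter(), PairPlots.MarginHist(bins = 10)))

# Saving at a higher pixel density sidesteps any scaling done by the notebook
save("corner.png", fig, px_per_unit = 4)
fig
```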

@enzomar if this has resolved your issue, I would humbly suggest you mark one of the replies as the “solution” so that future readers can jump to the most relevant reply.
Thanks!
