Failing to plot correlograms in Julia: Makie vs. AoG vs StatsPlots

@rafael.guerra Thanks for stimulating me to use Plotly!

GOOD: It worked fine and it is very fast. Best result so far.

NOT so GOOD: this is a general but minor problem I find with Plotly: the related labextension seems to get out of sync with JupyterLab, which causes issues at times. I used VS Code for this plot.

This is my plotting function (it needs a bit of refactoring as there are too many duplications in the parameter settings):

using Plots  # provides histogram, scatter, font and plot

function correlogram(df)
    rows = cols = size(df, 2)
    dcols = names(df)
    plots = []
    # Panels are pushed row by row, matching layout = (rows, cols) below.
    # Scatter panels put the column variable on x and the row variable on y,
    # so that the data agree with the axis labels.
    for row = 1:rows, col = 1:cols
        if col == 1 && row == 1
            # top-left histogram: y label only
            push!(
                plots,
                histogram(df[:, row], bins = 10, xtickfont = font(5), ytickfont = font(5), legend = false, ylabel = dcols[row]))
        elseif row == col && row < cols && row > 1
            # inner diagonal histograms: no labels
            push!(
                plots,
                histogram(df[:, row], bins = 10, xtickfont = font(5), ytickfont = font(5), legend = false))
        elseif row < cols && col == 1
            # first column (not the last row): scatter with y label
            push!(
                plots,
                scatter(df[:, col], df[:, row], xtickfont = font(5), ytickfont = font(5), legend = false, markersize = 1, alpha = 0.3, smooth = true, ylabel = dcols[row],
                    linewidth = 3, linecolor = :red))
        elseif row < cols && col > 1
            # remaining off-diagonal panels above the last row: no labels
            push!(
                plots,
                scatter(df[:, col], df[:, row], xtickfont = font(5), ytickfont = font(5), legend = false, markersize = 1, alpha = 0.3, smooth = true,
                    linewidth = 3, linecolor = :red))
        elseif col == 1 && row == cols
            # bottom-left corner: both labels
            push!(
                plots,
                scatter(df[:, col], df[:, row], xtickfont = font(5), ytickfont = font(5), legend = false, markersize = 1, alpha = 0.3, smooth = true, ylabel = dcols[row], xlabel = dcols[col],
                    linewidth = 3, linecolor = :red))
        elseif row > col && row == cols
            # rest of the last row: x label only
            push!(
                plots,
                scatter(df[:, col], df[:, row], xtickfont = font(5), ytickfont = font(5), legend = false, markersize = 1, alpha = 0.3, smooth = true, xlabel = dcols[col],
                    linewidth = 3, linecolor = :red))
        else
            # bottom-right histogram: x label only
            push!(
                plots,
                histogram(df[:, row], bins = 10, xtickfont = font(5), ytickfont = font(5), legend = false, xlabel = dcols[col]))
        end
    end
    plot(plots..., size = (1200, 1000), layout = (rows, cols))
end

This is the plot (not sure why, but the column names disappear when pasting the plot here).
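
The duplicated parameter settings mentioned above could be factored out roughly as follows. This is only a sketch, not the poster's code: `panel_labels` and `correlogram_compact` are hypothetical names, Plots.jl is assumed as the plotting package, and scatter panels put the column variable on the x axis so that data and labels agree.

```julia
using Plots  # assumed plotting package; provides histogram, scatter, font and plot

# Hypothetical helper: axis labels for the panel at (row, col) of an n×n grid.
# Only the first column gets y labels and only the last row gets x labels.
function panel_labels(row, col, n, colnames)
    xlabel = row == n ? colnames[col] : ""
    ylabel = col == 1 ? colnames[row] : ""
    return (; xlabel, ylabel)
end

function correlogram_compact(df; bins = 10, figsize = (1200, 1000))
    n = size(df, 2)
    colnames = names(df)
    # attributes shared by every panel
    common = (xtickfont = font(5), ytickfont = font(5), legend = false)
    panels = []
    for row in 1:n, col in 1:n  # row by row, matching layout = (n, n)
        labels = panel_labels(row, col, n, colnames)
        if row == col
            # diagonal: histogram of a single variable
            push!(panels, histogram(df[:, row]; bins = bins, common..., labels...))
        else
            # off-diagonal: scatter with a fitted line (smooth = true)
            push!(panels, scatter(df[:, col], df[:, row]; markersize = 1, alpha = 0.3,
                smooth = true, linewidth = 3, linecolor = :red, common..., labels...))
        end
    end
    return plot(panels...; size = figsize, layout = (n, n))
end
```

Calling `correlogram_compact(df)` on a DataFrame with numeric columns should give the same grid of panels, with `bins` and `figsize` exposed as keyword arguments.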

Hi there! I'm the author of PairPlots.jl. Thanks for giving it a try.

A couple of things:
First, you don't need to combine hexbin with scatter to make a scatter plot. The following should work just fine:

using CairoMakie, PairPlots

N = 10000
α = [2randn(N÷2) .+ 6; randn(N÷2)]
β = [3randn(N÷2); 2randn(N÷2)]
γ = randn(N)
δ = β .+ 0.6randn(N)

table = df = (; α, β, γ, δ)

pairplot(
    table => (
        PairPlots.Scatter(),
        PairPlots.MarginHist(),
    ),
    fullgrid=true
)

Second, I would be happy to add a trend line feature. Should be able to merge it by the end of the day.

Let me know if there's anything else you might need from the package!

I just added support to PairPlots.jl for plotting a linear trend line:

pairplot(
    table => (
        # choose what kind of series you want in body and along diagonal
        PairPlots.Scatter(),
        PairPlots.MarginHist(),
        # Add trend line
        PairPlots.TrendLine(color=:red),
    ),
    fullgrid=true
)

It just fits a linear trend line for now, but I can imagine extending this in future. We should probably add a package extension to support GLM directly.

You can test it out by running ] add PairPlots#main. I'll tag a new release once the docs are built.

Let me know if there's anything else I can add.

You can use the ablines! plotting function once you have the parameters.

One final question @enzomar: when you say histogram binning is a complex affair, would you mind being more specific about what you would like?

As a start you can pass bins=42 to change the number of bins along each axis.

For ultimate control, you can pass your own prepare_hist function to a PairPlots.Hist series that does whatever you want. It must just accept a vector of x-values, a vector of y-values, and an nbins argument (which it can ignore), and return a vector of x bin centres, y bin centres, and a matrix of weights.

You can use StatsBase.Histogram to calculate all of these, or any other package.
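
A minimal sketch of such a binning function, assuming StatsBase.jl is installed. It covers only the binning step with the signature described above; `bin_centres` is a hypothetical helper name, and how the function is handed to PairPlots.Hist is not shown here.

```julia
using StatsBase  # provides fit and Histogram

# Hypothetical helper: midpoints of consecutive bin edges
bin_centres(edges) = collect((edges[1:end-1] .+ edges[2:end]) ./ 2)

# Accepts x values, y values and a bin count; returns x bin centres,
# y bin centres and a matrix of weights (rows follow x, columns follow y).
function prepare_hist(xs, ys, nbins)
    # StatsBase treats nbins as an approximate target per axis
    h = fit(Histogram, (xs, ys); nbins = nbins)
    x_edges, y_edges = h.edges
    return bin_centres(x_edges), bin_centres(y_edges), h.weights
end
```

For example, `xc, yc, w = prepare_hist(randn(1000), randn(1000), 10)` gives a weight matrix `w` with `size(w) == (length(xc), length(yc))`.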

Putting these together for your Boston housing data example:

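using CairoMakie, PairPlots
using DataFrames
using MLDatasets: BostonHousing

# Load the dataset; dataset.features holds the feature columns
dataset = BostonHousing()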
pairplot(
    dataset.features => (
        PairPlots.Scatter(
            color=:transparent,
            strokecolor=:blue,
            strokewidth=1,
            marker=:diamond,
            markersize=8,
        ),
        PairPlots.MarginHist(color=:darkblue),
        PairPlots.MarginConfidenceLimits(),
        PairPlots.TrendLine(color=:red),
    ),
)

@sefffal Many thanks: fantastic support!!!

I will definitely try this tomorrow (late at night here).

I like the new trendline as well as the type of scatterplot you show here (the HexBin isn't looking as good for some reason?).

A clarification on the histogram binning. My understanding of the docs is that I can have the type of binning I want, but:

You can optionally pass a function to override how the histogram is calculated. It should have the signature: prepare_hist(xs, ys, nbins) and return a vector of horizontal bin centers, vertical bin centers, and a matrix of weights.

That's great, but the point of a correlogram (or an extended pairplot, if you prefer) is to get a first, global view of the variables in the dataset and some ideas about their behaviour, so too many bins or preparing an ad hoc vector is not appropriate.

Ideally I would just like to specify, say, bins=10 for all variables, like Plots and Plotly allow me to do (and of course, if at some point I need to study some anomaly in more detail, the possibility of creating more complex custom binnings is great!).

Ah okay understood @enzomar. Just pass PairPlots.Hist(bins=10) or PairPlots.MarginHist(bins=10) before PairPlots.Scatter.

And a note to future readers, you can accomplish a similar effect with the smoothed Contour, Contourf, and MarginDensity by adjusting the bandwidth parameter (see the docs for details).
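
The `bins=10` suggestion above can be sketched as follows. The small table is made up for illustration; the `bins` keyword and the ordering (histogram series before `PairPlots.Scatter`) are taken from the reply above.

```julia
using CairoMakie, PairPlots

# Small synthetic table, used only for illustration
N = 1000
table = (; a = randn(N), b = 2 .* randn(N) .+ 1, c = rand(N))

fig = pairplot(
    table => (
        PairPlots.MarginHist(bins = 10),  # same bin count for every 1D histogram
        PairPlots.Scatter(),
    ),
)
```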

@sefffal it works great!

Just a minor question: how do I change size e.g. in jupyterlab?

I tried

fig = Figure(size=(1200,1000))
pairplot(fig[13,13], dataset.features  => (... ))     # dataset.features has 13 columns
fig

But it didn't work.

Great, glad to hear it!
The size argument for Makie does change the resolution but I bet JupyterLab is scaling the resulting image to some preset width.
Maybe you can just save it and open it in another tab?

save("corner.png", fig, px_per_unit=4)

If that's not convenient, then maybe someone who knows Jupyter better can chime in.
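
A minimal sketch of the save-to-file route, assuming a Makie version that accepts the `size` keyword used in the question. The table is made up for illustration. One thing that may also be worth checking: the index in `fig[1, 1]` addresses a layout cell of the figure, so it does not need to match the number of columns in the data.

```julia
using CairoMakie, PairPlots

# Made-up data, for illustration only
table = (; a = randn(500), b = randn(500), c = randn(500))

fig = Figure(size = (1200, 1000))
# Place the whole pair plot in the first layout cell of the figure
pairplot(fig[1, 1], table => (PairPlots.Scatter(), PairPlots.MarginHist()))

# Write a high-resolution image that can be opened outside the notebook
save("corner.png", fig, px_per_unit = 4)
```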

@enzomar if this has resolved your issue, I would humbly suggest you mark one of the replies as the "solution" so that future readers can jump to the most relevant reply.
Thanks!
