Failing to plot correlograms in Julia: Makie vs. AoG vs StatsPlots

I have been trying every package I can find to plot a classic correlation plot (a scatterplot of each variable against every other, with a density histogram on the main diagonal). These are sometimes called correlograms or pair plots. I come from R and Python, where it is dead easy to get such a plot in no time.

These have been my attempts so far:

StatsPlots corrplot
@df df corrplot(cols(1:11), grid = false, fc=:thermal)
I was happy to see a macro for this. It works OK, but Julia crashes or hangs unresponsively with more than 11 columns (issue reported on StatsPlots with a reproducible example; I use the Boston housing dataset).
I tried converting the df to a Matrix, but the plot still crashes (with fewer columns the result looks slightly different from the df version. Who knows why?).
Also, I'm not sure what the heatmaps in the upper part of the grid are, but once colored properly they look nice.

AlgebraOfGraphics
This is the farthest I have got so far. Maybe my experience with ggplot helps. I got my plot with all 13 columns and no crash. The Pumas documentation on AoG was a great help as well (it should be publicised more!).
The only problem: I was not able to find out how to get a density histogram on the main diagonal.
HELP PLEASE!!!
This is the crucial part of the plot:

plt = data(df) * (visual(Scatter; color=:dodgerblue) + linear() * visual(; color=:red)) * mapping(cols, permutedims(cols), col=dims(1), row=dims(2))
fg = draw(
    plt,
    figure=(figure_padding=0, size=(1500, 1500))
)

Makie
I found an old version of the code here on Julia Discourse. Sadly, Makie has changed its syntax a lot since then (probably more than once), and I couldn’t find any good documentation on how to do this type of plot, nor any docs on migrating from older versions (what the heck is LAxis?!? Is it dead or still used??).
I reproduce the plot code here in case somebody can help bring it up to date.
HELP PLEASE!!!

function  pairplot(df)
    dim = size(df, 2)-1

    scene, layout = layoutscene(30, resolution = (900, 900))
    axs = layout[1:dim, 1:dim] = [LAxis(scene) for i in 1:dim^2]
    x = 0
    for i in 1:dim, j in 1:dim  
	      
	if i == j
	    x+=1
	    plt = plot!(axs[x],Position.stack, histogram, Data(df), Group(:class), df[:, i])
	else
	    x+=1
	    plt = Makie.scatter!(axs[x], Data(df), Group(:class), df[:,j], df[:,i])    
	end
    end
    scene
end
(from https://discourse.julialang.org/t/makie-pairplot/39298/3)

LAxis comes from MakieLayout.jl. Here are the docs: LAxis · MakieLayout.jl

Hope this helps, I am no Makie.jl expert, I only use (GL)Makie for interactive 3D stuff :wink:

LAxis was the old name for Axis before MakieLayout got fully integrated.

Check out PairPlots.jl, which is based on Makie: GitHub - sefffal/PairPlots.jl: Beautiful and flexible vizualizations of high dimensional data

Try with Gnuplot.jl:

using RDatasets, Gnuplot
df = dataset("datasets", "iris")[:, 1:4]

@gp "set multiplot layout $(ncol(df)), $(ncol(df)) margins 0.1, 0.9, 0.15, 0.9 spacing 0.01 columnsfirst upward" :-
id = 1
for ix in 1:ncol(df)
    for iy in 1:ncol(df)
        (iy == 1)  &&  (@gp :- id xlab=names(df)[ix] "set xtics format '% h'" :-)
        (ix == 1)  &&  (@gp :- id ylab=names(df)[iy] "set ytics format '% h'" :-)

        xr = [extrema(df[:, ix])...]
        yr = [extrema(df[:, iy])...]
        if ix == iy
            @gp :- id xr=xr yr=[NaN,NaN] hist(df[:, ix], range=xr, nbins=5) :-
        else
            @gp :- id xr=xr yr=yr df[:, ix] df[:, iy] "w p notit" :-
        end
        id += 1
        @gp :- id xlab="" ylab="" "set xtics format ''" "set ytics format ''" :-
    end
end
@gp

Another option
https://www.generic-mapping-tools.org/GMTjl_doc/examples/plotting_functions/05_1_stats/#example_6265284708922005600

@joa-quim thanks for the suggestion. Sadly GMT crashes with the Boston Housing dataset (but works with IRIS).

Issue reported here: cornerplot fails with Boston Housing dataset · Issue #1333 · GenericMappingTools/GMT.jl · GitHub

@gcalderone thanks for the suggestion!

Good news: it works with the Boston Housing dataset (i.e., it does not crash!). It is also very fast.

The graphics are a bit primitive, but I’ll try to clean them up a bit.

This should be a wake-up call of sorts for the community: so far it is the only plotting package that works fully (i.e. including density plots on the main diagonal), yet it is a package dating back to 1986, according to Wikipedia!

Nice and rich visualizations would certainly help to attract R and Python users.

I added a new example in the documentation.

I typically use the following global settings to obtain a reasonably nice output:

Gnuplot.options.term = "wxt size 700,400";
push!(Gnuplot.options.init, Gnuplot.linetypes(:Set1_5, lw=1.5, ps=1.5));

1986, yes, but as you can see the latest version was released two months ago, and they are working on the new version 6.1.

I agree it looks a bit old-fashioned, but it is also extremely powerful and easy to use.
IMHO it is not easy to catch up with 40 years of development… :wink:

Yeah I think they might have overlooked my link :slight_smile:

Not a crash, but a corner case in the parser of the GMT syntax that led to a wrong decision about how to interpret the generated command. That’s what this message means:

]: Cannot tell if -T1 -W0.1 is new or deprecated syntax; selected deprecated.
histogram [ERROR]: Unrecognized option -T

Fixed in master, and now it produces this. It is still not perfect because of the number of subplots. I wonder what the best solution is in this type of case.

Note: the Z’s appear because M = Matrix(df[:, 1:end-1]) lost the column names of the original data.

Thanks @jules for the link. I’ll try it and report.

@joa-quim I got the same result, which is good. Oddly, I got it as a saved image, while the same code for IRIS rendered inline in my JupyterLab notebook.

Is there a reason why the same code renders to a saved image in one case and inline in the other?
Moreover, I use a Matrix because that is what the GMT function expects, hence there are no real variable names, only dummy ones. Do you have any suggestion on how to get the actual dataframe column names onto the outside of the plot?

A quick answer before Christmas dinner. I need to commit further changes to make dealing with names easier.
Try
GMT.df2ds(df) instead of Matrix

An update:

GOOD: I built a good correlogram with Plots and I’m quite satisfied with it: it is very close to optimal.

BAD: Again (and perhaps not surprisingly, as it was happening with StatsPlots) it crashes with more than 11 variables / columns.

I just reported it on GitHub here: [BUG] Plots crashes and julia hangs plotting correlogram with >12 columns on Boston housing dataset · Issue #4856 · JuliaPlots/Plots.jl · GitHub

Next I will test PairPlots (as it is built on Makie I’m moderately optimistic).

If there’s a good Makie chap around, I would still like to test this in native Makie, but I need HELP to migrate the code I posted above.

Many thanks in advance.

In case it helps, for a large number of subplots, the plotly() backend is best. See example here.

The thread you got your non-working Makie version from had other, newer examples (plus the PairPlots.jl link that I posted here as well) but anyway, here’s the example from above modified a bit for current Makie. This doesn’t do any grouping, though, which is why something like PairPlots saves you work.

using CairoMakie

function pairplot(df)
    # skip the last column (the `class` label in the original example)
    dim = size(df, 2) - 1

    f = Figure(size = (900, 900), fontsize = 10)

    # don't update layout with every new axis, that's slow
    with_updates_suspended(f.layout) do
        axs = [Axis(f[i, j]) for i in 1:dim, j in 1:dim]

        # row i, column j: variable j on the x axis, variable i on the y axis
        for i in 1:dim, j in 1:dim
            if i == j
                hist!(axs[i, j], df[:, i])
            else
                scatter!(axs[i, j], df[:, j], df[:, i])
            end
        end
    end

    f
end


df = randn(100, 10)

pairplot(df)

PairPlots.jl, cited several times above, seems very capable indeed. I’d recommend looking at its announcement thread to get a better idea of its potential.

PairPlots!

I eventually got around to testing PairPlots.

GOOD:

  • It doesn’t crash with the full 14 columns of the Boston housing dataset!
  • Prints the column names on the edges: nice!
  • Plots Truth lines out-of-the-box

NOT so GOOD:

  • I couldn’t find an out-of-the-box way to get a trend line (a classic linear-regression line like in AoG or Plots).
  • It seems that histogram binning is a complex affair: possible, but not out-of-the-box.
  • Scatter plot: requires HexBin + Scatter (not entirely sure why it isn’t just one function?)
pairplot(
    df => (
        PairPlots.HexBin(colormap=Makie.cgrad([:transparent, :green])),
        PairPlots.Scatter(), 
        PairPlots.MarginHist(color=:blue),
        PairPlots.MarginConfidenceLimits(),
    )
)

Thanks @jules this looks great!

I think I should be able to add column names programmatically.
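
For the column names, a minimal sketch of what I have in mind, assuming a recent Makie (the same `Figure(size = ...)` syntax as the example above). The function name `labelled_pairplot` and its `labels` argument are my own hypothetical names, not part of Makie; the labels are set on every axis but only shown on the bottom row and the left column:

```julia
using CairoMakie

# Hypothetical helper: `data` is a numeric matrix, `labels` holds one name per column.
function labelled_pairplot(data::AbstractMatrix, labels::Vector{String})
    n = size(data, 2)
    f = Figure(size = (900, 900), fontsize = 10)
    for i in 1:n, j in 1:n
        # row i, column j: variable j on the x axis, variable i on the y axis
        ax = Axis(f[i, j];
                  xlabel = labels[j], ylabel = labels[i],
                  xlabelvisible = i == n,   # x labels only on the bottom row
                  ylabelvisible = j == 1)   # y labels only on the left column
        if i == j
            hist!(ax, data[:, i])
        else
            scatter!(ax, data[:, j], data[:, i]; markersize = 4)
        end
    end
    return f
end

fig = labelled_pairplot(randn(50, 3), ["a", "b", "c"])
```

With a DataFrame, something like `labelled_pairplot(Matrix(df), names(df))` should work, provided all columns are numeric.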

The only issue I can see is that I couldn’t find in Makie a way to add a trend line out-of-the-box (perhaps not surprisingly the same occurs with PairPlots).
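
In the meantime, a workaround sketch: compute the least-squares line by hand with the Statistics standard library and draw it on top of each scatter. The helper name `trendline` is my own (hypothetical), not a Makie or PairPlots function:

```julia
using Statistics

# Hypothetical helper: ordinary least-squares fit of y ≈ intercept + slope * x.
function trendline(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    length(x) == length(y) || throw(ArgumentError("x and y must have the same length"))
    slope = cov(x, y) / var(x)              # slope of the regression of y on x
    intercept = mean(y) - slope * mean(x)   # the line passes through the means
    return (intercept = intercept, slope = slope)
end

# Tiny check on exactly linear data: y = 2x + 1
x = collect(1.0:10.0)
y = 2.0 .* x .+ 1.0
fit = trendline(x, y)
```

As far as I can tell, in Makie the result can then be drawn after `scatter!` with `ablines!(ax, fit.intercept, fit.slope; color = :red)`.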