Plots: reduce size of scatter plots, etc.?

I’m trying to visualize parameter uncertainty in results from the Turing package. For each parameter, there is a vector of realizations of the parameter. If I use the StatsPlots package, function density allows me to plot the probability density function computed from each vector. However, I’d like to display how pairs of parameters co-vary. Three possible ways to do this for parameter realizations pr1 and pr2:

  • scatter(pr1,pr2) gives a scatter plot. Darker areas in the plot correspond to higher probability of parameter pairs.
  • histogram2d(pr1,pr2,bins=N) gives a 2D histogram, where the color coding indicates regions of high probability.
  • it should also be possible to fit a function f: \mathbb{R}\times \mathbb{R} \rightarrow \mathbb{R}_0^+ to a histogram, and do a contour plot of f to illustrate how the parameters co-vary.

Three questions…

  1. In my case, each vector of parameter realizations contains on the order of 10^4 elements. When I save a scatter plot as an SVG file, the file becomes very large, and when I insert it into a LaTeX document, printing the resulting PDF crashes because of the file size.
     • Is there a way to reduce the file size (e.g., by saving the scatter plot as PDF or PNG)? What is the recommendation?
  2. If I produce a histogram, the resulting SVG file is also quite large.
     • Same question as for the scatter plot.
  3. I can always create a 2D histogram and use, say, Flux or some other tool to fit a surface to the histogram, or use some other kind of curve fitting or interpolation (e.g., second-order spline interpolation). I assume that I can then produce a contour plot using Plots.
     • What is the simplest/best way to produce such a surface function?
     • Do you think this will produce a smaller plot file?

I’m grateful for any recommendation as to best practice.

Thanks for the hints. It’s always good to have MWEs if things crash or slow down. Could you provide some examples or example data for your questions?


Here is a simple, constructed example:

# Packages
using Plots
using Distributions
#
# Distribution
μ = [1,2]
Σ = [3 1;1 1.5]
#
d = MvNormal(μ,Σ)
#
# Generating data + plotting
N = 10_000
data = rand(d,N)
scatter(data[1,:],data[2,:],ma=0.05,label="")

The resulting scatter plot is:

The two rows of the matrix data are akin to two vectors of parameter realizations from Turing. In this plot, the darker the color (due to many overlapping data pairs), the higher the probability of pairs at that location.

Alternatively, I could do a 2D histogram plot:

histogram2d(data[1,:],data[2,:],bins=round(Int,sqrt(N)),label="")  # round(Int, ...) also works when N is not a perfect square

leading to

The point in my question(s) is that when I store a scatter plot with 10^4 points, the resulting SVG file is enormous, leading to problems with printing the resulting PDF file from the LaTeX document.

The histogram plot is possibly smaller, but still large.

So: questions…

  • can I produce good-quality plots in file formats other than SVG that reduce the size of the figure file?
  • would it be better to fit a function to the histogram or do interpolation of the histogram, and plot a contour plot?
  • [Note: I could fit a Distributions.jl distribution to the data set, but that would severely limit the shape of the contours.]
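One way to get a smooth surface without assuming a parametric shape is a kernel density estimate. The following is only a minimal sketch, assuming the KernelDensity package and synthetic data in place of the Turing output; the default bandwidth is used, and the axis labels pr1 and pr2 are placeholders.

```julia
using Distributions, KernelDensity, Plots

# Synthetic stand-in for two vectors of parameter realizations
d = MvNormal([1.0, 2.0], [3.0 1.0; 1.0 1.5])
data = rand(d, 10_000)
x, y = data[1, :], data[2, :]

# Bivariate kernel density estimate on a regular grid (default bandwidth)
k = kde((x, y))

# k.density is indexed as [ix, iy]; Plots wants rows to follow y, so transpose
p = contour(k.x, k.y, k.density', fill=true, c=:YlOrBr_9,
            xlabel="pr1", ylabel="pr2")
```

Because the plot is built from a fixed grid rather than from the individual points, the size of the saved file should depend on the grid and the number of contour levels, not on the number of samples.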

Since you ask, I should add: would the GR backend produce a smaller scatter plot file than the PyPlot backend? There are lots of things to like about the GR backend (cleaner plots, the forthcoming improved support for LaTeX) – I still normally use PyPlot because of somewhat better support for LaTeX, etc., but have had problems with PyPlot lately.

What about using GR’s data shader, e.g.

using Distributions
#
# Distribution
μ = [1,2]
Σ = [3 1;1 1.5]
#
d = MvNormal(μ,Σ)
#
# Generating data + plotting
N = 1_000_000
data = rand(d,N)

using GR
shade(data[1,:],data[2,:],colormap=-8)


The clarity of the GR plot is amazing. I use Plots with the GR backend and almost never any other backend. Does GR by itself provide most of the same capabilities as GR under Plots? Is it fairly fast?


I’ve been saving these kinds of plots as PNG. You won’t be able to avoid a large filesize with vector formats for this many points.

You can save only the plot without axes, and then use pgfplots with \addplot graphics[xmin={...},xmax={...},ymin={...},ymax={...}] {file.png}
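A rough sketch of the Julia side of that approach, assuming Plots with its default backend and synthetic data. The axis limits are fixed explicitly so that the same numbers can be passed to pgfplots. Some margin may remain around the plot area, so the image edges are not guaranteed to coincide exactly with the limits; the file name, image size, and dpi are arbitrary choices.

```julia
using Plots

# Synthetic stand-in for the 10^4 parameter pairs discussed above
x, y = randn(10_000), randn(10_000)

# Fix the limits explicitly so the same numbers can be given to pgfplots
xl, yl = (-5.0, 5.0), (-5.0, 5.0)

# Draw only the markers: no axes, ticks, or legend
p = scatter(x, y; ma=0.05, msw=0, label="", legend=false,
            framestyle=:none, xlims=xl, ylims=yl, size=(600, 600), dpi=300)

outfile = joinpath(mktempdir(), "scatter.png")
savefig(p, outfile)
println("PNG size: ", filesize(outfile), " bytes")
println("pgfplots limits: xmin=$(xl[1]), xmax=$(xl[2]), ymin=$(yl[1]), ymax=$(yl[2])")
```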


Thanks for all input. I’ve tested some options. First of all, I also tested fitting a surface to the data:

d_fit = fit(MvNormal,data)   # fit once, instead of refitting on every evaluation
pdf_fit = (x,y) -> pdf(d_fit,[x,y])
#
x = range(-5,5,length=50)
y = range(-2,6,length=50)
plot(x,y,pdf_fit,st=:contour,fill=:true,c=:YlOrBr_9)

[In general, I’d not use the Normal distribution because I don’t want to assume that…] The result:

So – I saved each plot type (scatter, histogram2d, shade, contour) as svg file, pdf file, and png file.

Here are the results:
scatter: 1.8 MB (svg), 2.3 MB (pdf), 140 kB (png)
histogram2d: 55 kB (svg), 40 kB (pdf), 17 kB (png)
shade: 85 kB (svg), 74 kB (pdf), 83 kB (png)
contour: same as shade

So: either save the scatter plot as a PNG file, or use any of the other plot types.
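A comparison like this can be scripted. The following is a minimal sketch, assuming Plots with the GR backend and synthetic data; the actual numbers will depend on the backend, the plot size, and the data, so they will not necessarily match the figures above.

```julia
using Plots

# Synthetic stand-in for the 10^4 parameter pairs
x, y = randn(10_000), randn(10_000)
p = scatter(x, y; ma=0.05, label="")

dir = mktempdir()
sizes = Dict{String,Int}()
for ext in ("svg", "pdf", "png")
    file = joinpath(dir, "scatter." * ext)
    savefig(p, file)              # output format is chosen from the file extension
    sizes[ext] = filesize(file)   # size in bytes
end

for (ext, bytes) in sizes
    println(rpad(ext, 4), round(bytes / 1024, digits=1), " kB")
end
```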

Note that these plots can be very misleading because of overplotting. It is hard to have a general solution, but in most cases I use a 2D KDE plot with HPD contour lines, see
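As a rough illustration of the idea of KDE with highest-density contour levels (this is not the code referred to in this post), the density levels can be approximated from quantiles of the estimated density at the sample points. The sketch assumes the KernelDensity package and synthetic data; the coverage values 0.5 and 0.9 are arbitrary.

```julia
using Distributions, KernelDensity, Statistics

# Synthetic stand-in for two vectors of parameter realizations
d = MvNormal([1.0, 2.0], [3.0 1.0; 1.0 1.5])
data = rand(d, 10_000)
x, y = data[1, :], data[2, :]

# Smooth density estimate plus an interpolant that can be evaluated anywhere
ik = InterpKDE(kde((x, y)))

# Estimated density at every sample point
dens = pdf.(Ref(ik), x, y)

# The region {density >= quantile(dens, 1 - c)} holds about a fraction c of the samples
coverages = [0.5, 0.9]
levels = [quantile(dens, 1 - c) for c in coverages]

for (c, l) in zip(coverages, levels)
    println("coverage ≈ ", c, "  density level = ", l,
            "  fraction inside = ", mean(dens .>= l))
end
```

The resulting levels could then be passed, sorted in increasing order, as the levels attribute of a contour plot of the density on a grid, so that each contour line encloses roughly the stated share of the samples.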


@Tamas_Papp: I tested your code. I tried to wrap it into a quick-and-dirty function:

# Pkg.add(PackageSpec(url="https://github.com/tpapp/HighestDensityRegions.jl"))
# Pkg.add("KernelDensity")
using KernelDensity, HighestDensityRegions
#
function pdf_contour(data;gridlength=100, quantiles=0.05:0.1:0.95)
    x, y = data[1,:], data[2,:]
    k = kde((x,y))
    ik = InterpKDE(k)
    # Density at each sample point, used to compute the HDR thresholds
    probs = pdf.(Ref(ik), x, y)
    # Note: thresholds is computed but not returned, so the plot below uses default contour levels
    thresholds = hdr_thresholds(quantiles, probs)
    grid1 = range(minimum(x),maximum(x),length=gridlength)
    grid2 = range(minimum(y),maximum(y),length=gridlength)
    # Plots expects z with rows along grid2 (y) and columns along grid1 (x)
    return grid1,grid2,pdf.(Ref(ik),grid1',grid2)
end
#
plot(pdf_contour(data;gridlength=100,quantiles=0.05:0.3:0.95)...,st=:contour,fill=true,c=:YlOrBr_9)
plot!(xlim=(-5,5),ylim=(-1,6))

Result – beautiful:

File sizes (new result: HDR - HighestDensityRegions):
scatter: 1.8 MB (svg), 2.3 MB (pdf), 140 kB (png)
histogram2d: 55 kB (svg), 40 kB (pdf), 17 kB (png)
shade: 85 kB (svg), 74 kB (pdf), 83 kB (png)
contour: same as shade
HDR: 123 kB (svg), 102 kB (pdf), 83 kB (png)

I’m just a “hacker” when it comes to Julia programming – it would be fantastic if something like my function pdf_contour (with a better name, and generalized to 1D and 2D cases) were available in your package :smiley:


Please open an issue so that it is not forgotten.