Scatterplot readability

lrnv · December 30, 2021, 11:18am

Hey,

I have trouble finding an automatic way to keep a scatterplot readable when the number of points varies. See the following code:

using Distributions, StatsBase, Plots
data = rand(LogNormal(0,1), (2,1_000_000))
data[2,:] .+= rand(1_000_000) .* data[1,:].^2
# get normalised ranks : 
pseudos(sample) = [ordinalrank(sample[i,:])./(size(sample,2)+1) for i in 1:size(sample,1)]

make_plot(dat,N; kwargs...) = scatter(pseudos(dat[:,1:N])...;kwargs...)


make_plot(data,100) # OK
make_plot(data,1000) # Still OK
make_plot(data,10000) # Unreadable
make_plot(data,10000;markersize=1,markeralpha = 0.5) # Not readable either
make_plot(data,10000;markersize=3,markeralpha = 0.5) # Somewhat OK

make_plot(data,100000) # Lol
make_plot(data,100000;markersize=1,markeralpha = 0.5) # A lot better. 

make_plot(data,1000000;markersize=1,markeralpha = 0.5) # Not OK
make_plot(data,1000000;markersize=1,markeralpha = 0.1) # Beautiful.

Can we make this process automatic, so that the plot is readable whatever N is? I particularly like the last one, but I do need something automatic.

TheCedarPrince · December 30, 2021, 3:01pm

Hey @lrnv

Looking at the code makes me think this may need to be handled on a problem-by-problem basis.
By that I mean that when I make plots, to get a plot to “look good” (subjectively speaking) I sometimes have to tweak it manually for the task at hand (I use Makie.jl and PyPlot.jl).
However, perhaps you could compute markersize and markeralpha from the number of points N inside your make_plot function definition.
Something like:

alpha = N / 20000
size = N / 20000

I was just using 20000 as an estimate based on when N = 10000 and markeralpha = 0.5.
Perhaps this could be a start for creating an automatic scaling for you.
Just a thought.
Cheers!

~ tcp
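
One way to turn the ratio idea into something usable is to let the opacity shrink as N grows and clamp it to a sensible range. The following is a minimal sketch, assuming inverse proportionality; the helper name auto_alpha, the pivot of 5000 (chosen only so that N = 10000 gives 0.5) and the clamping bounds are illustrative assumptions that would need tuning against real plots:

```julia
# Hypothetical helper: opacity inversely proportional to the number of points,
# clamped to [lo, hi]. The pivot 5000 is an assumption chosen so that
# auto_alpha(10_000) == 0.5; lo and hi are arbitrary illustrative bounds.
auto_alpha(N; pivot = 5_000, lo = 0.02, hi = 1.0) = clamp(pivot / N, lo, hi)

for N in (100, 10_000, 1_000_000)
    println("N = ", N, "  markeralpha = ", auto_alpha(N))
end
```

It could then be passed along as make_plot(data, N; markersize = 1, markeralpha = auto_alpha(N)).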

lrnv · December 30, 2021, 3:04pm

That is exactly what I’m doing right now, but it’s not really successful yet. Maybe, as you pointed out, this is too problem-specific to find a good ‘rationale’ that would prevent a crowded plot on one side, and an empty one on the other side.

TheCedarPrince · December 30, 2021, 3:10pm

Yeah, getting a plot to scale “just right” is really hard, I have found.
Frankly, I can automate everything in my plotting workflows for the most part, but the final bit of tweaking to make things legible on the plot is usually a part that I leave for my manual review.
Another option, if you know the range N will fall in, is to pick the settings by interval, so that each interval of N returns a fixed set of values, like:

args = N < 10_000 ? (markersize = 1, markeralpha = 1) : (N < 100_000 ? (markersize = 1, markeralpha = 0.5) : (markersize = 1, markeralpha = 0.1))

The above is just another thought - rather sloppy but could assist with solving the issue.
Best of luck!
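
The interval idea reads more easily as a small function that returns the keyword settings as a NamedTuple. This is only a sketch: marker_settings is a hypothetical name, and the thresholds and values are the ones from the one-liner in this post, not tuned ones:

```julia
# Hypothetical helper returning plot keyword settings as a NamedTuple.
# Thresholds and values mirror the one-liner in this post; they are not tuned.
function marker_settings(N)
    if N < 10_000
        return (markersize = 1, markeralpha = 1.0)
    elseif N < 100_000
        return (markersize = 1, markeralpha = 0.5)
    else
        return (markersize = 1, markeralpha = 0.1)
    end
end

println(marker_settings(50_000))
```

The result can be splatted into the plotting call, for example make_plot(data, N; marker_settings(N)...).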

rafael.guerra · December 30, 2021, 9:53pm

There is an additional difficulty: settings that look good on screen will look bad when saved to PNG, and vice versa.

The following functions do a fair job on my system in the range from 10 to 1 million points, with dpi=100 for the screen display and dpi=600 for saving the figure to PNG.

using Distributions, StatsBase, Plots; gr()

# get normalised ranks : 
pseudos(sample) = [ordinalrank(sample[i,:])./(size(sample,2)+1) for i in 1:size(sample,1)]

make_plot(dat,N; kwargs...) = scatter!(pseudos(dat[:,1:N])...;kwargs...)

data = rand(LogNormal(0,1), (2,1_000_000))
data[2,:] .+= rand(1_000_000) .* data[1,:].^2

dpi = 100  # use dpi=100 for default screen display, and dpi=600 to savefig as PNG
sf = dpi/100
ms(n) = 0.1*sf + 4exp(-sf*5e-5*n)
ma(n) = 0.02*sf + exp(-3e-5*n/sf) 
N = [10^n for n in 1:6]
p = plot(layout=(2,3), size = (1800, 1200), legend=:bottomright, dpi=dpi)
[make_plot(data, N[n], ms=ms(N[n]), ma=ma(N[n]), label=string(N[n]), msw=0, msc=:auto, subplot=n) for n in 1:length(N)]
p

Screen display (dpi=100):

[Figure: Plots_scatter_from_10_to_1M_dots_dpi100]

Savefig as PNG (dpi=600):

[Figure: Plots_scatter_from_10_to_1M_dots_dpi600]

lrnv · December 30, 2021, 10:00pm

This is very impressive, thanks a lot, this solves my issues.

May I ask how you found these?

rafael.guerra · December 30, 2021, 10:15pm

Empirically from results in your post and simulating a few additional cases that provided data points for ms(n) and ma(n). From the shapes of those curves, the exponentials seemed sufficient to roughly capture their behavior.
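
To see the shape of these curves, they can be evaluated on their own, without plotting anything. The sketch below reuses the same expressions for ms(n) and ma(n); passing the scale factor sf as an argument and clamping the opacity at 1 are assumptions of this sketch, not part of the original formulas:

```julia
# Same functional forms as in the earlier post, with the dpi scale factor `sf`
# passed explicitly instead of read from a global (an assumption of this sketch).
ms(n, sf = 1.0) = 0.1 * sf + 4 * exp(-sf * 5e-5 * n)
# The raw expression slightly exceeds 1 for very small n, so it is clamped here;
# the clamp is an addition of this sketch, not part of the original formula.
ma(n, sf = 1.0) = min(1.0, 0.02 * sf + exp(-3e-5 * n / sf))

# Print marker size and opacity for n = 10, 100, ..., 1_000_000 at dpi = 100 (sf = 1).
for n in (10^k for k in 1:6)
    println(rpad(n, 9), "ms = ", round(ms(n), digits = 3), "   ma = ", round(ma(n), digits = 3))
end
```

Both curves start near their maximum for small n and decay toward a floor (0.1*sf for the size, 0.02*sf for the opacity), which is what keeps the largest samples from turning into a solid block.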

lrnv · December 30, 2021, 10:20pm

Bravo, really smart.

lrnv · January 5, 2022, 9:07am

One last thing, if someone still cares (I do!). This depends entirely on the final size of the plot, and I have included this parameter in the following modification of your code:

# `side` is the side length of one subplot in pixels; the name avoids shadowing Base.size
ms(n,side) = (0.1*sf + 4exp(-sf*5e-5*n))*side/600
ma(n,side) = 0.02*sf*sqrt(side/600) + exp(-3e-5*n/sf)
N = [10^n for n in 1:6]
s = 100
p = plot(layout=(2,3), size = (3s,2s), legend=:bottomright, dpi=dpi)
[make_plot(data, N[n], ms=ms(N[n],s), ma=ma(N[n],s), label=string(N[n]), msw=0, msc=:auto, subplot=n) for n in 1:length(N)]
p

This works a little better for small plots (about 200px per side), but it is hard to make it work for smaller ones (about 100px per side) with many points (the square is just solid blue), as well as for bigger ones (600px per side, or even 1000px), and of course still with the two possible dpi values.

cjdoris · January 5, 2022, 12:45pm

For scatter plots with many points, you should really use something like https://datashader.org/. It’s essentially a heatmap but where each pixel is a bin.

I made a basic Julia version of this: https://github.com/cjdoris/ShadeYourData.jl. It’s a few years old and written for an old version of Makie so won’t work now but could be resurrected.
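
The underlying idea, counting points per pixel-sized bin and displaying the counts, can be sketched in a few lines of plain Julia. This is a hypothetical illustration, not the datashader or ShadeYourData.jl implementation; the function name bin_counts and the grid size are assumptions, and the coordinates are assumed to lie in [0, 1], as the normalised ranks in this thread do:

```julia
# Count how many points fall into each cell of an nx-by-ny grid over [0,1]^2.
# Hypothetical helper for illustration only; it is not taken from datashader
# or ShadeYourData.jl.
function bin_counts(x, y; nx = 200, ny = 200)
    counts = zeros(Int, ny, nx)          # rows index y, columns index x
    for (xi, yi) in zip(x, y)
        # Map a coordinate in [0, 1] to a bin index, clamping the edge value 1.0.
        i = clamp(floor(Int, xi * nx) + 1, 1, nx)
        j = clamp(floor(Int, yi * ny) + 1, 1, ny)
        counts[j, i] += 1
    end
    return counts
end

x, y = rand(100_000), rand(100_000)
c = bin_counts(x, y)
println(sum(c))   # every point lands in exactly one bin
```

With Plots loaded, something like heatmap(log1p.(c)) would then display the density, one cell per bin, regardless of how many points went in.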

lrnv · January 5, 2022, 12:48pm

This is beautiful. However, that seems like a little too much work for me right now.

isentropic · January 12, 2022, 6:12am

You might just use a heatmap to begin with. If you have a lot of points (> 1e6), a histogram looks reasonably good (just disable the colorbar if you don’t like it).

without the bar:

points = pseudos(data[:,1:1000000])
histogram2d(points[1], points[2], cbar=false)

I think the histogram tells a much better story: a huge share of the events sits in the top-right corner (3.5e4 events), whereas the scatterplot cannot reflect this anomalous density.

lrnv · January 12, 2022, 8:06am

Thanks for your input, the heat map is indeed a neat way to show it.

But we actually need the level of detail that the scatter plot shows. As you may have noted, this is pseudo-data, i.e. normalized ranks, or a copula sample. Hence both marginal distributions are uniform on [0,1]. Knowing this usually makes such graphs easier to read. It also makes it easy to predict the bright color of the last square of the heat map from the pseudo-data: the tails are clearly dependent.
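
To illustrate why the marginals are uniform, here is a small sketch that computes normalised ranks without StatsBase. The names ranks and pseudo are hypothetical, and the rank computation is assumed to match ordinalrank only for data without ties:

```julia
# Ordinal ranks via two permutations; for data without ties this matches what
# StatsBase.ordinalrank returns. `ranks` and `pseudo` are hypothetical names.
ranks(x) = invperm(sortperm(x))
pseudo(x) = ranks(x) ./ (length(x) + 1)

u = pseudo(randn(1_000))
# Whatever the input distribution, the sorted output is the regular grid
# 1/(n+1), 2/(n+1), ..., n/(n+1), i.e. evenly spread over (0, 1).
println(sort(u) ≈ (1:1_000) ./ 1_001)
```

Since each margin is always this same evenly spaced grid, everything visible in the scatter plot comes from the dependence between the two coordinates.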
