Scatterplot readability

Hey,

I have trouble finding an automatic way to make a scatterplot readable, while the number of points is variable. See the following code :

using Distributions, StatsBase, Plots
data = rand(LogNormal(0,1), (2,1_000_000))
data[2,:] .+= rand(1_000_000) .* data[1,:].^2
# get normalised ranks : 
pseudos(sample) = [ordinalrank(sample[i,:])./(size(sample,2)+1) for i in 1:size(sample,1)]

make_plot(dat,N; kwargs...) = scatter(pseudos(dat[:,1:N])...;kwargs...)


make_plot(data,100) # OK
make_plot(data,1000) # Still OK
make_plot(data,10000) # Unreadable
make_plot(data,10000;markersize=1,markeralpha = 0.5) # Not readable either
make_plot(data,10000;markersize=3,markeralpha = 0.5) # Somehwat OK

make_plot(data,100000) # Lol
make_plot(data,100000;markersize=1,markeralpha = 0.5) # A lot better. 

make_plot(data,1000000;markersize=1,markeralpha = 0.5) # Not OK
make_plot(data,1000000;markersize=1,markeralpha = 0.1) # Beautifull.

Can we make this process automatic, so that whatever N the plot is readable ? I particularly like the last one, but I do need something automatic.

Hey @Irnv

Looking at the code makes me think this may be a bit more of a “problem-by-problem” basis.
In that I mean that when I make plots, to sometimes get a plot to “look good” subjectively speaking, I have to manually tweak it for what is needed at hand (I use Makie.jl and PyPlot.jl).
However, perhaps what you can do is calculate a ratio between your number of points N and your markersize and markeralpha based on the varying nature of N in your make_plot function definition.
Like something like:

alpha = N / 20000
size = N / 20000

I was just using 20000 as an estimate based on when N = 10000 and markeralpha = 0.5.
Perhaps this could be a start for creating an automatic scaling for you.
Just a thought.
Cheers!

~ tcp :deciduous_tree:

1 Like

That is exactly what i’m doing right now, but it’s not really successfull yet. Maybe, as you pointed out, this is too problem-specific to find a good ‘rationale’ that would prevent a crowded plot on one side, and an empty one on the other side

Yea, getting a plot to scale “just right” is really hard I have found.
Frankly, I can automate everything in my plotting workflows for the most part, but the final bit of tweaking to make things legible on the plot is usually a part that I leave for my manual review.
One thing too is if you happen to know what the range of N will always be is to set a range of sorts such that if it is between a certain number, return a set of values like:

args = 10000 < N < 100000 ? (markersize = 1, markeralpha = 0.5) : (N < 10000 ? (markersize = 1, markeralpha = 1) : (markersize = 1, markeralpha = 0.1))

The above is just another thought - rather sloppy but could assist with solving the issue.
Best of luck!

1 Like

There is an additional difficulty: settings that look good on screen will look bad when saved to PNG, and vice versa.

The following functions do a fair job on my system in the range from 10 to 1 million points, with dpi=100 for the screen display and dpi=600 for saving the figure to PNG.

 using Distributions, StatsBase, Plots; gr()

# get normalised ranks : 
pseudos(sample) = [ordinalrank(sample[i,:])./(size(sample,2)+1) for i in 1:size(sample,1)]

make_plot(dat,N; kwargs...) = scatter!(pseudos(dat[:,1:N])...;kwargs...)

data = rand(LogNormal(0,1), (2,1_000_000))
data[2,:] .+= rand(1_000_000) .* data[1,:].^2

dpi = 100  # use dpi=100 for default screen display, and dpi=600 to savefig as PNG
sf = dpi/100
ms(n) = 0.1*sf + 4exp(-sf*5e-5*n)
ma(n) = 0.02*sf + exp(-3e-5*n/sf) 
N = [10^n for n in 1:6]
p = plot(layout=(2,3), size = (1800, 1200), legend=:bottomright, dpi=dpi)
[make_plot(data, N[n], ms=ms(N[n]), ma=ma(N[n]), label=string(N[n]), msw=0, msc=:auto, subplot=n) for n in 1:length(N)]
p
3 Likes

This is very impressive, thanks a lot this solves my issues.

May I ask how you found these ?

Empirically from results in your post and simulating a few additional cases that provided data points for ms(n) and ma(n). From the shapes of those curves, the exponentials seemed sufficient to roughly capture their behavior.

2 Likes

Bravo, really smart.

One last thing, if someone does still care (I do!). This is totally dependent from the final size of the plot, and I have included this parameter in the following modification of your code:

ms(n,size) = (0.1*sf + 4exp(-sf*5e-5*n))*size/600
ma(n,size) = 0.02*sf*sqrt(size/600) + exp(-3e-5*n/sf)
N = [10^n for n in 1:6]
s = 100
p = plot(layout=(2,3), size = (3s,2s), legend=:bottomright, dpi=dpi)
[make_plot(data, N[n], ms=ms(N[n],s), ma=ma(N[n],s), label=string(N[n]), msw=0, msc=:auto, subplot=n) for n in 1:length(N)]
p

This works a little better for small plots (about 200px side), but it is hard to make it work for smaller ones (about 100px size) with many points (the square is just full blue), as well a biger ones (600px is size, albeit 1000px), of course still with the two potential dpi.

1 Like

For scatter plots with many points, you should really use something like https://datashader.org/. It’s essentially a heatmap but where each pixel is a bin.

I made a basic Julia version of this: GitHub - cjdoris/ShadeYourData.jl: Interactive plotting of millions of data points. It’s a few years old and written for an old version of Makie so won’t work now but could be resurrected.

1 Like

This is beautifull. However that seems a little too much work for me right now :slight_smile:

You might just use heatmap to begin with, if you have a lot of points > 1e6, a histogramlooks reasonably well (just disable the colorbar if you don’t like it)

image

without the bar:
image

Run

points = pseudos(data[:,1:1000000])
histogram2d(points[1], points[2], cbar=false)

I think the histogram tells a much better story that a huge part of events is at the right top corner (3.5e4 events), whereas scatterplot cannot reflect this anomalous density

Thanks for your input, the heat map is indeed a neat way to show it.

But we actually need the level of details the scatter plots reflects. As you may have noted, this is pseudo-data, i.e. normalized ranks or copula sample. Hence both marginal distributions are uniform on [0,1]. Knowing this usually facilitates the reading of such graphs. It also allows to predict the bright color of the last square of the heat map easily from the pseudo-data : the tails are clearly dependant.