How to plot a ~10-billion-datapoint time series efficiently?

Publication quality is typically (at least) 300 dpi. Say your graph is 6 inches wide – then you have 1800 possible X values to plot.

Couldn’t you just bin (or “bucket”) the data by x-value, then compute the mean of y for each bin?

That is, for each x value, transform it by

xbin = floor(x;digits=-(trunc(Int,log10(x))-1))

Examples:

julia> x=8.91826e-6
8.91826e-6

julia> xbin=floor(x;digits=-(trunc(Int,log10(x))-1))
8.0e-6

julia> x=0.024963571
0.024963571

julia> xbin=floor(x;digits=-(trunc(Int,log10(x))-1))
0.02

Then, for each distinct binned x-value (here ranging from 1 down to 0.0001), compute the mean of y within that “bin”.

Then plot (xbin,ybin)

Note that for each xbin, you could also compute the standard deviation of y, the min of y, and the max of y, if you’d like to plot confidence intervals or ranges as well.

The above is “equal width binning”.
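For illustration, here is a minimal sketch of that binning in Julia (the function name is made up, and it assumes all x values are positive Float64s):

using Statistics

# Sketch: bin each x down via the floor(...) expression above, then average
# the y values that land in each bin.
function bin_equal_width(x::AbstractVector, y::AbstractVector)
    bins = Dict{Float64, Vector{Float64}}()
    for (xi, yi) in zip(x, y)
        xbin = floor(xi; digits = -(trunc(Int, log10(xi)) - 1))
        push!(get!(bins, xbin, Float64[]), yi)
    end
    xb = sort!(collect(keys(bins)))
    yb = [mean(bins[k]) for k in xb]
    return xb, yb
end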

For “equal frequency binning,” you would divide your number of data points by the number of bins – so 10 billion / 1800 ≈ 5,555,555.

Sort your data by x – or in your case, it is probably gathered in sorted order already.

Take the first 5,555,555 points. Compute the mean of x and the mean of y for those data points. Store it as (xbin, ybin).

Repeat for every batch of 5,555,555 points.

At the end, you will have 1800 values of (xbin,ybin) to plot. The xbin-values won’t be equally spaced like they are in the “equal width binning” method.

It also perhaps lets you reduce memory usage – as you read each batch of 5,555,555 points, compute (xbin, ybin), then discard those points.
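A rough sketch of that batch-by-batch approach (assuming the data arrive already sorted by x; the function name and the `pairs` iterator are made up for illustration – it could be anything that streams (x, y) tuples from disk, e.g. zip(x, y)):

using Statistics

# Sketch: consume (x, y) pairs in order and emit one (mean x, mean y) point per
# batch of `batchsize` values, so the full series never has to sit in memory.
function equal_frequency_bins(pairs; batchsize = 5_555_555)
    xb, yb = Float64[], Float64[]
    bx, by = Float64[], Float64[]
    for (x, y) in pairs
        push!(bx, x); push!(by, y)
        if length(bx) == batchsize
            push!(xb, mean(bx)); push!(yb, mean(by))
            empty!(bx); empty!(by)
        end
    end
    isempty(bx) || (push!(xb, mean(bx)); push!(yb, mean(by)))  # leftover partial batch
    return xb, yb
end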

As mentioned above, this “naive” binning approach can misrepresent a line plot if the line is not “smooth” relative to the bin width. Sharp dips and peaks get erased by such a method, and those sharp features can be very important to the proper interpretation of the chart.

Whether this particular use case is smooth enough for this to work is another question. The sample above looks like it may in fact be smooth enough, but some of the conversation above has been about algorithms that outperform this technique over a wider range of line plots that may not be adequately “smooth”.

1 Like

Yet if the maximum number of x-values is 1800, and you compute 1800 bins, haven’t you achieved the maximum resolution?


This image is taken from the Master’s thesis linked above. The right plot has been downsampled as you describe. Notice how the sharp peaks have been erased.

The question boils down to: for a given pixel column, which y-pixel best describes the data in that x-bin? If narrow sharp peaks are important, then taking the average of a bin is inadequate, as those sharp peaks are not preserved. More advanced techniques use various heuristics to determine which points within a bin are the “most important” when choosing which y-pixel to plot for that x-bin.
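One simple heuristic in that spirit (not the thesis’s algorithm – just an illustrative sketch with made-up names) is to keep a few representative points per x-bin, e.g. the first, the minimum, the maximum, and the last, so narrow peaks still reach the plotted line:

# Sketch: for each of `nbins` equal-width x-bins, keep the first, min-y, max-y,
# and last points (in x order) instead of the mean, so dips and spikes survive.
function minmax_downsample(x::AbstractVector, y::AbstractVector, nbins::Int)
    lo, hi = extrema(x)
    width = (hi - lo) / nbins
    binof(xi) = clamp(floor(Int, (xi - lo) / width) + 1, 1, nbins)
    groups = [Int[] for _ in 1:nbins]
    for i in eachindex(x)
        push!(groups[binof(x[i])], i)
    end
    kept = Tuple{Float64, Float64}[]
    for idx in groups
        isempty(idx) && continue
        sel = sort!(unique([first(idx), idx[argmin(y[idx])], idx[argmax(y[idx])], last(idx)]))
        append!(kept, [(float(x[i]), float(y[i])) for i in sel])
    end
    return kept
end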

2 Likes

That particular plot uses an every-nth-data-point algorithm, which is not binning.

However, I take your point. Each x-bin has a distribution of y values. The question is which point from that distribution to plot for each x-bin. It comes down to “what you want” and “what’s important to you.”

Which is ridiculous in a sense – trying to represent a distribution with a single point. Hence one could compute additional statistics – standard deviation, max, min, various quantiles, etc. – and use those to plot error bars or confidence shading around the line.

Say that instead of plotting just the mean y for each xbin, you also plotted two more lines – one for the max y, the other for the min y. Now you have an outline of the data and its central tendency.
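For instance, a small sketch (function name made up) that returns the mean together with the min and max per bin, given bin labels computed as earlier in the thread:

using Statistics

# Sketch: group y values by their precomputed x-bin, then return the per-bin
# mean, min, and max – one central line plus an envelope to plot around it.
function bin_envelope(xbin::AbstractVector, y::AbstractVector)
    acc = Dict{eltype(xbin), Vector{Float64}}()
    for (b, yi) in zip(xbin, y)
        push!(get!(acc, b, Float64[]), yi)
    end
    xb = sort!(collect(keys(acc)))
    ys = [acc[k] for k in xb]
    return xb, mean.(ys), minimum.(ys), maximum.(ys)
end

Plotting the mean as the line and shading between the min and max series gives exactly that outline.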

2 Likes

You’d need to convert the plot to pixel graphics either way, since no document tooling and no journal in the world could process or accept vector plots with 10 billion data points.
Converting to e.g. 1000x1000 px gives you ~1000 possible x positions to plot.

What you really want is not a library that will generate 2D line plots with 10 billion points efficiently (there is none, since that is a useless task), but a library that reduces your data to the point where you can plot it efficiently. That way you get both: standard tooling and fancy vector plots for a journal.

@joa-quim already suggested the right thing, there’s plenty of well researched point reduction algorithms to reduce your data according to some quality criteria. Two of the most commonly used algorithms are Douglas-Peucker and Visvalingam-Whyatt (I prefer the latter, as it’s dead simple).
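Since Visvalingam-Whyatt really is simple, here is a naive illustrative sketch (roughly O(n²) because it recomputes all areas after every removal; real implementations keep the areas in a priority queue; the names are made up):

# Area of the triangle formed by three (x, y) points.
triarea(a, b, c) = abs((b[1] - a[1]) * (c[2] - a[2]) - (c[1] - a[1]) * (b[2] - a[2])) / 2

# Sketch of Visvalingam-Whyatt: repeatedly drop the interior point whose
# triangle with its two neighbours has the smallest area, until only
# `npoints` remain. Call e.g. visvalingam([(1.0, 2.0), (2.0, 2.1), ...], 1800).
function visvalingam(points::AbstractVector{<:Tuple}, npoints::Int)
    pts = collect(points)
    while length(pts) > max(npoints, 2)
        areas = [triarea(pts[i-1], pts[i], pts[i+1]) for i in 2:length(pts)-1]
        deleteat!(pts, argmin(areas) + 1)   # areas[1] belongs to pts[2]
    end
    return pts
end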
Some of them are presented here, simply pick the one that suits you best:

1 Like

And I also found this site that has a Python script for Douglas-Peucker in spherical coordinates, which is something we have needed in GMT for quite some time.

Note that the Douglas-Peucker reduction algorithm is implemented in Meshes.jl:

https://juliageometry.github.io/Meshes.jl/stable/algorithms/simplification.html#Douglas-Peucker

3 Likes