How to plot ~10 billion datapoint time series efficiently?

hk69 · May 18, 2022, 1:09am

Hi, I am relatively new to Julia.
I have a physics model that I implemented in Julia. I solve the differential equations to track the orbit of two black holes over a period of about 5 years with a very small time step. My output has ~10 billion elements, which I want to plot. What is the best (fastest and most memory-efficient) way to do this in Julia?

I don’t need to zoom in or interact with my plot. I have tried the gr() and inspectdr() backends for Plots.jl, and I have also tried the CairoMakie backend for Makie, but they just aren’t giving me the performance I need. The pyplot() backend is the fastest that I have tested, but that too is useful only up to 1e8 datapoints. I want to go to 1e10 datapoints.

Edit: I can’t use adaptive time-step to reduce the number of datapoints while solving my diff eq because that gives me uneven time steps. I have to take the FFT of my solution which doesn’t work with uneven time steps.

joa-quim · May 18, 2022, 2:25am

It makes no sense to plot that many points. No one can see them. Compute a density distribution and plot that.
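
A minimal sketch of that idea, assuming the samples can be streamed as chunks of (x, y) values and that the plot window is known in advance. The helper `bin_counts!` and the 400×300 grid are illustrative choices, not part of any package; the resulting count matrix can then be drawn as a heatmap with whatever backend you prefer.

```julia
# Accumulate (x, y) samples into a fixed-size count grid, one chunk at a time.
# `bin_counts!` is a hypothetical helper name, not a library function.
function bin_counts!(counts::Matrix{Int}, xs, ys, xlim, ylim)
    nx, ny = size(counts)
    xmin, xmax = xlim
    ymin, ymax = ylim
    for (x, y) in zip(xs, ys)
        # Skip samples that fall outside the plotted window.
        (xmin <= x <= xmax && ymin <= y <= ymax) || continue
        # Map each coordinate to a bin index in 1:nx and 1:ny.
        i = min(nx, 1 + floor(Int, (x - xmin) / (xmax - xmin) * nx))
        j = min(ny, 1 + floor(Int, (y - ymin) / (ymax - ymin) * ny))
        counts[i, j] += 1
    end
    return counts
end

# Toy usage: one million samples of a sine wave reduced to a 400x300 grid.
counts = zeros(Int, 400, 300)
t = range(0, 20π; length = 1_000_000)
bin_counts!(counts, t, sin.(t), (0, 20π), (-1, 1))
```

Because the grid size is fixed, memory use does not grow with the number of samples; calling `bin_counts!` once per chunk is enough.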

JeffreySarnoff · May 18, 2022, 3:10am

would someone show how that might be done?

stillyslalom · May 18, 2022, 5:26am

Putting plotting aside for a moment, several solutions exist for handling uneven timesteps: you could use a nonuniform FFT as provided by FINUFFT.jl or FastTransforms.jl, or use the DiffEq.jl solution interpolation interface to resample onto a uniform grid.
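
To illustrate the resampling route, here is a rough sketch that uses plain linear interpolation as a stand-in. The helper `resample_uniform` is hypothetical; a DifferentialEquations.jl solution object can be evaluated at arbitrary times with the solver's own higher-order interpolant, which should be more accurate than this.

```julia
# Linearly interpolate samples (t, u), taken at increasing but possibly
# uneven times, onto n evenly spaced times between first(t) and last(t).
# `resample_uniform` is an illustrative name, not a library function.
function resample_uniform(t::AbstractVector, u::AbstractVector, n::Integer)
    @assert length(t) == length(u) && length(t) >= 2 && n >= 2
    tu = range(first(t), last(t); length = n)
    out = Vector{Float64}(undef, n)
    k = 1
    for (i, τ) in enumerate(tu)
        # Advance k until [t[k], t[k+1]] is the interval that contains τ.
        while k < length(t) - 1 && t[k + 1] < τ
            k += 1
        end
        w = (τ - t[k]) / (t[k + 1] - t[k])
        out[i] = (1 - w) * u[k] + w * u[k + 1]
    end
    return tu, out
end

# Toy usage: uneven samples of a straight line, resampled to 5 even steps.
t = [0.0, 0.1, 0.4, 1.0]
u = 2 .* t
tu, uu = resample_uniform(t, u, 5)
```

The uniform output can then be passed to an ordinary FFT.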

lmiq · May 18, 2022, 7:37am

That would occupy about 80GB of memory, so probably you can’t even load that trajectory in memory.
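
For reference, the arithmetic behind that estimate, assuming one Float64 value per sample (a matching time vector would double it):

```julia
n = 10^10                          # number of samples in one series
bytes_f64 = n * sizeof(Float64)    # 8 bytes per sample
bytes_f32 = n * sizeof(Float32)    # 4 bytes per sample
println(bytes_f64 / 1e9, " GB as Float64")
println(bytes_f32 / 1e9, " GB as Float32")
```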

Can’t you just sample less frequently?

(You don’t need to save every time step of the simulation.)

paulmelis · May 18, 2022, 7:43am

Putting aside the technical challenge here, what’s the goal for visualizing your data? Inspect the output for correctness? Get a feel for the patterns in the data? Prepare a publication? As you mention you don’t need interactivity I would assume it is not so much data exploration that you’re interested in, which makes me wonder why you would want to plot all of the data, given its size.

rafael.guerra · May 18, 2022, 9:19am

Please check this GR solution, using shade() to plot a large 1D time series.

Or this other GR example using shadepoints(). By using Float32, more than 1 billion points can be processed in seconds.

wc4wc4wc4 · May 18, 2022, 10:55am

Can you show a small example of the plot you want to improve?
Do you plot trajectories (i.e. lines) or would a histogram (1D / 2D) make sense instead?

trilobit · May 18, 2022, 12:49pm

Perhaps this helps:

Online algorithm - Wikipedia

scheidan1 · May 18, 2022, 1:49pm

You probably want to downsample your data before plotting.

The thesis “Downsampling Time Series for Visual Representation” by Sveinn Steinarsson investigates different visually pleasing subsampling strategies (see here).
If I remember correctly, the “Largest Triangle Three Buckets” (LTTB) algorithm is recommended.

~~Unfortunately, I could only find implementations in java and python~~

See here, from @stevengj’s reply below.
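
For what it’s worth, here is a sketch of LTTB as I understand it from the thesis. The function name `lttb` and its signature are my own, not from a package, and it assumes the whole series fits in memory: the first and last points are always kept, the rest are split into equal buckets, and from each bucket the point forming the largest triangle with the previously kept point and the mean of the next bucket is selected.

```julia
# Largest-Triangle-Three-Buckets: return the indices of m points to keep.
# `lttb` is an illustrative implementation, not a library function.
function lttb(x::AbstractVector, y::AbstractVector, m::Integer)
    n = length(x)
    @assert n == length(y)
    (m >= n || m < 3) && return collect(1:n)   # nothing to reduce
    idx = Vector{Int}(undef, m)
    idx[1] = 1                                 # always keep the first point
    every = (n - 2) / (m - 2)                  # average bucket width
    a = 1                                      # index of the last kept point
    for i in 0:(m - 3)
        # Current bucket (inclusive indices).
        lo = floor(Int, i * every) + 2
        hi = min(floor(Int, (i + 1) * every) + 1, n - 1)
        # Next bucket; on the last iteration this is just the final point.
        nlo = hi + 1
        nhi = min(floor(Int, (i + 2) * every) + 1, n)
        avgx = sum(view(x, nlo:nhi)) / (nhi - nlo + 1)
        avgy = sum(view(y, nlo:nhi)) / (nhi - nlo + 1)
        # Pick the point with the largest triangle area (up to a factor 1/2).
        best, bestarea = lo, -1.0
        for k in lo:hi
            area = abs((x[a] - avgx) * (y[k] - y[a]) -
                       (x[a] - x[k]) * (avgy - y[a]))
            if area > bestarea
                best, bestarea = k, area
            end
        end
        idx[i + 2] = best
        a = best
    end
    idx[m] = n                                 # always keep the last point
    return idx
end

# Toy usage: a flat signal with a single spike, reduced from 1000 to 50 points.
xs = collect(1.0:1000.0)
ys = zeros(1000)
ys[500] = 100.0
keep = lttb(xs, ys, 50)
```

Plotting `xs[keep]` against `ys[keep]` should preserve the spike, which plain every-k-th subsampling would usually miss.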

stevengj · May 18, 2022, 1:50pm

There are a lot of different algorithms for downsampling huge timeseries datasets for visualization. See this post, for example (including some code).

If the downsampling algorithm is local (as is the case in the linked example above), you can process the dataset in chunks if the whole thing doesn’t fit into memory at once.
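
As a rough illustration of chunked processing, the sketch below uses a per-bucket min/max reduction as a stand-in for any local downsampling rule. The helpers `minmax_chunk` and `envelope` are hypothetical names, and the generator only simulates reading chunks from disk.

```julia
# Reduce one chunk to per-bucket minima and maxima.
# `minmax_chunk` is an illustrative helper, not a library function.
function minmax_chunk(y::AbstractVector, bucket::Integer)
    lows = Float64[]
    highs = Float64[]
    for start in 1:bucket:length(y)
        stop = min(start + bucket - 1, length(y))
        lo, hi = extrema(view(y, start:stop))
        push!(lows, lo)
        push!(highs, hi)
    end
    return lows, highs
end

# Stream over the series chunk by chunk, so the full series never has to
# exist in memory at once. `chunks` can be any iterable of vectors.
function envelope(chunks, bucket::Integer)
    lows, highs = Float64[], Float64[]
    for chunk in chunks
        l, h = minmax_chunk(chunk, bucket)
        append!(lows, l)
        append!(highs, h)
    end
    return lows, highs
end

# Toy usage: ten chunks of 10_000 samples each, reduced to 1000 buckets.
chunks = (sin.(range(2π * c, 2π * (c + 1); length = 10_000)) for c in 0:9)
lows, highs = envelope(chunks, 100)
```

Drawing the band between `lows` and `highs` gives the same visual envelope as the full line plot at a tiny fraction of the points.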

joa-quim · May 18, 2022, 1:57pm

Yep, I saw that post when it arrived, and it is on my to-do list to adopt something like that in GMT.jl. But the user mentioned orbits, which makes me think the problem might depend on x,y coordinates. What I had in mind is using the GMT module blockmean to compute a grid with some statistics of the orbits, which could later be easily and quickly turned into a plot.
I’m not sure right now (I would need to check the C code), but I think blockmean reads record by record, so if the data is in a disk file it will take a while but RAM should not be a problem (if it is, the procedure would have to be split into chunks).

hk69 · May 18, 2022, 3:02pm

I can’t sample less frequently because my orbit is spiraling inwards. It becomes smaller and smaller so if I sample less frequently I get very few data points per orbit which can’t really map the orbit accurately.
I am aware of the memory problem so I have rented a virtual machine with lots of memory to run this code.

stevengj · May 18, 2022, 3:04pm

That’s why downsampling strategies often need to be adaptive, so they sample more frequently when the data is changing more.
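
A deliberately naive sketch of the adaptive idea, not one of the published algorithms: keep a sample only when the value has moved by more than a tolerance since the last kept sample. The function name `adaptive_indices`, the tolerance, and the `maxgap` option are all illustrative assumptions.

```julia
# Keep index i when y has changed by more than `tol` since the last kept
# sample, or when `maxgap` samples have passed. The first and last samples
# are always kept. `adaptive_indices` is a hypothetical helper.
function adaptive_indices(y::AbstractVector, tol::Real; maxgap::Integer = typemax(Int))
    n = length(y)
    n == 0 && return Int[]
    keep = [1]
    prev = 1
    for i in 2:n
        if abs(y[i] - y[prev]) > tol || i - prev >= maxgap || i == n
            push!(keep, i)
            prev = i
        end
    end
    return keep
end

# Toy usage: a flat stretch followed by an oscillating stretch.
y = vcat(zeros(1000), sin.(range(0, 20π; length = 1000)))
idx = adaptive_indices(y, 0.05)
```

On this toy input almost nothing is kept from the flat stretch, while the oscillating stretch is sampled densely.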

hk69 · May 18, 2022, 3:08pm

It is for a publication. I don’t want to plot all of the data, I just can’t find a solution that helps me preserve the signal and create a publication-quality plot. (I also want to overlay other signals on the same plot)

stevengj · May 18, 2022, 3:29pm

For example, here are two recent adaptive downsampling algorithms; see also the references therein:

Daniel et al, “Adaptive resampling for data compression” (2021)
Gil et al “Towards Smart Data Selection From Time Series Using Statistical Methods” (2021).

Not sure what free implementations are out there of this kind of technique, though; I couldn’t find anything even in Python, but maybe I was searching the wrong keywords. (e.g. Pandas.resample uses fixed bin sizes.)

JeffreySarnoff · May 18, 2022, 5:41pm

If it is permissible, post a short excerpt from the data file[s] you want to show, along with (if you know, or just guess) any characterizing info (e.g. the largest, smallest and mean values in each data vector). It may help others to give more specific suggestions if you posted even a very coarse plot (one point in each million) – or better yet, if there are other plots with a similar look that you could link.

hk69 · May 18, 2022, 8:33pm

This is the plot for a part of the orbit. The full plot will be very similar to it, the yellow curve will just be spanning a bit more of the x-axis

JeffreySarnoff · May 18, 2022, 8:50pm

Is this correct?

And with your data, each of the two “curves” is given as a sequence of (x,y) coordinate pairs. And each “curve” may be determined by, say, 50 billion pairs; and the pairs are stored in some sensible sequence, so the (x,y) pair at index i1 bears some physical contiguity with the predecessor and successor pairs.

hk69 · May 18, 2022, 9:06pm

Yes. The blue curve doesn’t have billions of data points. It’s just the noise curve for comparison. The yellow curve is theoretically calculated with a lot of data points but it is just a simple x,y line plot. The GR.shade example suggested by @rafael.guerra works well for this but it just doesn’t produce a publication quality plot.
