How to plot ~10 billion datapoint time series efficiently?

Hi, I am relatively new to Julia.
I have a physics model that I implemented in Julia. I solve the differential equations to track the orbit of two black holes over a period of about 5 years with a very small time step. My output has ~10 billion elements which I want to plot. What is the best (fastest and most memory-efficient) way to do this in Julia?

I don’t need to zoom in or interact with my plot. I have tried the gr() and inspectdr() backends for Plots.jl, and I have also tried the CairoMakie backend for Makie, but they just aren’t giving me the performance. The pyplot() backend is the fastest that I have tested, but even that is useful only up to about 1e8 datapoints. I want to go to 1e10 datapoints.

Edit: I can’t use an adaptive time step to reduce the number of datapoints while solving my diff eq, because that gives me uneven time steps. I have to take the FFT of my solution, which doesn’t work with uneven time steps.

1 Like

It makes no sense to plot that many points. No one can see them. Compute a density distribution and plot that.

12 Likes

Would someone show how that might be done?

1 Like

Putting plotting aside for a moment, several solutions exist for handling uneven timesteps: you could use a nonuniform FFT as provided by FINUFFT.jl or FastTransforms.jl, or use the DiffEq.jl solution interpolation interface to resample onto a uniform grid.
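
For the second route, here is a minimal sketch (a toy scalar ODE standing in for the orbital model) of letting the solver take adaptive steps, resampling onto a uniform grid through the solution’s built-in interpolation, and then applying an ordinary FFT:

```julia
# Sketch: adaptive-step solve, then resample onto a uniform grid via sol(t)
using DifferentialEquations, FFTW

f(u, p, t) = -u                          # toy ODE standing in for the orbital model
prob = ODEProblem(f, 1.0, (0.0, 10.0))
sol = solve(prob, Tsit5())               # adaptive steps, dense interpolation

ts = range(0.0, 10.0; length = 2^16)     # uniform grid of your choosing
u_uniform = sol.(ts)                     # evaluate the interpolant on that grid
spectrum = rfft(u_uniform)               # ordinary FFT now applies
```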

5 Likes

That would occupy about 80 GB of memory, so you probably can’t even load that trajectory into memory.

Can’t you just sample less frequently?

(You don’t need to save every time step of the simulation.)
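
For example, with DifferentialEquations.jl the saveat keyword controls what ends up in the saved solution without changing the internal steps the solver takes (toy ODE, just to show the call):

```julia
using DifferentialEquations

f(u, p, t) = -u
prob = ODEProblem(f, 1.0, (0.0, 10.0))

# The solver still takes whatever internal steps it needs for accuracy,
# but only the times requested via `saveat` are stored in sol.t / sol.u.
sol = solve(prob, Tsit5(); saveat = 0.01)
```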

1 Like

Putting aside the technical challenge here, what’s the goal for visualizing your data? Inspect the output for correctness? Get a feel for the patterns in the data? Prepare a publication? As you mention you don’t need interactivity I would assume it is not so much data exploration that you’re interested in, which makes me wonder why you would want to plot all of the data, given its size.

Please check this GR solution, using shade() to plot a large 1D time series.

Or this other GR example using shadepoints(). By using Float32, more than 1 billion points can be processed in seconds.
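
A rough sketch of that kind of call (random-walk data standing in for an orbit; the two-vector call signature is an assumption here, and keyword options such as dims/xform are documented in the GR examples):

```julia
using GR

# 1e8 points of a Float32 random walk, density-shaded instead of drawn one by one.
# Assumes shadepoints(x, y) accepts plain coordinate vectors (see the GR examples).
n = 10^8
x = cumsum(randn(Float32, n))
y = cumsum(randn(Float32, n))
shadepoints(x, y)
```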

4 Likes

Can you show a small example of the plot you want to improve?
Do you plot trajectories (i.e. lines) or would a histogram (1D / 2D) make sense instead?
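
If a 2D histogram does make sense, something along these lines (a sketch using StatsBase and Plots, with a random walk standing in for the orbit) keeps the plotting cost proportional to the number of bins rather than the number of samples:

```julia
using StatsBase, Plots

# stand-in data for the orbit coordinates
x = cumsum(randn(Float32, 10^7))
y = cumsum(randn(Float32, 10^7))

# bin the (x, y) samples; only the bin counts get plotted
h = fit(Histogram, (x, y); nbins = 400)
heatmap(midpoints(h.edges[1]), midpoints(h.edges[2]), log1p.(h.weights)';
        colorbar_title = "log(1 + counts)")
```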

1 Like

Perhaps this helps:

Online algorithm - Wikipedia
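
The idea being that the data is reduced in a single pass without ever holding the full series in memory, e.g. (sketch):

```julia
# Online (single-pass) reduction: running mean and extrema with O(1) memory
function running_stats(itr)
    n = 0
    μ = 0.0
    lo, hi = Inf, -Inf
    for v in itr
        n += 1
        μ += (v - μ) / n          # Welford-style running mean
        lo = min(lo, v)
        hi = max(hi, v)
    end
    return (n = n, mean = μ, min = lo, max = hi)
end

running_stats(sin(t) for t in range(0, 1000; length = 10^7))
```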

1 Like

You probably want to downsample your data before plotting.

The thesis “Downsampling Time Series for Visual Representation” by Sveinn Steinarsson investigates different visually pleasing subsampling strategies (see here).
If I remember correctly, the “Largest Triangle Three Buckets” (LTTB) algorithm is recommended.

Unfortunately, I could only find implementations in Java and Python.

See @stevengj’s reply below.
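
The algorithm itself is short enough to sketch directly in Julia, though; here is a rough, untested translation of the bucket/largest-triangle rule described in the thesis:

```julia
# Rough sketch of Largest-Triangle-Three-Buckets (LTTB): keep the first and
# last points, split the rest into nout-2 buckets, and from each bucket keep
# the point forming the largest triangle with the previously kept point and
# the average of the next bucket.
function lttb(x::AbstractVector, y::AbstractVector, nout::Integer)
    n = length(x)
    (nout >= n || nout < 3) && return collect(x), collect(y)

    xs, ys = similar(x, nout), similar(y, nout)
    xs[1], ys[1] = x[1], y[1]
    xs[end], ys[end] = x[end], y[end]

    every = (n - 2) / (nout - 2)          # interior points per bucket
    a = 1                                 # index of the last kept point
    for i in 1:nout-2
        lo = floor(Int, (i - 1) * every) + 2
        hi = min(floor(Int, i * every) + 1, n - 1)
        # average of the next bucket (collapses to the last point at the end)
        nlo, nhi = hi + 1, min(floor(Int, (i + 1) * every) + 1, n)
        avgx = sum(@view x[nlo:nhi]) / (nhi - nlo + 1)
        avgy = sum(@view y[nlo:nhi]) / (nhi - nlo + 1)
        best, bestarea = lo, -1.0
        for j in lo:hi
            # twice the triangle area; the constant factor is irrelevant for argmax
            area = abs((x[a] - avgx) * (y[j] - y[a]) - (x[a] - x[j]) * (avgy - y[a]))
            if area > bestarea
                bestarea, best = area, j
            end
        end
        xs[i+1], ys[i+1] = x[best], y[best]
        a = best
    end
    return xs, ys
end
```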

1 Like

There are a lot of different algorithms for downsampling huge timeseries datasets for visualization. See this post, for example (including some code).

If the downsampling algorithm is local (as is the case in the linked example above), you can process the dataset in chunks if the whole thing doesn’t fit into memory at once.
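
As one concrete example of such a local scheme, a per-bucket min/max envelope can be computed chunk by chunk and the (tiny) results concatenated, so the full series never has to be resident in memory (sketch; the bucket size and the I/O layer are up to you):

```julia
# Per-bucket min/max envelope: reduces each bucket of `bucket` samples to two
# values, so the plotted point count drops by a factor of ~bucket/2.
function minmax_envelope(y::AbstractVector, bucket::Integer)
    nb = cld(length(y), bucket)
    lo = Vector{eltype(y)}(undef, nb)
    hi = Vector{eltype(y)}(undef, nb)
    for b in 1:nb
        r = (b - 1) * bucket + 1 : min(b * bucket, length(y))
        lo[b], hi[b] = extrema(view(y, r))
    end
    return lo, hi
end

# e.g. apply it to each chunk streamed from disk, then vcat the small results:
# lo, hi = minmax_envelope(chunk, 10_000)
```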

3 Likes

Yep, I saw that post when it arrived and it is on my todo list to adopt something like that in GMT.jl. But the user mentioned orbits, which makes me think the problem might depend on x,y coordinates. What I had in mind is to use the GMT module blockmean to compute a grid with some statistics of the orbits, which could later be easily and quickly turned into a plot.
I’m not sure right now (I would need to check the C code), but I think blockmean does record-by-record reading, so if the data is in a disk file it will take a while, but RAM should not be a problem (if it is, the procedure would have to be cut into chunks).

1 Like

I can’t sample less frequently because my orbit is spiraling inwards. It becomes smaller and smaller, so if I sample less frequently I get very few data points per orbit, which can’t really map the orbit accurately.
I am aware of the memory problem, so I have rented a virtual machine with lots of memory to run this code.

That’s why downsampling strategies often need to be adaptive, so they sample more frequently when the data is changing more.
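
A very crude version of that idea (sketch): only keep a sample once it has moved more than some tolerance away from the last kept sample, so the tightly wound part of the inspiral automatically gets proportionally more points.

```julia
# Keep indices of samples that have moved more than `tol` (in x-y distance)
# since the last kept sample; fast-changing regions keep more points.
function thin(x, y, tol)
    keep = [1]
    xk, yk = x[1], y[1]
    for i in 2:length(x)
        if hypot(x[i] - xk, y[i] - yk) > tol
            push!(keep, i)
            xk, yk = x[i], y[i]
        end
    end
    keep[end] == length(x) || push!(keep, length(x))   # always keep the endpoint
    return keep
end
```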

2 Likes

It is for a publication. I don’t want to plot all of the data; I just can’t find a solution that helps me preserve the signal and create a publication-quality plot. (I also want to overlay other signals on the same plot.)

For example, here are two recent adaptive downsampling algorithms; see also the references therein:

Not sure what free implementations are out there of this kind of technique, though; I couldn’t find anything even in Python, but maybe I was searching with the wrong keywords. (e.g. Pandas.resample uses fixed bin sizes.)

1 Like

If it is permissible, post a short excerpt from the data file[s] you want to show, along with any characterizing info you know or can guess (e.g. the largest, smallest, and mean values in each data vector). It may help others to give more specific suggestions if you posted even a very coarse plot (one point per million), or better yet, if there are other plots with a similar look that you could link.

1 Like


This is the plot for a part of the orbit. The full plot will be very similar to it; the yellow curve will just span a bit more of the x-axis.

Is this correct?

And with your data, each of the two “curves” is given as a sequence of (x, y) coordinate pairs. Each “curve” may be determined by, say, 50 billion pairs, and the pairs are stored in some sensible sequence, so the (x, y) pair at index i1 bears some physical contiguity with its predecessor and successor pairs.

1 Like

Yes. The blue curve doesn’t have billions of data points; it’s just the noise curve for comparison. The yellow curve is theoretically calculated with a lot of data points, but it is just a simple x, y line plot. The GR.shade example suggested by @rafael.guerra works well for this, but it just doesn’t produce a publication-quality plot.