How to plot ~10 billion datapoint time series efficiently?

I am plotting a simple x,y line plot. See the partial plot I posted in reply to the other post

So, keeping it simple. For example:

julia> using Plots

julia> x = 0:1e-6:2π ; y = sin.(x); # ~2π million points

julia> plot(layout=(2,1))

julia> plot!(x,y,linewidth=2,subplot=1)

julia> sample = 1:1000:length(x);

julia> plot!(x[sample],y[sample],linewidth=2,subplot=2)

Gives you two essentially identical plots, but the second one has 1/1000 of the points:

[screenshot: the two subplots, full data vs. 1/1000 sample, look identical]

You may need a smarter sampling strategy if some part of the plot doesn't look like it should, but even that can be done "by hand".

1 Like

Yes, this is the first thing I tried, but it doesn't preserve the shape of the data. Do you have any suggestions for a smarter sampling strategy?

I was thinking that, if you are OK with that syntax, you could simply concatenate by hand different sampling frequencies tuned to the data variation in each region. That would be reasonable, of course, only because it seems you are doing this for a single figure. For a more automated and sophisticated solution, we should implement a smarter sampling method, as suggested above.
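For example, something along these lines (just a sketch: the split point and the two step sizes are invented here, and you would pick them by eye from where your data varies fastest):

using Plots

x = 0:1e-6:2π
y = sin.(x)

# denser sampling where the curve changes quickly, coarser elsewhere;
# the boundary at 3_000_000 and the two steps are arbitrary for illustration
idx = vcat(1:100:3_000_000, 3_000_001:10_000:length(x))

plot(x[idx], y[idx], linewidth=2)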

2 Likes

If your plot uses a log scale, your sampling indices should likewise be log-spaced, using something like this:

julia> function logsample(x, y, N)
           length(x) == length(y) || throw(ArgumentError("Mismatched input sizes"))
           idx = unique([round(Int, exp10(v)) for v in LinRange(0, log10(length(x)), N)])
           x[idx], y[idx]
       end
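A minimal usage sketch (assuming Plots is loaded as in the example above; the data here is just a stand-in for something actually sampled over several decades):

julia> x = exp10.(LinRange(0, 6, 1_000_000)); y = @. sin(log10(x));

julia> xs, ys = logsample(x, y, 500);

julia> plot(xs, ys, xscale=:log10)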
1 Like

I recall people being satisfied with the publication-quality graphs they constructed using PyPlot.jl. It wraps matplotlib, which is a well-known and highly developed Python plotting package.

I don't understand why GR should not produce publication-quality plots. Could you provide an MWE?

I suspect it's because that plot uses a heatmap (with a very colorful color map), which doesn't reproduce well in print. For the OP's 2D line data, a (simplified) line plot would probably be best in terms of plot quality.

Yes, PyPlot is nice, but it can't plot 10 billion data points efficiently.

GR.shade produces a grainy curve since it is essentially a heatmap of the data.

GMT lets you plot as many billions of points as you want, but it is constrained by the same memory limits as the others, so you would have to do it in chunks and hope that Ghostscript is able to convert the huge resulting PostScript file.
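Roughly what the chunked approach could look like on the Julia side (only a sketch: the file name, sample count, spacing, and chunk size are invented, and I use Mmap plus Plots here rather than GMT, just to illustrate never holding more than one chunk in RAM):

using Mmap, Plots

n     = 10_000_000_000                  # hypothetical number of samples
dt    = 1e-9                            # hypothetical sample spacing
y     = Mmap.mmap(open("trajectory.bin"), Vector{Float64}, (n,))  # data stays on disk
chunk = 100_000_000

p = plot(legend=false)
for lo in 1:chunk:n
    hi  = min(lo + chunk - 1, n)
    idx = lo:1000:hi                    # crude per-chunk decimation
    plot!(p, idx .* dt, y[idx], color=:blue)  # only decimated points accumulate in the plot
end
display(p)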

However, I maintain that no one is able to see billions of points in a plot, so I don't understand this request.

1 Like

One question, then, is how long you are willing to wait for the plot to finish, and how often you need to repeat it. Or is the memory requirement the issue (since PyPlot can't stream the data, though I'm actually not sure about that)?

The goal is probably not to directly show all billions of points, but to produce a faithful visual representation of the underlying data that shows the overall trajectory in a certain amount of detail. And the challenge here is that the underlying data isn't straightforward to subsample/decimate without losing the overall shape of the curve. Or at least, it takes a specific type of decimation algorithm, like one of the algorithms/papers mentioned earlier (or something like CGAL 5.6 - 2D Polyline Simplification: User Manual). And plotting as a heatmap doesn't make much sense for a trajectory.
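One cheap, shape-preserving option (not one of the algorithms referenced above, just the common min/max-per-bucket trick) is to keep the minimum and maximum of every bucket of consecutive samples, so narrow spikes survive the decimation. A minimal sketch:

# keep the min and max of every `width` consecutive samples, in x-order,
# so narrow spikes are not lost the way they are with plain striding
function minmax_decimate(x, y, width)
    xs, ys = eltype(x)[], eltype(y)[]
    for lo in 1:width:length(y)
        hi  = min(lo + width - 1, length(y))
        seg = view(y, lo:hi)
        i1, i2 = minmax(lo - 1 + argmin(seg), lo - 1 + argmax(seg))
        push!(xs, x[i1]); push!(ys, y[i1])
        i2 != i1 && (push!(xs, x[i2]); push!(ys, y[i2]))
    end
    xs, ys
end

xs, ys = minmax_decimate(x, y, 1_000)   # x, y: your full-resolution vectors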

So brute-force plotting of the 2D line segments can still be the best approach, as it is simple (if somewhat time-consuming) and does not introduce the smoothing artifacts that decimation most likely would.

Also, I somewhat feel that running a simulation at such a high resolution, saving that data, and then not looking at the evolution at its full resolution is a waste. Usually only a subset of simulation timesteps is saved (mostly due to storage restrictions), which provides enough insight into the system behaviour, and then the OP's problem would simply not exist. But apparently there's a reason here to save such a high number of time steps and visualize them.


3 Likes

Sure, but the point is that they can't all be visualized at the same time.

That depends on your definition of "at the same time". If you take those billions of points and represent them with a heatmap (which would involve mapping/plotting them all), then the plot still shows them "all at the same time" :wink: And making a heatmap of a subset of the data will most likely lead to a different plot output (depending on the plot setup). But I agree that for the goal of providing an overview of the data it's not very useful to directly plot all those points.

I've seen examples of N-body simulations with billions of mass particles where an interactive viewer was built to visualize the high-level structure while still being able to zoom down to the individual-particle level. Yet such exploratory visualization doesn't seem to be the goal here.

Wow, that must be a daunting job. I can only admire and envy someone who has a tool like this.

I would try something like this: basically, I go over the vectors and add indices for plotting only when the angle changes significantly, so that points which keep following the same direction are skipped. You might want to add a minimum step to make sure you reduce enough, and also do something to deal with the log scale.

function downsample(x, y, tol) 
    idx = [1]
    last_idx = 1
    last_angle = 0
    for i in 2:length(y) #don't use OffsetArray ;)
        δy = y[i] - y[i-1]
        δx = x[i] - x[i-1]
        angle = atan(δy, δx)
        if  abs(last_angle - angle) > tol 
            push!(idx, i)
            last_idx = i
            last_angle = angle
        end
    end
    idx
end

using Gadfly   # Geom.line / Geom.point in the plot call below are Gadfly syntax

tol = 0.1
x = LinRange(0, 2π, 10_000)
y = @. sin(x) + cos(2x)^4

idx = downsample(x, y, tol)

@show length(idx)
plot(x = x[idx], y = y[idx], Geom.line, Geom.point)

4 Likes

That is a nice idea. From quick test below, using the derivative instead of atan may provide better coverage for the same level of data reduction?

Code

function downsample(x, y, tol) 
    idx = [1]
    last_idx = 1
    last_angle = 0
    for i in 2:length(y) #don't use OffsetArray ;)
        δy = y[i] - y[i-1]
        δx = x[i] - x[i-1]
        angle = atan(δy, δx)
        if  abs(last_angle - angle) > tol 
            push!(idx, i)
            last_idx = i
            last_angle = angle
        end
    end
    idx
end

function downsample2(x, y, tol) 
    idx = [1]
    last_idx = 1
    last_derivative = 0
    for i in 2:length(y) #don't use OffsetArray ;)
        δy = y[i] - y[i-1]
        δx = x[i] - x[i-1]
        deriv = δx == 0 ? Inf : δy/δx
        if  abs(last_derivative - deriv) > tol 
            push!(idx, i)
            last_idx = i
            last_derivative = deriv
        end
    end
    idx
end

tol = 0.1
x = LinRange(0, 2π, 10_000)
y = @. 2x * sin(x)^2 + 4x * cos(2x)^4

idx = downsample(x, y, 0.1)
idx2 = downsample2(x, y, 2.6)

using Plots
p1 = plot(x,y, c=:blue)
plot!(x[idx], y[idx], marker=:o, ms=2, ls=:dash, c=:red, title="atan angle ($(length(idx)) points)")
p2 = plot(x,y, c=:blue)
plot!(x[idx2], y[idx2], marker=:o, ms=2, ls=:dash, c=:green, title="derivative ($(length(idx2)) points)")
plot(p1,p2, size=(1000,500))

In any case, for examples like this, wouldn't FFT-based data compression be more common?
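For reference, a toy sketch of that kind of FFT-based compression (using FFTW; keeping the 1% largest coefficients is an arbitrary choice):

using FFTW

x = LinRange(0, 2π, 10_000)
y = @. 2x * sin(x)^2 + 4x * cos(2x)^4

Y      = rfft(y)
keep   = ceil(Int, 0.01 * length(Y))          # retain the 1% largest coefficients
thresh = partialsort(abs.(Y), keep; rev=true)
Y[abs.(Y) .< thresh] .= 0                     # sparse spectrum = the "compressed" data
y_approx = irfft(Y, length(y))                # reconstruct for plotting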

7 Likes

This blog post sounds relevant: Drawing 2.7 billion Points in 10s | by Simon Danisch | HackerNoon.com | Medium

3 Likes

Guys, those Douglas-Peucker guys (sorry Ramer, I didn't know about you) have worked on this:

using GMT

x = LinRange(0, 2π, 10_000);
y = @. 2x * sin(x)^2 + 4x * cos(2x)^4;
xy = gmtsimplify([x y], tol=0.005);
imshow(xy, marker=:point, lw=0.5, figsize=(12,12), title="Douglas-Peucker ($(size(xy,1)) points)")

5 Likes

Very ingenious, but since the image has 600 x 960 pixels, continents cover about 1/3 of the Earth's surface, and more than half of that has no points, it turns out that of those 2.7 billion points we are only seeing about 600 * 960 / 3 / 2 ≈ 100_000 of them.