How to plot ~10 billion datapoint time series efficiently?

I am plotting a simple x,y line plot. See the partial plot I posted in reply to the other post

So, keeping it simple. For example:

julia> using Plots

julia> x = 0:1e-6:2π ; y = sin.(x); # ~2π million points

julia> plot(layout=(2,1))

julia> plot!(x,y,linewidth=2,subplot=1)

julia> sample = 1:1000:length(x);

julia> plot!(x[sample],y[sample],linewidth=2,subplot=2)

Gives you two essentially identical plots, but the second one has 1/1000 of the points:

[screenshot: the two subplots, full data vs. 1/1000 sample, look identical]

You may need a smarter sampling strategy if some part of the plot doesn't look like it should, but even that can be done "by hand".

1 Like

Yes, this is the first thing I tried, but it doesn't preserve the shape of the data. Do you have any suggestions for a smarter sampling strategy?

I was thinking that, if you are OK with that syntax, you could simply concatenate by hand different sampling frequencies tuned to the data variation in each region. That would be reasonable, of course, only because it seems you are doing this for a single figure. For a more automated and sophisticated solution, we should implement a smarter sampling method, as suggested above.
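For example, something along these lines (just a sketch: the split point and the two step sizes are invented here, and you would pick them by eye from where your data varies fastest):

using Plots

x = 0:1e-6:2π
y = sin.(x)

# denser sampling where the curve changes quickly, coarser elsewhere;
# the boundary at 3_000_000 and the two steps are arbitrary for illustration
idx = vcat(1:100:3_000_000, 3_000_001:10_000:length(x))

plot(x[idx], y[idx], linewidth=2)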

2 Likes

If your plot uses a log scale, your sampling indices should likewise be log-spaced, using something like this:

julia> function logsample(x, y, N)
           length(x) == length(y) || throw(ArgumentError("Mismatched input sizes"))
           idx = unique([round(Int, exp10(v)) for v in LinRange(0, log10(length(x)), N)])
           x[idx], y[idx]
       end
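A minimal usage sketch (assuming Plots is loaded as in the example above; the data here is just a stand-in for something actually sampled over several decades):

julia> x = exp10.(LinRange(0, 6, 1_000_000)); y = @. sin(log10(x));

julia> xs, ys = logsample(x, y, 500);

julia> plot(xs, ys, xscale=:log10)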
1 Like

I recall people being satisfied with the publication-quality graphs they constructed using PyPlot.jl. It wraps matplotlib, which is a well-known and highly developed Python plotting package.

I don't understand why GR should not produce publication-quality plots. Could you provide an MWE?

I suspect it's because that plot uses a heatmap (with a very colorful color map), which doesn't reproduce well in print. For the OP's 2D line data, a (simplified) line plot would probably be best in terms of plot quality.

Yes, PyPlot is nice, but it can't plot 10 billion data points efficiently.

GR.shade produces a grainy curve since it is essentially a heatmap of the data.

GMT lets you plot as many billions of points as you want, but it is constrained by the same memory limits as the others, so you would have to do it in chunks and hope that Ghostscript is able to convert the huge resulting PostScript file.
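Roughly what the chunked approach could look like on the Julia side (only a sketch: the file name, sample count, spacing, and chunk size are invented, and I use Mmap plus Plots here rather than GMT, just to illustrate never holding more than one chunk in RAM):

using Mmap, Plots

n     = 10_000_000_000                  # hypothetical number of samples
dt    = 1e-9                            # hypothetical sample spacing
y     = Mmap.mmap(open("trajectory.bin"), Vector{Float64}, (n,))  # data stays on disk
chunk = 100_000_000

p = plot(legend=false)
for lo in 1:chunk:n
    hi  = min(lo + chunk - 1, n)
    idx = lo:1000:hi                    # crude per-chunk decimation
    plot!(p, idx .* dt, y[idx], color=:blue)  # only decimated points accumulate in the plot
end
display(p)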

However, I maintain that no one is able to see billions of points in a plot, so I don't understand this request.

1 Like

One question, then, is how long you are willing to wait for the plot to finish, and how often you need to repeat it. Or is the memory requirement the issue (since PyPlot can't stream the data, though I'm actually not sure about that)?

The goal is probably not to directly show all billions of points, but to produce a faithful visual representation of the underlying data that shows the overall trajectory in a certain amount of detail. And the challenge here is that the underlying data isn't straightforward to subsample/decimate without losing the overall shape of the curve. Or at least, it takes a specific type of decimation algorithm, like one of the algorithms/papers mentioned earlier (or something like CGAL 5.6 - 2D Polyline Simplification: User Manual). And plotting as a heatmap doesn't make much sense for a trajectory.
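One cheap, shape-preserving option (not one of the algorithms referenced above, just the common min/max-per-bucket trick) is to keep the minimum and maximum of every bucket of consecutive samples, so narrow spikes survive the decimation. A minimal sketch:

# keep the min and max of every `width` consecutive samples, in x-order,
# so narrow spikes are not lost the way they are with plain striding
function minmax_decimate(x, y, width)
    xs, ys = eltype(x)[], eltype(y)[]
    for lo in 1:width:length(y)
        hi  = min(lo + width - 1, length(y))
        seg = view(y, lo:hi)
        i1, i2 = minmax(lo - 1 + argmin(seg), lo - 1 + argmax(seg))
        push!(xs, x[i1]); push!(ys, y[i1])
        i2 != i1 && (push!(xs, x[i2]); push!(ys, y[i2]))
    end
    xs, ys
end

xs, ys = minmax_decimate(x, y, 1_000)   # x, y: your full-resolution vectors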

So brute-force plotting of the 2D line segments can still be the best approach, as it is simple (if somewhat time-consuming) and does not introduce the smoothing artifacts that decimation most likely would.

Also, I somewhat feel that running a simulation at such a high resolution, saving that data, and then not looking at the evolution at its full resolution is a waste. Usually only a subset of simulation timesteps is saved (mostly due to storage restrictions), which provides enough insight into the system behaviour, and then the OP's problem would simply not exist. But apparently there's a reason here to save such a high number of time steps and visualize them.


3 Likes

Sure, but the point is that they can't all be visualized at the same time.

That depends on your definition of "at the same time". If you take those billions of points and represent them with a heatmap (which would involve mapping/plotting them all), then the plot still shows them "all at the same time" :wink: And making a heatmap of a subset of the data will most likely lead to a different plot output (depending on the plot setup). But I agree that for the goal of providing an overview of the data it's not very useful to directly plot all those points.

I've seen examples of N-body simulations with billions of mass particles where an interactive viewer was built to visualize the high-level structure while still being able to zoom down to the individual-particle level. Yet such exploratory visualization doesn't seem to be the goal here.

Wow, that must be a daunting job. I can only admire and envy someone who has a tool like this.

I would try something like this: basically, I go over the vectors and add indices for plotting only when the angle changes significantly, so that points which keep following the same direction are skipped. You might want to add a minimum step to make sure you reduce enough, and also do something to deal with the log scale.

function downsample(x, y, tol) 
    idx = [1]
    last_idx = 1
    last_angle = 0
    for i in 2:length(y) #don't use OffsetArray ;)
        δy = y[i] - y[i-1]
        δx = x[i] - x[i-1]
        angle = atan(δy, δx)
        if  abs(last_angle - angle) > tol 
            push!(idx, i)
            last_idx = i
            last_angle = angle
        end
    end
    idx
end

using Gadfly   # Geom.line / Geom.point in the plot call below are Gadfly syntax

tol = 0.1
x = LinRange(0, 2π, 10_000)
y = @. sin(x) + cos(2x)^4

idx = downsample(x, y, tol)

@show length(idx)
plot(x = x[idx], y = y[idx], Geom.line, Geom.point)

4 Likes

That is a nice idea. From quick test below, using the derivative instead of atan may provide better coverage for the same level of data reduction?

Code

function downsample(x, y, tol) 
    idx = [1]
    last_idx = 1
    last_angle = 0
    for i in 2:length(y) #don't use OffsetArray ;)
        δy = y[i] - y[i-1]
        δx = x[i] - x[i-1]
        angle = atan(δy, δx)
        if  abs(last_angle - angle) > tol 
            push!(idx, i)
            last_idx = i
            last_angle = angle
        end
    end
    idx
end

function downsample2(x, y, tol) 
    idx = [1]
    last_idx = 1
    last_derivative = 0
    for i in 2:length(y) #don't use OffsetArray ;)
        δy = y[i] - y[i-1]
        δx = x[i] - x[i-1]
        deriv = δx == 0 ? Inf : δy/δx
        if  abs(last_derivative - deriv) > tol 
            push!(idx, i)
            last_idx = i
            last_derivative = deriv
        end
    end
    idx
end

tol = 0.1
x = LinRange(0, 2π, 10_000)
y = @. 2x * sin(x)^2 + 4x * cos(2x)^4

idx = downsample(x, y, 0.1)
idx2 = downsample2(x, y, 2.6)

using Plots
p1 = plot(x,y, c=:blue)
plot!(x[idx], y[idx], marker=:o, ms=2, ls=:dash, c=:red, title="atan angle ($(length(idx)) points)")
p2 = plot(x,y, c=:blue)
plot!(x[idx2], y[idx2], marker=:o, ms=2, ls=:dash, c=:green, title="derivative ($(length(idx2)) points)")
plot(p1,p2, size=(1000,500))

In any case, for examples like this, wouldn't FFT-based data compression be more common?
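For reference, a toy sketch of that kind of FFT-based compression (using FFTW; keeping the 1% largest coefficients is an arbitrary choice):

using FFTW

x = LinRange(0, 2π, 10_000)
y = @. 2x * sin(x)^2 + 4x * cos(2x)^4

Y      = rfft(y)
keep   = ceil(Int, 0.01 * length(Y))          # retain the 1% largest coefficients
thresh = partialsort(abs.(Y), keep; rev=true)
Y[abs.(Y) .< thresh] .= 0                     # sparse spectrum = the "compressed" data
y_approx = irfft(Y, length(y))                # reconstruct for plotting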

7 Likes

This blog post sounds relevant: Drawing 2.7 billion Points in 10s | by Simon Danisch | HackerNoon.com | Medium

3 Likes

Guys, those Douglas-Peucker guys (sorry Ramer, I didn't know about you) have worked on this:

using GMT

x = LinRange(0, 2π, 10_000);
y = @. 2x * sin(x)^2 + 4x * cos(2x)^4;
xy = gmtsimplify([x y], tol=0.005);
imshow(xy, marker=:point, lw=0.5, figsize=(12,12), title="Douglas-Peucker ($(size(xy,1)) points)")

5 Likes

Very ingenious, but since the image has 600 x 960 pixels, continents cover about 1/3 of the Earth's surface, and more than half of that has no points, it turns out that of those 2.7 billion points we are only seeing about 600 * 960 / 3 / 2 ≈ 100_000 of them.