Gadfly scatter plot crashing

Hello, I have a data frame with 2 columns, each with 2 million points. When I try to plot it, Jupyter crashes. I am surprised because R and Python can handle it.

Here is my command

df2=CSV.read("/home/alessandro/Data/MAexperiment/24-07-2023_Deepvariant_standing_variation_analysis_based_on_HiFi/RefCall_all_GQ_VAF", DataFrame; header = false)

rename!(df2,[:RefCall,:VAF,:GQ])
t=Gadfly.plot(df2, x="GQ", y="VAF", Geom.point())

Is there something I can do? I feel like it should be able to handle it no? Or am I delusional?

Thanks

EDIT: I can plot the data using

using Plots
gr()
@df df2 Plots.scatter(:GQ, :VAF)

but this causes massive performance issues. I therefore suspect there is something wrong with Jupyter or the Brave browser.

EDIT 2: with Python it takes 576 ms to plot

EDIT 3: this works in a .jl script, but it’s very slow: it takes around 1 minute. I am aware of time to first plot, but there is no improvement over time; it still takes seconds to plot.

using DataFrames
using CSV

df4=CSV.read("/home/alessandro/Data/MAexperiment/24-07-2023_Deepvariant_standing_variation_analysis_based_on_HiFi/RefCall_all_GQ_VAF", DataFrame; header = false)
rename!(df4,[:RefCall,:VAF,:GQ])

using Gadfly

t=Gadfly.plot(df4, x="GQ", y="VAF",Geom.point, Theme(background_color=color("white"),grid_color=color("white")))

using Cairo
using Fontconfig

t|> PNG("density.png",30cm,25cm)

EDIT 4: it also crashes Pluto… So far only VisualStudio is able to render my plot, though very slowly. I don’t believe it’s my computer’s fault; here is my hardware:
product: Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
64GiB System Memory

My issue is reminiscent of this one: Why is Julia's graphics system so slow? - #32 by evan-wehi

I am not good enough at computing to understand all the implications. Can you tell me whether what’s happening is actually normal behaviour and whether I shouldn’t use Julia for data exploration? I am really puzzled.

Thanks and sorry for the many edits, I kept doing research.

(disclaimer: not a plotting expert)

I suspect Gadfly might not be the right plotting package if you need to plot millions of points. Gadfly produces vector images, and in the case of a scatter plot the number of elements to render is linear in the number of points you’re plotting. Contrast with a raster based plotting package, where the plot image size doesn’t depend on the number of points in your scatter plot. Of course the image buffer has to be updated for each point you’re plotting, but the image size itself, the number of pixels to be drawn, etc., don’t grow with more points.
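To make the raster idea concrete, here is a minimal sketch in plain Julia (no plotting package). The function name `bin_counts` and the 200-by-200 grid are hypothetical choices of mine, and the random data just stands in for the real columns. It assumes each column has at least two distinct values. The point is that the output size is fixed by the grid, not by the number of points:

```julia
# Bin n points into a fixed nx-by-ny grid of counts.
# Hypothetical helper for illustration; assumes x and y are not constant.
function bin_counts(x, y; nx=200, ny=200)
    xmin, xmax = extrema(x)
    ymin, ymax = extrema(y)
    counts = zeros(Int, ny, nx)
    for (xi, yi) in zip(x, y)
        # Map each coordinate to a 1-based bin index; the maximum value
        # is clamped into the last bin.
        i = clamp(floor(Int, (xi - xmin) / (xmax - xmin) * nx) + 1, 1, nx)
        j = clamp(floor(Int, (yi - ymin) / (ymax - ymin) * ny) + 1, 1, ny)
        counts[j, i] += 1
    end
    return counts
end

# Synthetic stand-in data (2 million points per column).
x = rand(2_000_000)
y = rand(2_000_000)

counts = bin_counts(x, y)
size(counts)  # (200, 200), however many points went in
```

A vector backend would instead emit one drawing element per point, so 2 million points means 2 million elements to serialize and render.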

I think Julia is a great tool for data analysis! But perhaps try out a non-vector plotting package if you need to plot millions of points in a scatter plot. (Side note: why do you need to plot millions of points in a single scatter plot? That’s a very crowded plot, the points will be drastically overlapping each other and you won’t see the density of points anyway unless you make each point nearly transparent, in which case you’re basically just generating a 2d density plot anyway?)

Thanks, you are correct that using a 2d histogram is much more interesting! (And it works much better: it’s rendered almost instantaneously.) I am switching to Plots, which seems much faster indeed. If you have any other recommendations, don’t hesitate.
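For anyone landing here later, a 2d histogram in Plots can be sketched roughly like this. The synthetic data, the bin count of 100, and the output file name are my assumptions, not the exact command used above; swap in `df2.GQ` and `df2.VAF` for real data:

```julia
using Plots
gr()

# Synthetic stand-in for the GQ and VAF columns (assumption, not the real data).
gq  = 60 .* rand(2_000_000)
vaf = rand(2_000_000)

# Bin the points into a grid and colour each cell by its count.
p = histogram2d(gq, vaf; nbins=100, xlabel="GQ", ylabel="VAF")

# File name is arbitrary.
savefig(p, "gq_vaf_hist2d.png")
```

The backend then only has to draw the binned cells rather than 2 million markers.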

Glad that’s working for you!

Also just for curiosity I tried plotting 1 million points with Python+Pandas and found that it took only ~100ms for the plot call to return in the repl—but when I tried to view and manipulate the interactive plot pane, it froze and I got an OverflowError! I think plotting millions of points is just hard :slight_smile:

Indeed
I work in genomics and we have millions, or even billions, of data points to deal with, in case you wonder where that many points come from. They are just data from genomes, and genomes are, well, huge. Or in fact very small … it depends on how you see it, since your whole genome can be stored on a simple USB stick.

I have never made up my mind around this “are our genomes small or big things?” :smiley:

did you try a 2D histogram in Gadfly? that should be fast too. i’d also suggest trying out Makie too.
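A rough sketch of what that could look like in Gadfly, using `Geom.histogram2d`. The synthetic data frame, the bin counts, and the output file name are assumptions for illustration only:

```julia
using Gadfly, DataFrames
import Cairo, Fontconfig   # needed for Gadfly's PNG backend

# Synthetic stand-in for the real data frame (assumption).
df = DataFrame(GQ = 60 .* rand(2_000_000), VAF = rand(2_000_000))

# Bin into a 100-by-100 grid instead of drawing one marker per row.
p = Gadfly.plot(df, x=:GQ, y=:VAF,
                Geom.histogram2d(xbincount=100, ybincount=100))

# File name is arbitrary.
p |> PNG("gq_vaf_hist2d_gadfly.png", 30cm, 25cm)
```

Since the output contains at most 100 × 100 rectangles, the vector backend has far fewer elements to draw than with `Geom.point`.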

yes, the 2d histogram is much faster. i will check out Makie

Maybe try out:

Oh that seems VERY interesting

Hem, how do I display the plot when in the REPL? (I gave up on Jupyter: I find the integration very clunky, and I spend more time arranging the cells and writing pretty paragraphs in Markdown than actually analyzing the data.)

For example, minimal example of Makie

f = Figure(backgroundcolor = :tomato)
Scene (800px, 600px):
0 Plots
0 Child Scenes

display(f) returns the same

which Makie backend are you using?

GLMakie
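For reference, a minimal sketch of a scatter plot with GLMakie from the REPL. The synthetic data, marker size, and transparency are my assumptions; on a working install with a display available, `display(fig)` would normally open an interactive window:

```julia
using GLMakie
GLMakie.activate!()   # make GLMakie the active Makie backend

# Synthetic stand-in data (assumption, not the real columns).
x = 60 .* rand(2_000_000)
y = rand(2_000_000)

fig = Figure()
ax = Axis(fig[1, 1]; xlabel = "GQ", ylabel = "VAF")

# Small, mostly transparent markers so dense regions remain readable.
scatter!(ax, x, y; markersize = 2, color = (:black, 0.05))

display(fig)          # opens a GLMakie window when a display is available
```

If only a text summary is printed and no window appears, the issue is more likely in the environment than in the plotting code.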

Hello
Okay, it seems my conda and Jupyter were broken; it works like a charm now (no idea how or what I did to break them).