Gadfly scatter plot crashing

Axze-rgb · July 25, 2023, 9:55am

Hello, I have a data frame with 2 columns, each with 2 million points. When I try to plot it, Jupyter crashes. I am surprised because R and Python can handle it.

Here is my command

df2=CSV.read("/home/alessandro/Data/MAexperiment/24-07-2023_Deepvariant_standing_variation_analysis_based_on_HiFi/RefCall_all_GQ_VAF", DataFrame; header = false)

rename!(df2,[:RefCall,:VAF,:GQ])
t=Gadfly.plot(df2, x="GQ", y="VAF", Geom.point())

Is there something I can do? I feel like it should be able to handle it, no? Or am I delusional?

Thanks

EDIT: I can plot the data using

using Plots, StatsPlots  # the @df macro comes from StatsPlots
gr()
@df df2 Plots.scatter(:GQ, :VAF)

but this causes massive performance issues. I therefore suspect there is something wrong with Jupyter or the Brave browser.

EDIT 2: with Python it takes 576 ms to plot

EDIT 3: this works in a .jl script, but it’s very slow: it takes around 1 minute. I am aware of time to first plot, but there is no improvement on later calls; it still takes seconds to plot.

using DataFrames

using CSV

df4=CSV.read("/home/alessandro/Data/MAexperiment/24-07-2023_Deepvariant_standing_variation_analysis_based_on_HiFi/RefCall_all_GQ_VAF", DataFrame; header = false)

rename!(df4,[:RefCall,:VAF,:GQ])

using Gadfly

t=Gadfly.plot(df4, x="GQ", y="VAF",Geom.point, Theme(background_color=color("white"),grid_color=color("white")))

using Cairo

using Fontconfig

t|> PNG("density.png",30cm,25cm)

EDIT 4: it also crashes Pluto… So far only VisualStudio is able to render my plot, though very slowly. I don’t believe it’s my computer’s fault; here is my hardware:
product: Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
64GiB System Memory

My issue is reminiscent of this one: Why is Julia's graphics system so slow? - #32 by evan-wehi

I am not good enough at computing to understand all the implications. Can you tell me whether what’s happening is actually normal behaviour, and whether I shouldn’t use Julia for data exploration? I am really puzzled.

Thanks and sorry for the many edits, I kept doing research.

evanfields · July 25, 2023, 2:29pm

(disclaimer: not a plotting expert)

I suspect Gadfly might not be the right plotting package if you need to plot millions of points. Gadfly produces vector images, and in the case of a scatter plot the number of elements to render is linear in the number of points you’re plotting. Contrast this with a raster-based plotting package, where the plot image size doesn’t depend on the number of points in your scatter plot. Of course the image buffer has to be updated for each point you’re plotting, but the image size itself, the number of pixels to be drawn, etc., don’t grow with more points.

I think Julia is a great tool for data analysis! But perhaps try out a non-vector plotting package if you need to plot millions of points in a scatter plot. (Side note: why do you need to plot millions of points in a single scatter plot? That’s a very crowded plot, the points will be drastically overlapping each other and you won’t see the density of points anyway unless you make each point nearly transparent, in which case you’re basically just generating a 2d density plot anyway?)
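To make the raster argument concrete, here is a minimal sketch in plain Julia, assuming synthetic data in place of the real GQ/VAF columns. The function name `bin_counts` and its `nx`/`ny` keywords are made up for illustration and do not come from any plotting package. Each point only increments one cell of a fixed-size grid, so the output stays the same size however many points go in:

```julia
# Sketch: accumulate N points into a fixed-size grid of counts.
# `bin_counts`, `nx` and `ny` are hypothetical names, not a package API.
function bin_counts(x::AbstractVector{<:Real}, y::AbstractVector{<:Real};
                    nx::Int = 200, ny::Int = 200)
    length(x) == length(y) || throw(ArgumentError("x and y must have equal length"))
    counts = zeros(Int, ny, nx)
    isempty(x) && return counts
    xmin, xmax = extrema(x)
    ymin, ymax = extrema(y)
    # Guard against a zero-width range (all values identical).
    xspan = xmax > xmin ? xmax - xmin : one(xmax)
    yspan = ymax > ymin ? ymax - ymin : one(ymax)
    for (xi, yi) in zip(x, y)
        # Map each coordinate to a bin index in 1:nx (or 1:ny).
        i = clamp(floor(Int, (xi - xmin) / xspan * nx) + 1, 1, nx)
        j = clamp(floor(Int, (yi - ymin) / yspan * ny) + 1, 1, ny)
        counts[j, i] += 1
    end
    return counts
end

# Synthetic stand-in for the real GQ/VAF columns.
n = 2_000_000
gq  = rand(0:60, n)
vaf = rand(n)

counts = bin_counts(gq, vaf)
println(size(counts))   # grid size is fixed by nx and ny
println(sum(counts))    # every point landed in exactly one cell
```

The resulting matrix is essentially the 2D density mentioned above, and it can be handed to any heatmap function.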

Axze-rgb · July 25, 2023, 2:39pm

Thanks, you are correct that using a 2d histogram is much more interesting! (And it works much better; it’s rendered almost instantaneously.) I am switching to Plots, which seems much faster indeed. If you have any other recommendations, don’t hesitate.
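A minimal sketch of the 2d histogram approach with Plots, assuming the GR backend and synthetic data standing in for the GQ and VAF columns (the real file is not reproduced here). The bin counts and the output file name are arbitrary choices for illustration:

```julia
using Plots
gr()  # same backend as in the first post

# Synthetic stand-in for the real data (assumption, not the actual file).
n = 2_000_000
gq  = rand(0:60, n)
vaf = rand(n)

# Bin the points into a 60 x 100 grid and colour each cell by its count.
p = histogram2d(gq, vaf; bins = (60, 100), xlabel = "GQ", ylabel = "VAF")

savefig(p, "density_plots.png")
```

Only the binned cells are drawn, so the rendering cost depends on the number of bins and not on the 2 million input points.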

evanfields · July 25, 2023, 2:41pm

Glad that’s working for you!

Also, just out of curiosity, I tried plotting 1 million points with Python+Pandas and found that it took only ~100ms for the plot call to return in the REPL, but when I tried to view and manipulate the interactive plot pane, it froze and I got an OverflowError! I think plotting millions of points is just hard.

Axze-rgb · July 25, 2023, 2:44pm

Indeed
I work in genomics and we have millions, or even billions, of data points to deal with, in case you wonder where that many points come from. They are just data from genomes, and genomes are, well, huge. Or in fact very small … it depends on how you see it, since your whole genome can be stored on a simple USB stick.

I have never made up my mind around this “are our genomes small or big things?”

bjarthur · July 25, 2023, 3:19pm

did you try a 2D histogram in Gadfly? that should be fast too. i’d also suggest trying out Makie too.
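For reference, a minimal sketch of a 2D histogram in Gadfly using `Geom.histogram2d`, assuming synthetic data with the same column names as in the first post. The bin counts and the output file name are arbitrary choices for illustration:

```julia
using DataFrames
using Gadfly

# Synthetic stand-in for the real data (assumption, not the actual file).
n = 2_000_000
df = DataFrame(GQ = rand(0:60, n), VAF = rand(n))

# Geom.histogram2d draws one rectangle per bin, not one element per point.
p = Gadfly.plot(df, x = :GQ, y = :VAF,
                Geom.histogram2d(xbincount = 60, ybincount = 100))

# SVG output avoids the Cairo/Fontconfig dependency needed for PNG.
draw(SVG("density_gadfly.svg", 30cm, 25cm), p)
```

With at most 60 x 100 rectangles in the output, the vector image stays small even though the input has millions of rows.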

Axze-rgb · July 25, 2023, 3:23pm

yes, the 2d histogram is much faster. i will check out Makie

Rudi79 · July 25, 2023, 3:38pm

Maybe try out:

Axze-rgb · July 25, 2023, 5:13pm

Oh that seems VERY interesting

Axze-rgb · July 25, 2023, 8:40pm

Hmm, how do I display the plot when in the REPL? (I gave up on Jupyter; I find the integration very clunky and I spend more time arranging the cells and writing pretty paragraphs in Markdown than actually analyzing the data.)

For example, minimal example of Makie

f = Figure(backgroundcolor = :tomato)
Scene (800px, 600px):
0 Plots
0 Child Scenes

display(f) returns the same

bjarthur · July 25, 2023, 9:06pm

which Makie backend are you using?

Axze-rgb · July 25, 2023, 9:12pm

GLMakie

Axze-rgb · July 26, 2023, 2:21am

GLMakie
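For reference, a minimal sketch of showing a figure from the REPL, assuming the GLMakie backend and a working OpenGL setup. The data is synthetic, and the marker size and file name are arbitrary. The REPL still prints a text summary of the figure, but `display` should also open a window; saving to a file is a fallback when no window appears:

```julia
using GLMakie
GLMakie.activate!()  # make sure GLMakie is the active backend

# Synthetic stand-in for the real data (assumption, not the actual file).
n = 2_000_000
gq  = rand(0:60, n)
vaf = rand(n)

fig = Figure()
ax = Axis(fig[1, 1], xlabel = "GQ", ylabel = "VAF")
scatter!(ax, gq, vaf; markersize = 2)

display(fig)               # opens an interactive window with GLMakie
save("scatter.png", fig)   # fallback: write the figure to disk
```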

Axze-rgb · July 28, 2023, 6:46am

Hello
Okay, it seems my conda and Jupyter were broken; it works like a charm now (no idea how, or what I did that broke them).
