Hi,
I wonder if it would make sense (from the performance point of view) to be able to use GLMakie.jl with CuArrays from CUDA.jl?
Yes, that would be lovely and should be very efficient, but it isn’t currently easy.
The first step is to figure out if the API parts for OpenGL interop are wrapped in CUDA.jl (I think not, so we should open an issue), and then we need to port the samples from the CUDA Toolkit documentation (CUDA Samples :: CUDA Toolkit Documentation) to Julia + CUDA… Makie should mostly work with GLBuffers or GLTextures directly, so once they are created and hooked up with CUDA.jl, there shouldn’t be a big problem plotting them.
Actually they’re wrapped! Let me see if I can create a simple example. What do you actually want to plot? Sadly, surfaces, meshes and scatter will need quite different setups to share the resources with CUDA.
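(If you want to check for yourself that the interop entry points are exposed, something like the snippet below should do; the names are the raw CUDA driver-API bindings that come up later in this thread.)

using CUDA

# quick check that the OpenGL-interop parts of the CUDA driver API are available in CUDA.jl
@show isdefined(CUDA, :cuGraphicsGLRegisterBuffer)
@show isdefined(CUDA, :cuGraphicsMapResources)
@show isdefined(CUDA, :cuGraphicsResourceGetMappedPointer_v2)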
Hi Simon,
Here is an example which is unfortunately a bit too long for the 7-lines thread…
Anyway, it takes about 5 s on my machine to run and shows a rather convincing combination of CUDA.jl and Makie.jl.
In the plot (which can be a heatmap or a surface), I copy the CuArray zs to a standard Array czs…
using CUDA,GLMakie,AbstractPlotting
const T=Float32
const V=CuArray{T,2}

function go()
    CUDA.allowscalar(false)
    ls,ns,ms,λ,dt,sr,k,nt,two,eight,fr =1.e-2,800,1.e-4,0.1,2.e-7,0.1,5e7,5000,T(2),T(8),10
    dt2sm=dt^2/ms
    r,r0,r1,r2=1:ns,1:ns-2,2:ns-1,3:ns
    cxs = [T(j-1)*ls for i in r, j in r]
    dx(i,j) = cxs[i,j]-cxs[ns÷2,ns÷2]
    cxc = [cxs[i,j]+sr*(dx(i,j))*exp(-0.5(dx(i,j)^2+dx(j,i)^2)/λ^2)/λ for i in r, j in r]
    xs,xc,ys,yc=V(cxs),V(cxc),V(cxs'),V(cxc')
    # ys,yc=collect(xs'),collect(xc')
    xp,yp,xt,yt,fx,fy,zs=copy(xc),copy(yc),copy(xc),copy(yc),zero(xc),zero(xc),zero(xc)
    @. zs = sqrt((xc-xs)^2+(yc-ys)^2)
    czs=Array(zs) ; zn=Node(czs) ; zl=sr*0.05
    scene = Scene(resolution = (400,400)) ; scale!(scene, 1, 1, 10000)
    heatmap!(scene,1:ns,1:ns,lift(z->z,zn),colorrange = (-zl,zl))
    # surface!(scene,1:ns,1:ns,lift(z->z,zn),colorrange = (-zl,zl))
    zlims!((-zl,zl))
    display(scene)
    GLMakie.record(scene, "output.gif", 1:nt÷fr, framerate=30) do j
    # for j in 1:nt÷fr
        for i in 1:fr
            @views @. fx[r1,r1] = -k*(eight*xc[r1,r1]-xc[r0,r0]-xc[r1,r0]-xc[r2,r0]-xc[r0,r1]-xc[r2,r1]-xc[r0,r2]-xc[r1,r2]-xc[r2,r2])
            @views @. fy[r1,r1] = -k*(eight*yc[r1,r1]-yc[r0,r0]-yc[r1,r0]-yc[r2,r0]-yc[r0,r1]-yc[r2,r1]-yc[r0,r2]-yc[r1,r2]-yc[r2,r2])
            @. xt = two*xc-xp+fx*dt2sm
            @. yt = two*yc-yp+fy*dt2sm
            xc,xp,xt = xt,xc,xp
            yc,yp,yt = yt,yc,yp
        end
        @. zs = sqrt((xc-xs)^2+(yc-ys)^2)
        czs=Array(zs)
        zn[] = czs
        yield()
    end
end

@time go()
Here’s an example of how to keep the data on the GPU:
using GLMakie, CUDA
using AbstractFFTs
using GLMakie: GLAbstraction
using CUDA: NVTX   # for the NVTX.@range annotations below (the standalone NVTX.jl package also provides @range)
function plot(T=Float32, N=1024*1024) # 1024*1024 Float32's = 8MiB
    # generate dummy data: a simple gaussian
    # NOTE: for demonstration purposes, we're keeping the time series on the CPU
    t = range(-5, 5, length=N)
    f = CuArray{T}(exp.(-t.^2))

    ## initialization
    #
    # this should be only done once, keeping the buffer and resources across frames

    # get a buffer object and register it with CUDA
    buffer = GLAbstraction.GLBuffer(eltype(f), length(f))
    resource = let
        ref = Ref{CUDA.CUgraphicsResource}()
        CUDA.cuGraphicsGLRegisterBuffer(ref, buffer.id,
            CUDA.CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD)
        ref[]
    end

    ## main processing
    #
    # this needs to be done every time we get new data and want to plot it
    plot = NVTX.@range "main" begin
        NVTX.@range "CUDA" CUDA.@sync begin # doesn't actually need @sync, just for timings
            # map OpenGL buffer object for writing from CUDA
            CUDA.cuGraphicsMapResources(1, [resource], stream())
            # get a CuArray object that we can work with
            array = let
                ptr_ref = Ref{CUDA.CUdeviceptr}()
                numbytes_ref = Ref{Csize_t}()
                CUDA.cuGraphicsResourceGetMappedPointer_v2(ptr_ref, numbytes_ref, resource)
                ptr = reinterpret(CuPtr{T}, ptr_ref[])
                len = Int(numbytes_ref[] ÷ sizeof(T))
                unsafe_wrap(CuArray, ptr, len)
            end
            # example processing: compute the FFT
            # NOTE: real applications will want to perform a pre-planned in-place FFT
            F = fft(f)
            # shift the zero frequency component to the center
            array[1:N÷2] .= abs.(view(F, N÷2+1:N))
            array[N÷2+1:N] .= abs.(view(F, 1:N÷2))
            CUDA.cuGraphicsUnmapResources(1, [resource], stream())
        end
        NVTX.@range "Makie" lines(t, buffer)
    end

    ## clean-up
    CUDA.cuGraphicsUnregisterResource(resource)

    return plot
end
function main()
    # XXX: we need to create a screen, which initializes a GL context,
    # so that we can create a GLBuffer before having rendered anything.
    GLMakie.Screen()
    save("plot.png", plot())
    return
end
I was experimenting with this in order to create a waterfall plot for data that resides on the GPU (hence the fft), but performance isn’t great: processing the 8MB of dummy data using CUDA takes 500us, while rendering it using GLMakie takes 500ms… Anything obvious I’m doing wrong here?
Multiple things that look problematic:
- I’m not sure how you time the rendering, but save("plot.png", plot()) does quite a bit more than just rendering one frame, so I guess 500ms could be (almost) plausible. Rendering exactly one frame takes a bit more manual setup.
- GLMakie.Screen() + save("plot.png", plot()) isn’t guaranteed to use the same context.
- lines(t, buffer) will run into the conversion machinery and will just collect the buffer, I’m afraid. I will need to double check, but GLBuffer(Point2f[...]) may get to OpenGL without getting converted.
I can take a look later to see if I can update the example to correctly render one frame, use the same context, and not convert the GLBuffer to an array.
I was timing the NVTX ranges, so it should exclude the saving. But yeah, I’m new to Makie, so I’m probably making some silly mistakes. If this ends up working properly, I’ll add the necessary high-level wrappers to CUDA.jl so that we can avoid some of the boilerplate here.
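As a very rough sketch of what such a wrapper could look like (the name with_mapped_buffer is made up; it just packages the register/map/wrap/unmap calls from the example above, and a real wrapper would keep the registration cached across frames instead of re-registering on every call):

using CUDA, GLMakie
using GLMakie: GLAbstraction

function with_mapped_buffer(f, buffer::GLAbstraction.GLBuffer{T}) where {T}
    # register the OpenGL buffer object with CUDA
    resource = let
        ref = Ref{CUDA.CUgraphicsResource}()
        CUDA.cuGraphicsGLRegisterBuffer(ref, buffer.id,
            CUDA.CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD)
        ref[]
    end
    try
        # map the buffer and expose it as a CuArray for the duration of `f`
        CUDA.cuGraphicsMapResources(1, [resource], stream())
        ptr_ref = Ref{CUDA.CUdeviceptr}()
        numbytes_ref = Ref{Csize_t}()
        CUDA.cuGraphicsResourceGetMappedPointer_v2(ptr_ref, numbytes_ref, resource)
        arr = unsafe_wrap(CuArray, reinterpret(CuPtr{T}, ptr_ref[]),
                          Int(numbytes_ref[] ÷ sizeof(T)))
        f(arr)
        CUDA.cuGraphicsUnmapResources(1, [resource], stream())
    finally
        CUDA.cuGraphicsUnregisterResource(resource)
    end
end

# usage sketch: fill an existing GLBuffer from the GPU
# buffer = GLAbstraction.GLBuffer(Float32, 1024)
# with_mapped_buffer(buf -> (buf .= 1f0), buffer)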
When tracing with NSight, I didn’t see any copies to CPU memory (i.e., no DtoH copies), so I don’t think it converted back.
When tracing with NSight, I didn’t see any copies to CPU memory
Does that trace OpenGL Buffer copies? The OpenGL backend can’t render separated x, y buffers, so if the plot produces an output, it should copy it to CPU…
Turns out there was some “scalar indexing” going on, where individual elements of the GPU buffers were being copied to the CPU for processing. By using GLBuffer{Point2f} directly, and using low-level APIs that avoid some of the automatic processing (e.g. center! inspects data to determine limits, lines! does a map to detect invalid vertices), I have a working example that keeps all data on the GPU:
# assumes the same packages as the previous example (GLMakie, CUDA, GLMakie.GLAbstraction) plus NVTX
function plot(; T=Float32, N=1024, resolution=(800,600))
    t = CUDA.rand(T, N)
    X = CUDA.rand(T, N)

    ## initialization
    #
    # this should be only done once, keeping the buffer and resources across frames

    # XXX: we need to create a screen, which initializes a GL context,
    # so that we can create a GLBuffer before having rendered anything.
    screen = GLMakie.global_gl_screen(resolution, true)

    # get a buffer object and register it with CUDA
    buffer = GLAbstraction.GLBuffer(Point2f, N)
    resource = let
        ref = Ref{CUDA.CUgraphicsResource}()
        CUDA.cuGraphicsGLRegisterBuffer(ref, buffer.id,
            CUDA.CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD)
        ref[]
    end

    # NOTE: Makie's out-of-place API (lines, scatter) performs many iterating operations,
    # like determining the range of the data, so we use a manual scene instead.
    scene = Scene(; resolution)
    cam2d!(scene)
    # XXX: manually position the camera (`center!` would iterate the data)
    cam = Makie.camera(scene)
    cam.projection[] = Makie.orthographicprojection(
        #= x =# 0f0, 1f0,
        #= y =# 0f0, 1f0,
        #= z =# 0f0, 1f0)

    ## main processing
    #
    # this needs to be done every time we get new data and want to plot it
    NVTX.@range "main" begin
        # process data, generate points
        NVTX.@range "CUDA" begin
            # map OpenGL buffer object for writing from CUDA
            CUDA.cuGraphicsMapResources(1, [resource], stream())
            # get a CuArray object that we can work with
            array = let
                ptr_ref = Ref{CUDA.CUdeviceptr}()
                numbytes_ref = Ref{Csize_t}()
                CUDA.cuGraphicsResourceGetMappedPointer_v2(ptr_ref, numbytes_ref, resource)
                ptr = reinterpret(CuPtr{Point2f}, ptr_ref[])
                len = Int(numbytes_ref[] ÷ sizeof(Point2f))
                unsafe_wrap(CuArray, ptr, len)
            end
            # generate points
            broadcast!(array, t, X) do x, y
                Point2f(x, y)
            end
            # wait for the GPU to finish
            synchronize()
            CUDA.cuGraphicsUnmapResources(1, [resource], stream())
        end
        # generate and render plot
        NVTX.@range "Makie" begin
            scatter!(scene, buffer)
            # force everything to render (for benchmarking purposes)
            GLMakie.render_frame(screen, resize_buffers=false)
            GLMakie.glFinish()
        end
    end

    save("plot.png", scene)

    ## clean-up
    CUDA.cuGraphicsUnregisterResource(resource)

    return
end
This performs well, doing all the rendering in a couple of hundred microseconds. It requires https://github.com/JuliaPlots/Makie.jl/pull/1803, and I’d also recommend disabling gpu_getindex so that scalar iteration of a GLBuffer (a performance trap) is disallowed:
@eval GLMakie.GLAbstraction begin
    # XXX: to make scalar iteration error
    function gpu_getindex(b::GLBuffer{T}, range::UnitRange) where T
        error("GLBuffer getindex") # XXX: for development
        multiplicator = sizeof(T)
        offset = first(range)-1
        value = Vector{T}(undef, length(range))
        bind(b)
        glGetBufferSubData(b.buffertype, multiplicator*offset, sizeof(value), value)
        bind(b, 0)
        return value
    end
end
It’s too bad that GLBuffer doesn’t support many array operations, which would be needed to make, e.g., lines! work properly. I wonder if it wouldn’t be better to pass CuArrays into Makie and have it do the necessary GL interop calls automatically (which would make it possible to keep the CuArray around longer, and perform array operations on it, instead of eagerly converting it to a GLBuffer).
Any chance there has been further progress along this path? I am trying to implement CUDA/OpenGL graphics interop (i.e. to compute an array using CUDA and then render the array of data as a texture drawn on a rectangle in an OpenGL - or Makie - window).
Any chance you can provide more specific pointers on how to proceed?
For specific comparison code, a C version is available at:
Live Display via Graphics Interop | CUDA for Engineers: 2D Grids and Interactive Graphics | InformIT
When attempting to run the nice example of LaurentPlagne on 2nd Oct 24, I first added AbstractPlotting and then removed it after seeing that it is deprecated and not needed. I obtained the undefined-variable error for Node both with and without the call to CUDA allowscalar.
I am using Julia 1.10.5 in a Windows 11 PowerShell with NVIDIA + CUDA.jl.
Thanks - Kevin
Can you try it with a new pkg env?
And if that doesn’t work, show what’s in the environment?
https://pkgdocs.julialang.org/v1/environments/
Ok, I did that starting from scratch in the environments subfolder. Then ran the LaurentPlagne example and obtained the undefined “Node” error again.
Here is the error and status:
julia> @time go()
ERROR: UndefVarError: `Node` not defined
Stacktrace:
 [1] go()
   @ Main .\REPL[42]:13
 [2] macro expansion
   @ .\timing.jl:279 [inlined]
 [3] top-level scope
   @ .\REPL[48]:1
(@v1.10) pkg> status
Status C:\Users\kab\.julia\environments\v1.10\Project.toml
[052768ef] CUDA v5.5.2
[0376089a] ClimaOcean v0.1.3
[e9467ef8] GLMakie v0.10.12
⌅ [9e8cae18] Oceananigans v0.90.14
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use status --outdated
I did not, of course, add AbstractPlotting. Actually, examples from the Oceananigans set with CUDA and GLMakie, using GLMakie.record in place of record and surrounding the plotting instructions with “CUDA.allowscalar() do … end”, did work (that was before my going astray with AbstractPlotting).
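For reference, that pattern looks roughly like this (a minimal sketch with made-up dummy data; the relevant part is wrapping the recording calls in the CUDA.allowscalar do-block):

using CUDA, GLMakie

zs = CUDA.rand(Float32, 100, 100)            # simulation state on the GPU
zn = Observable(Array(zs))                   # CPU copy driving the plot
fig, ax, pl = heatmap(zn)

# allow scalar indexing only around the plotting/recording calls
CUDA.allowscalar() do
    GLMakie.record(fig, "output.mp4", 1:10) do frame
        zs .= CUDA.rand(Float32, 100, 100)   # update the data on the GPU
        zn[] = Array(zs)                     # copy to the CPU and refresh the plot
    end
end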
Ahh, sorry, when I read your post it looked like a new topic, so I didn’t realize you’re referring to an example just above.
Node is now Observable.
And there are a few other changes, but actually surprisingly little:
using CUDA,GLMakie

const T=Float32
const V=CuArray{T,2}

CUDA.allowscalar(false)
ls,ns,ms,λ,dt,sr,k,nt,two,eight,fr =1.e-2,800,1.e-4,0.1,2.e-7,0.1,5e7,5000,T(2),T(8),10
dt2sm=dt^2/ms
r,r0,r1,r2=1:ns,1:ns-2,2:ns-1,3:ns
cxs = [T(j-1)*ls for i in r, j in r]
dx(i,j) = cxs[i,j]-cxs[ns÷2,ns÷2]
cxc = [cxs[i,j]+sr*(dx(i,j))*exp(-0.5(dx(i,j)^2+dx(j,i)^2)/λ^2)/λ for i in r, j in r]
xs,xc,ys,yc=V(cxs),V(cxc),V(cxs'),V(cxc')
# ys,yc=collect(xs'),collect(xc')
xp,yp,xt,yt,fx,fy,zs=copy(xc),copy(yc),copy(xc),copy(yc),zero(xc),zero(xc),zero(xc)
@. zs = sqrt((xc-xs)^2+(yc-ys)^2)
czs=Array(zs) ; zn=Observable(czs) ; zl=sr*0.05

scene = Scene(resolution = (400,400)) ; scale!(scene, 1, 1, 10000)
cam3d!(scene)
surface!(scene, 1:ns, 1:ns, zn, colorrange=(-zl, zl))
center!(scene)
display(scene)

for j in 1:nt÷fr
    for i in 1:fr
        @views @. fx[r1,r1] = -k*(eight*xc[r1,r1]-xc[r0,r0]-xc[r1,r0]-xc[r2,r0]-xc[r0,r1]-xc[r2,r1]-xc[r0,r2]-xc[r1,r2]-xc[r2,r2])
        @views @. fy[r1,r1] = -k*(eight*yc[r1,r1]-yc[r0,r0]-yc[r1,r0]-yc[r2,r0]-yc[r0,r1]-yc[r2,r1]-yc[r0,r2]-yc[r1,r2]-yc[r2,r2])
        @. xt = two*xc-xp+fx*dt2sm
        @. yt = two*yc-yp+fy*dt2sm
        xc,xp,xt = xt,xc,xp
        yc,yp,yt = yt,yc,yp
    end
    @. zs = sqrt((xc-xs)^2+(yc-ys)^2)
    czs=Array(zs)
    zn[] = czs
    yield()
end
And here’s a more modern version:
czs=Array(zs) ; zn=Observable(czs) ; zl=sr*0.05
f, ax, pl = surface(1:ns, 1:ns, zn, colorrange=(-zl, zl), axis=(; type=Axis3))