Hi,
I wonder if it would make sense (from the performance point of view) to be able to use GLMakie.jl with CuArrays from CUDA.jl?
Yes, that would be lovely and should be very efficient, but it isn’t currently easy.
The first step is to figure out if the API parts for OpenGL interop are wrapped in CUDA.jl (I think not, so we should open an issue), and then we need to port the samples from the CUDA Toolkit documentation (CUDA Samples :: CUDA Toolkit Documentation) to Julia + CUDA… Makie should mostly work with GLBuffers or GLTextures directly, so once they are created and hooked up with CUDA.jl, there shouldn’t be a big problem plotting them.
Actually they’re wrapped! Let me see if I can create a simple example. What do you actually want to plot? Sadly, surfaces, meshes and scatter will need quite different setups to share the resources with CUDA.
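(If you want to check for yourself that the interop entry points are exposed, something like the snippet below should do; the names are the raw CUDA driver-API bindings that come up later in this thread.)

using CUDA

# quick check that the OpenGL-interop parts of the CUDA driver API are available in CUDA.jl
@show isdefined(CUDA, :cuGraphicsGLRegisterBuffer)
@show isdefined(CUDA, :cuGraphicsMapResources)
@show isdefined(CUDA, :cuGraphicsResourceGetMappedPointer_v2)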
Hi Simon,
Here is an example which is unfortunately a bit too long for the 7-lines thread…
Anyway, it takes about 5 s on my machine to run and shows a rather convincing combination of CUDA.jl and Makie.jl.
In the plot (which can be a heatmap or a surface), I copy the CuArray zs to a standard Array czs…
using CUDA,GLMakie,AbstractPlotting
const T=Float32
const V=CuArray{T,2}

function go()
    CUDA.allowscalar(false)
    ls,ns,ms,λ,dt,sr,k,nt,two,eight,fr =1.e-2,800,1.e-4,0.1,2.e-7,0.1,5e7,5000,T(2),T(8),10
    dt2sm=dt^2/ms
    r,r0,r1,r2=1:ns,1:ns-2,2:ns-1,3:ns
    cxs = [T(j-1)*ls for i in r, j in r]
    dx(i,j) = cxs[i,j]-cxs[ns÷2,ns÷2]
    cxc = [cxs[i,j]+sr*(dx(i,j))*exp(-0.5(dx(i,j)^2+dx(j,i)^2)/λ^2)/λ for i in r, j in r]
    xs,xc,ys,yc=V(cxs),V(cxc),V(cxs'),V(cxc')
    # ys,yc=collect(xs'),collect(xc')
    xp,yp,xt,yt,fx,fy,zs=copy(xc),copy(yc),copy(xc),copy(yc),zero(xc),zero(xc),zero(xc)
    @. zs = sqrt((xc-xs)^2+(yc-ys)^2)
    czs=Array(zs) ; zn=Node(czs) ; zl=sr*0.05
    scene = Scene(resolution = (400,400)) ; scale!(scene, 1, 1, 10000)
    heatmap!(scene,1:ns,1:ns,lift(z->z,zn),colorrange = (-zl,zl))
    # surface!(scene,1:ns,1:ns,lift(z->z,zn),colorrange = (-zl,zl))
    zlims!((-zl,zl))
    display(scene)
    GLMakie.record(scene, "output.gif", 1:nt÷fr, framerate=30) do j
    # for j in 1:nt÷fr
        for i in 1:fr
            @views @. fx[r1,r1] = -k*(eight*xc[r1,r1]-xc[r0,r0]-xc[r1,r0]-xc[r2,r0]-xc[r0,r1]-xc[r2,r1]-xc[r0,r2]-xc[r1,r2]-xc[r2,r2])
            @views @. fy[r1,r1] = -k*(eight*yc[r1,r1]-yc[r0,r0]-yc[r1,r0]-yc[r2,r0]-yc[r0,r1]-yc[r2,r1]-yc[r0,r2]-yc[r1,r2]-yc[r2,r2])
            @. xt = two*xc-xp+fx*dt2sm
            @. yt = two*yc-yp+fy*dt2sm
            xc,xp,xt = xt,xc,xp
            yc,yp,yt = yt,yc,yp
        end
        @. zs = sqrt((xc-xs)^2+(yc-ys)^2)
        czs=Array(zs)
        zn[] = czs
        yield()
    end
end

@time go()
Here’s an example of how to keep the data on the GPU:
using GLMakie, CUDA
using AbstractFFTs
using GLMakie: GLAbstraction
using CUDA: NVTX   # for the NVTX.@range annotations below (the standalone NVTX.jl package also provides @range)
function plot(T=Float32, N=1024*1024) # 1024*1024 Float32's = 8MiB
    # generate dummy data: a simple gaussian
    # NOTE: for demonstration purposes, we're keeping the time series on the CPU
    t = range(-5, 5, length=N)
    f = CuArray{T}(exp.(-t.^2))

    ## initialization
    #
    # this should be only done once, keeping the buffer and resources across frames

    # get a buffer object and register it with CUDA
    buffer = GLAbstraction.GLBuffer(eltype(f), length(f))
    resource = let
        ref = Ref{CUDA.CUgraphicsResource}()
        CUDA.cuGraphicsGLRegisterBuffer(ref, buffer.id,
            CUDA.CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD)
        ref[]
    end

    ## main processing
    #
    # this needs to be done every time we get new data and want to plot it
    plot = NVTX.@range "main" begin
        NVTX.@range "CUDA" CUDA.@sync begin # doesn't actually need @sync, just for timings
            # map OpenGL buffer object for writing from CUDA
            CUDA.cuGraphicsMapResources(1, [resource], stream())
            # get a CuArray object that we can work with
            array = let
                ptr_ref = Ref{CUDA.CUdeviceptr}()
                numbytes_ref = Ref{Csize_t}()
                CUDA.cuGraphicsResourceGetMappedPointer_v2(ptr_ref, numbytes_ref, resource)
                ptr = reinterpret(CuPtr{T}, ptr_ref[])
                len = Int(numbytes_ref[] ÷ sizeof(T))
                unsafe_wrap(CuArray, ptr, len)
            end
            # example processing: compute the FFT
            # NOTE: real applications will want to perform a pre-planned in-place FFT
            F = fft(f)
            # shift the zero frequency component to the center
            array[1:N÷2] .= abs.(view(F, N÷2+1:N))
            array[N÷2+1:N] .= abs.(view(F, 1:N÷2))
            CUDA.cuGraphicsUnmapResources(1, [resource], stream())
        end
        NVTX.@range "Makie" lines(t, buffer)
    end

    ## clean-up
    CUDA.cuGraphicsUnregisterResource(resource)

    return plot
end
function main()
    # XXX: we need to create a screen, which initializes a GL context,
    # so that we can create a GLBuffer before having rendered anything.
    GLMakie.Screen()
    save("plot.png", plot())
    return
end
I was experimenting with this in order to create a waterfall plot for data that resides on the GPU (hence the fft), but performance isn’t great: processing the 8MB of dummy data using CUDA takes 500us, while rendering it using GLMakie takes 500ms… Anything obvious I’m doing wrong here?
Multiple things that look problematic:
- I’m not sure how you time the rendering, but save("plot.png", plot()) does quite a bit more than just rendering one frame, so I guess 500ms could be (almost) plausible. Rendering exactly one frame takes a bit more manual setup.
- GLMakie.Screen() + save("plot.png", plot()) isn’t guaranteed to use the same context.
- lines(t, buffer) will run into the conversion machinery and will just collect the buffer, I’m afraid. I will need to double check, but GLBuffer(Point2f[...]) may get to OpenGL without getting converted.
I can take a look later to see if I can update the example to correctly render one frame, use the same context, and not convert the GLBuffer to an array.
I was timing the NVTX ranges, so it should exclude the saving. But yeah, I’m new to Makie, so I’m probably making some silly mistakes. If this ends up working properly, I’ll add the necessary high-level wrappers to CUDA.jl so that we can avoid some of the boilerplate here.
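As a very rough sketch of what such a wrapper could look like (the name with_mapped_buffer is made up; it just packages the register/map/wrap/unmap calls from the example above, and a real wrapper would keep the registration cached across frames instead of re-registering on every call):

using CUDA, GLMakie
using GLMakie: GLAbstraction

function with_mapped_buffer(f, buffer::GLAbstraction.GLBuffer{T}) where {T}
    # register the OpenGL buffer object with CUDA
    resource = let
        ref = Ref{CUDA.CUgraphicsResource}()
        CUDA.cuGraphicsGLRegisterBuffer(ref, buffer.id,
            CUDA.CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD)
        ref[]
    end
    try
        # map the buffer and expose it as a CuArray for the duration of `f`
        CUDA.cuGraphicsMapResources(1, [resource], stream())
        ptr_ref = Ref{CUDA.CUdeviceptr}()
        numbytes_ref = Ref{Csize_t}()
        CUDA.cuGraphicsResourceGetMappedPointer_v2(ptr_ref, numbytes_ref, resource)
        arr = unsafe_wrap(CuArray, reinterpret(CuPtr{T}, ptr_ref[]),
                          Int(numbytes_ref[] ÷ sizeof(T)))
        f(arr)
        CUDA.cuGraphicsUnmapResources(1, [resource], stream())
    finally
        CUDA.cuGraphicsUnregisterResource(resource)
    end
end

# usage sketch: fill an existing GLBuffer from the GPU
# buffer = GLAbstraction.GLBuffer(Float32, 1024)
# with_mapped_buffer(buf -> (buf .= 1f0), buffer)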
When tracing with NSight, I didn’t see any copies to CPU memory (i.e., no DtoH copies), so I don’t think it converted back.
When tracing with NSight, I didn’t see any copies to CPU memory
Does that trace OpenGL Buffer copies? The OpenGL backend can’t render separated x, y buffers, so if the plot produces an output, it should copy it to CPU…
Turns out there was some “scalar indexing” going on, where individual elements of the GPU buffers were being copied to the CPU for processing. By using GLBuffer{Point2f} directly, and using low-level APIs that avoid some of the automatic processing (e.g. center! inspects data to determine limits, lines! does a map to detect invalid vertices), I have a working example that keeps all data on the GPU:
# assumes the same packages as the previous example (GLMakie, CUDA, GLMakie.GLAbstraction) plus NVTX
function plot(; T=Float32, N=1024, resolution=(800,600))
    t = CUDA.rand(T, N)
    X = CUDA.rand(T, N)

    ## initialization
    #
    # this should be only done once, keeping the buffer and resources across frames

    # XXX: we need to create a screen, which initializes a GL context,
    # so that we can create a GLBuffer before having rendered anything.
    screen = GLMakie.global_gl_screen(resolution, true)

    # get a buffer object and register it with CUDA
    buffer = GLAbstraction.GLBuffer(Point2f, N)
    resource = let
        ref = Ref{CUDA.CUgraphicsResource}()
        CUDA.cuGraphicsGLRegisterBuffer(ref, buffer.id,
            CUDA.CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD)
        ref[]
    end

    # NOTE: Makie's out-of-place API (lines, scatter) performs many iterating operations,
    # like determining the range of the data, so we use a manual scene instead.
    scene = Scene(; resolution)
    cam2d!(scene)
    # XXX: manually position the camera (`center!` would iterate the data)
    cam = Makie.camera(scene)
    cam.projection[] = Makie.orthographicprojection(
        #= x =# 0f0, 1f0,
        #= y =# 0f0, 1f0,
        #= z =# 0f0, 1f0)

    ## main processing
    #
    # this needs to be done every time we get new data and want to plot it
    NVTX.@range "main" begin
        # process data, generate points
        NVTX.@range "CUDA" begin
            # map OpenGL buffer object for writing from CUDA
            CUDA.cuGraphicsMapResources(1, [resource], stream())
            # get a CuArray object that we can work with
            array = let
                ptr_ref = Ref{CUDA.CUdeviceptr}()
                numbytes_ref = Ref{Csize_t}()
                CUDA.cuGraphicsResourceGetMappedPointer_v2(ptr_ref, numbytes_ref, resource)
                ptr = reinterpret(CuPtr{Point2f}, ptr_ref[])
                len = Int(numbytes_ref[] ÷ sizeof(Point2f))
                unsafe_wrap(CuArray, ptr, len)
            end
            # generate points
            broadcast!(array, t, X) do x, y
                Point2f(x, y)
            end
            # wait for the GPU to finish
            synchronize()
            CUDA.cuGraphicsUnmapResources(1, [resource], stream())
        end
        # generate and render plot
        NVTX.@range "Makie" begin
            scatter!(scene, buffer)
            # force everything to render (for benchmarking purposes)
            GLMakie.render_frame(screen, resize_buffers=false)
            GLMakie.glFinish()
        end
    end

    save("plot.png", scene)

    ## clean-up
    CUDA.cuGraphicsUnregisterResource(resource)

    return
end
This performs well, doing all the rendering in a couple of hundred microseconds. It requires https://github.com/JuliaPlots/Makie.jl/pull/1803, and I’d also recommend disabling gpu_getindex so that scalar iteration of a GLBuffer (a performance trap) is disallowed:
@eval GLMakie.GLAbstraction begin
    # XXX: to make scalar iteration error
    function gpu_getindex(b::GLBuffer{T}, range::UnitRange) where T
        error("GLBuffer getindex") # XXX: for development
        multiplicator = sizeof(T)
        offset = first(range)-1
        value = Vector{T}(undef, length(range))
        bind(b)
        glGetBufferSubData(b.buffertype, multiplicator*offset, sizeof(value), value)
        bind(b, 0)
        return value
    end
end
It’s too bad that GLBuffer doesn’t support many array operations, which would be needed to make, e.g., lines! work properly. I wonder if it wouldn’t be better to pass CuArrays into Makie and have it do the necessary GL interop calls automatically (which would make it possible to keep the CuArray around longer, and perform array operations on it, instead of eagerly converting it to a GLBuffer).
Any chance there has been further progress along this path? I am trying to implement CUDA/OpenGL graphics interop (i.e. to compute an array using CUDA and then render the array of data as a texture drawn on a rectangle in an OpenGL - or Makie - window).
Any chance you can provide more specific pointers on how to proceed?
For specific comparison code, a C version is available at:
Live Display via Graphics Interop | CUDA for Engineers: 2D Grids and Interactive Graphics | InformIT
When attempting to run the nice example of LaurentPlagne on 2nd Oct 24, I first added AbstractPlotting and then removed it after seeing that it is deprecated and not needed. I obtained the undefined-variable error for Node both with and without the call to CUDA allowscalar.
I am using Julia 1.10.5 in a Windows 11 PowerShell with NVIDIA + CUDA.jl.
Thanks - Kevin
Can you try it with a new pkg env?
And if that doesn’t work, show what’s in the environment?
https://pkgdocs.julialang.org/v1/environments/
Ok, I did that starting from scratch in the environments subfolder. Then ran the LaurentPlagne example and obtained the undefined “Node” error again.
Here is the error and status:
julia> @time go()
ERROR: UndefVarError: `Node` not defined
Stacktrace:
 [1] go()
   @ Main .\REPL[42]:13
 [2] macro expansion
   @ .\timing.jl:279 [inlined]
 [3] top-level scope
   @ .\REPL[48]:1
(@v1.10) pkg> status
Status C:\Users\kab\.julia\environments\v1.10\Project.toml
[052768ef] CUDA v5.5.2
[0376089a] ClimaOcean v0.1.3
[e9467ef8] GLMakie v0.10.12
⌅ [9e8cae18] Oceananigans v0.90.14
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use status --outdated
I did not, of course, add AbstractPlotting. Actually, examples from the Oceananigans set with CUDA and GLMakie, using GLMakie.record in place of record and surrounding the plotting instructions with “CUDA.allowscalar() do … end”, did work (that was before my going astray with AbstractPlotting).
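For reference, that pattern looks roughly like this (a minimal sketch with made-up dummy data; the relevant part is wrapping the recording calls in the CUDA.allowscalar do-block):

using CUDA, GLMakie

zs = CUDA.rand(Float32, 100, 100)            # simulation state on the GPU
zn = Observable(Array(zs))                   # CPU copy driving the plot
fig, ax, pl = heatmap(zn)

# allow scalar indexing only around the plotting/recording calls
CUDA.allowscalar() do
    GLMakie.record(fig, "output.mp4", 1:10) do frame
        zs .= CUDA.rand(Float32, 100, 100)   # update the data on the GPU
        zn[] = Array(zs)                     # copy to the CPU and refresh the plot
    end
end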
Ahh, sorry, when I read your post it looked like a new topic, so I didn’t realize you’re referring to an example just above.
Node is now Observable.
And there are a few other changes, but actually surprisingly little:
using CUDA,GLMakie

const T=Float32
const V=CuArray{T,2}

CUDA.allowscalar(false)
ls,ns,ms,λ,dt,sr,k,nt,two,eight,fr =1.e-2,800,1.e-4,0.1,2.e-7,0.1,5e7,5000,T(2),T(8),10
dt2sm=dt^2/ms
r,r0,r1,r2=1:ns,1:ns-2,2:ns-1,3:ns
cxs = [T(j-1)*ls for i in r, j in r]
dx(i,j) = cxs[i,j]-cxs[ns÷2,ns÷2]
cxc = [cxs[i,j]+sr*(dx(i,j))*exp(-0.5(dx(i,j)^2+dx(j,i)^2)/λ^2)/λ for i in r, j in r]
xs,xc,ys,yc=V(cxs),V(cxc),V(cxs'),V(cxc')
# ys,yc=collect(xs'),collect(xc')
xp,yp,xt,yt,fx,fy,zs=copy(xc),copy(yc),copy(xc),copy(yc),zero(xc),zero(xc),zero(xc)
@. zs = sqrt((xc-xs)^2+(yc-ys)^2)
czs=Array(zs) ; zn=Observable(czs) ; zl=sr*0.05

scene = Scene(resolution = (400,400)) ; scale!(scene, 1, 1, 10000)
cam3d!(scene)
surface!(scene, 1:ns, 1:ns, zn, colorrange=(-zl, zl))
center!(scene)
display(scene)

for j in 1:nt÷fr
    for i in 1:fr
        @views @. fx[r1,r1] = -k*(eight*xc[r1,r1]-xc[r0,r0]-xc[r1,r0]-xc[r2,r0]-xc[r0,r1]-xc[r2,r1]-xc[r0,r2]-xc[r1,r2]-xc[r2,r2])
        @views @. fy[r1,r1] = -k*(eight*yc[r1,r1]-yc[r0,r0]-yc[r1,r0]-yc[r2,r0]-yc[r0,r1]-yc[r2,r1]-yc[r0,r2]-yc[r1,r2]-yc[r2,r2])
        @. xt = two*xc-xp+fx*dt2sm
        @. yt = two*yc-yp+fy*dt2sm
        xc,xp,xt = xt,xc,xp
        yc,yp,yt = yt,yc,yp
    end
    @. zs = sqrt((xc-xs)^2+(yc-ys)^2)
    czs=Array(zs)
    zn[] = czs
    yield()
end
And here’s a more modern version:
czs=Array(zs) ; zn=Observable(czs) ; zl=sr*0.05
f, ax, pl = surface(1:ns, 1:ns, zn, colorrange=(-zl, zl), axis=(; type=Axis3))