CuArray + GLMakie

Hi,
I wonder if it would make sense (from the performance point of view) to be able to use GLMakie.jl with CuArrays from CUDA.jl ?

2 Likes

Yes would be lovely and should be very efficient - but isn’t currently easy.
First step is to figure out if the API parts for OpenGL interop are wrapped in CUDA.jl ( I think not, we should open an issue), and then we need to port the samples in CUDA Samples :: CUDA Toolkit Documentation to Julia + CUDA… Makie should mostly work with GLBuffers or GLTextures directly, so after they got created and hooked up with CUDA.jl, there shouldn’t be a big problem to plot those :wink:

2 Likes

Actually they’re wrapped! Let me see if I can create a simple example. What do you actually want to plot? Sadly, surfaces, meshes and scatter will need quite the different setup to share the resources with CUDA.

Hi Simon,
Here is an example which is unfortunately a bit too long for the 7-lines thread…
Anyway, It takes about 5s on my machine to run and show a rather convincing combination of CUDA.jl and Makie.jl.
In the plot (can be heatmap or surface), I copy the CuArray zs to a standard Array czs

toto

using CUDA,GLMakie,AbstractPlotting
const T=Float32
const V=CuArray{T,2}
function go()
    CUDA.allowscalar(false)
    ls,ns,ms,λ,dt,sr,k,nt,two,eight,fr =1.e-2,800,1.e-4,0.1,2.e-7,0.1,5e7,5000,T(2),T(8),10
    dt2sm=dt^2/ms
    r,r0,r1,r2=1:ns,1:ns-2,2:ns-1,3:ns
    cxs = [T(j-1)*ls for i in r, j in r]
    dx(i,j) = cxs[i,j]-cxs[ns÷2,ns÷2]
    cxc = [cxs[i,j]+sr*(dx(i,j))*exp(-0.5(dx(i,j)^2+dx(j,i)^2)/λ^2)/λ  for i in r, j in r]
    xs,xc,ys,yc=V(cxs),V(cxc),V(cxs'),V(cxc')
    # ys,yc=collect(xs'),collect(xc')
    xp,yp,xt,yt,fx,fy,zs=copy(xc),copy(yc),copy(xc),copy(yc),zero(xc),zero(xc),zero(xc)
    @. zs = sqrt((xc-xs)^2+(yc-ys)^2) 
    czs=Array(zs) ; zn=Node(czs) ; zl=sr*0.05
    scene = Scene(resolution = (400,400)) ; scale!(scene, 1, 1, 10000)
    heatmap!(scene,1:ns,1:ns,lift(z->z,zn),colorrange = (-zl,zl))
    # surface!(scene,1:ns,1:ns,lift(z->z,zn),colorrange = (-zl,zl))
    zlims!((-zl,zl))
    display(scene)
    GLMakie.record(scene, "output.gif", 1:nt÷fr, framerate=30) do j
    # for j in 1:nt÷fr
        for i in 1:fr
            @views @. fx[r1,r1] = -k*(eight*xc[r1,r1]-xc[r0,r0]-xc[r1,r0]-xc[r2,r0]-xc[r0,r1]-xc[r2,r1]-xc[r0,r2]-xc[r1,r2]-xc[r2,r2])
            @views @. fy[r1,r1] = -k*(eight*yc[r1,r1]-yc[r0,r0]-yc[r1,r0]-yc[r2,r0]-yc[r0,r1]-yc[r2,r1]-yc[r0,r2]-yc[r1,r2]-yc[r2,r2])
            @. xt = two*xc-xp+fx*dt2sm
            @. yt = two*yc-yp+fy*dt2sm
            xc,xp,xt = xt,xc,xp
            yc,yp,yt = yt,yc,yp
        end
        @. zs = sqrt((xc-xs)^2+(yc-ys)^2)
        czs=Array(zs)
        zn[] = czs
        yield()
    end
end
@time go()

toto

8 Likes

Here’s an example on how to keep the data on the GPU:

using GLMakie, CUDA
using AbstractFFTs

using GLMakie: GLAbstraction

function plot(T=Float32, N=1024*1024)   # 1024*1024 Float32's = 8MiB
    # generate dummy data: a simple gaussian
    # NOTE: for demonstration purposes, we're keeping the time series on the CPU
    t = range(-5, 5, length=N)
    f = CuArray{T}(exp.(-t.^2))


    ## initialization
    #
    # this should be only done once, keeping the buffer and resources across frames

    # get a buffer object and register it with CUDA
    buffer = GLAbstraction.GLBuffer(eltype(f), length(f))
    resource = let
        ref = Ref{CUDA.CUgraphicsResource}()
        CUDA.cuGraphicsGLRegisterBuffer(ref, buffer.id,
                                        CUDA.CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD)
        ref[]
    end


    ## main processing
    #
    # this needs to be done every time we get new data and want to plot it

    plot = NVTX.@range "main" begin
        NVTX.@range "CUDA" CUDA.@sync begin # doesn't actually need @sync, just for timings
            # map OpenGL buffer object for writing from CUDA
            CUDA.cuGraphicsMapResources(1, [resource], stream())

            # get a CuArray object that we can work with
            array = let
                ptr_ref = Ref{CUDA.CUdeviceptr}()
                numbytes_ref = Ref{Csize_t}()
                CUDA.cuGraphicsResourceGetMappedPointer_v2(ptr_ref, numbytes_ref, resource)

                ptr = reinterpret(CuPtr{T}, ptr_ref[])
                len = Int(numbytes_ref[] ÷ sizeof(T))

                unsafe_wrap(CuArray, ptr, len)
            end

            # example processing: compute the FFT
            # NOTE: real applications will want to perform a pre-planned in-place FFT
            F = fft(f)
            # shift the zero frequency component to the center
            array[1:N÷2] .= abs.(view(F, N÷2+1:N))
            array[N÷2+1:N] .= abs.(view(F, 1:N÷2))

            CUDA.cuGraphicsUnmapResources(1, [resource], stream())
        end

        NVTX.@range "Makie" lines(t, buffer)
    end


    ## clean-up

    CUDA.cuGraphicsUnregisterResource(resource)

    return plot
end

function main()
    # XXX: we need create a screen, which initializes a GL Context,
    #      so that we can create a GLBuffer before having rendered anything.
    GLMakie.Screen()

    save("plot.png", plot())
    return
end

I was experimenting with this in order to create a waterfall plot for data that resides on the GPU (hence the fft), but performance isn’t great: Processing the 8MB dummy data data using CUDA takes 500us, rendering that using GLMakie takes 500ms… Anything obvious I’m doing wrong here?

5 Likes

Multiple things that look problematic:

  1. I’m not sure how you time the rendering, but save("plot.png", plot()) does quite a bit more than just rendering one frame, so I guess 500ms could be (almost) plausible. Rendering exactly one frame takes a bit more manual setup
  2. GLMakie.Screen() + save("plot.png", plot()) isn’t guaranteed to use the same context
  3. lines(t, buffer) will run into the conversion machine and will just collect the buffer I’m afraid. I will need to double check, but GLBuffer(Point2f[...]) may get to OpenGL without getting converted

I can take a look later if I can update the example to correctly render one frame and use the same context and not convert the GLBuffer to an array

I was timing the NVTX ranges, so it should exclude the saving. But yeah, I’m new to Makie, so probably doing some silly mistakes :slight_smile: If this ends up working properly I’ll add the necessary high-level wrappers to CUDA.jl so that we can avoid some of the boilerplate here.

When tracing with NSight, I didn’t see any copies to CPU memory (i.e., no DtoH copies), so I don’t think it converted back.

When tracing with NSight, I didn’t see any copies to CPU memory

Does that trace OpenGL Buffer copies? The OpenGL backend can’t render separated x, y buffers, so if the plot produces an output, it should copy it to CPU…

Turns out there was some “scalar indexing” going on, where individual elements of the GPU buffers were being copied to the CPU for processing. By using GLBuffer{Point2f} directly, and using low-level APIs that avoid some of the automatic processing (e.g. center! inspects data to determine limits, lines! does a map to detect invalid vertices) I have a working example that keeps all data on the GPU:

function plot(; T=Float32, N=1024, resolution=(800,600))
    t = CUDA.rand(T, N)
    X = CUDA.rand(T, N)


    ## initialization
    #
    # this should be only done once, keeping the buffer and resources across frames

    # XXX: we need create a screen, which initializes a GL Context,
    #      so that we can create a GLBuffer before having rendered anything.
    screen = GLMakie.global_gl_screen(resolution, true)

    # get a buffer object and register it with CUDA
    buffer = GLAbstraction.GLBuffer(Point2f, N)
    resource = let
        ref = Ref{CUDA.CUgraphicsResource}()
        CUDA.cuGraphicsGLRegisterBuffer(ref, buffer.id,
                                        CUDA.CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD)
        ref[]
    end

    # NOTE: Makie's out-of-place API (lines, scatter) performs may iterating operations,
    #       like determining the range of the data, so we use a manual scene instead.
    scene = Scene(; resolution)
    cam2d!(scene)

    # XXX: manually position the cameral (`center!` would iterate data)
    cam = Makie.camera(scene)
    cam.projection[] = Makie.orthographicprojection(
        #= x =# 0f0, 1f0,
        #= y =# 0f0, 1f0,
        #= z =# 0f0, 1f0)


    ## main processing
    #
    # this needs to be done every time we get new data and want to plot it

    NVTX.@range "main" begin
        # process data, generate points
        NVTX.@range "CUDA" begin
            # map OpenGL buffer object for writing from CUDA
            CUDA.cuGraphicsMapResources(1, [resource], stream())

            # get a CuArray object that we can work with
            array = let
                ptr_ref = Ref{CUDA.CUdeviceptr}()
                numbytes_ref = Ref{Csize_t}()
                CUDA.cuGraphicsResourceGetMappedPointer_v2(ptr_ref, numbytes_ref, resource)

                ptr = reinterpret(CuPtr{Point2f}, ptr_ref[])
                len = Int(numbytes_ref[] ÷ sizeof(Point2f))

                unsafe_wrap(CuArray, ptr, len)
            end

            # generate points
            broadcast!(array, t, X) do x, y
                Point2f(x, y)
            end

            # wait for the GPU to finish
            synchronize()

            CUDA.cuGraphicsUnmapResources(1, [resource], stream())
        end

        # generate and render plot
        NVTX.@range "Makie" begin
            scatter!(scene, buffer)

            # force everything to render (for benchmarking purposes)
            GLMakie.render_frame(screen, resize_buffers=false)
            GLMakie.glFinish()
        end

    end

    save("plot.png", scene)


    ## clean-up

    CUDA.cuGraphicsUnregisterResource(resource)

    return
end

This performs well, doing all the rendering in a couple of 100s of us. It requires https://github.com/JuliaPlots/Makie.jl/pull/1803, and I’d also recommend to disable gpu_getindex so that scalar iteration of GLBuffer (a performance trap) is disallowed:

@eval GLMakie.GLAbstraction begin
    # XXX: to make scalar iteration error
    function gpu_getindex(b::GLBuffer{T}, range::UnitRange) where T
        error("GLBuffer getindex")  # XXX: for development
        multiplicator = sizeof(T)
        offset = first(range)-1
        value = Vector{T}(undef, length(range))
        bind(b)
        glGetBufferSubData(b.buffertype, multiplicator*offset, sizeof(value), value)
        bind(b, 0)
        return value
    end
end

It’s too bad that GLBuffer doesn’t support many array operations to make, e.g., lines! work properly. I wonder if it wouldn’t be better to pass CuArrays into Makie and have it do the necessary GL Interop calls automatically (which would make is possible to keep the CuArray around longer, and perform array operations on it, instead of eagerly converting it to a GLBuffer).

5 Likes

Any chance there has been further progress along this path? I am trying to implement CUDA/OpenGL graphics interop (i.e. to compute an array using CUDA and then render the array of data as a texture drawn on a rectangle in an OpenGL - or Makie - window).

Any chance you can provide more specific pointers on how to proceed?

For specific comparison code, a C version is available at:
Live Display via Graphics Interop | CUDA for Engineers: 2D Grids and Interactive Graphics | InformIT

5 Likes