GPU kernel does not scale properly with data

Hello, I am trying to write an ODE solver on the GPU. I am solving the same simple ODE, but for many different sets of initial data. I tried the EnsembleGPUArray option from DiffEqGPU.jl, but it was quite slow.
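For reference, that attempt looked roughly like the sketch below (not my exact code; kerr_geod!, u0s and ps are placeholders for a DiffEq-style in-place version of my RHS and the per-ray initial states and parameters):

using DiffEqGPU, OrdinaryDiffEq, CUDA

# Sketch only: kerr_geod!(du, u, p, t) stands in for an in-place version of the RHS
# acting on 5-element Float32 vectors; u0s and ps are Vectors of initial states/parameters.
prob = ODEProblem(kerr_geod!, u0s[1], (0f0, 10f0), ps[1])
prob_func = (prob, i, repeat) -> remake(prob, u0 = u0s[i], p = ps[i])
ensemble = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
sol = solve(ensemble, Tsit5(), EnsembleGPUArray(CUDA.CUDABackend());
            trajectories = length(u0s), saveat = 0.1f0)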
The kernel I wrote reads three CellArray objects: one holds the state vectors (cells of type SVector{5,Float32}), another is just a scratch register for the Euler step (a copy of the first), and the last holds the parameters (cells of type SVector{3,Float32}). Using these objects, the kernel performs a single Euler step. Evaluating the RHS of my ODE for one state vector on the CPU takes about 70 ns.
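That 70 ns figure comes from timing the RHS in isolation on the CPU, roughly like this (a sketch; u and p are just one sample state and parameter vector, and BenchmarkTools is assumed):

using StaticArrays, BenchmarkTools

# Sample inputs for timing the RHS alone; the real values come from
# generate_rays_gpu in the code further down.
u = SVector{5,Float32}(rand(Float32, 5))
p = SVector{3,Float32}(rand(Float32, 3))
@btime kerr_geod($u, $p)    # ≈ 70 ns on my machine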
The effective bandwidth on my Tesla K40c goes like this:

[attached plot: effective bandwidth Teff vs. N, produced by plot_diagnostic below]

The saxpy kernel

function _saxpy_kernel!(a, x, y)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

gives Teff ≈ 178 GB/s. So the ODE kernel is much slower.
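The 178 GB/s number comes from a measurement along these lines (a sketch of the timing; it counts three Float32 accesses per element, reading x and y and writing y):

using CUDA

N = 2^24
x = CUDA.rand(Float32, N)
y = CUDA.rand(Float32, N)
a = 2f0

nthreads = 256
nblocks = cld(N, nthreads)

# warm up once so compilation is not included in the timing
CUDA.@sync @cuda threads=nthreads blocks=nblocks _saxpy_kernel!(a, x, y)
t = CUDA.@elapsed CUDA.@sync @cuda threads=nthreads blocks=nblocks _saxpy_kernel!(a, x, y)

Teff = 3 * N * sizeof(Float32) * 1e-9 / t    # effective bandwidth in GB/s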
Could you please help me understand what I’m doing, or even thinking, wrong?

function evolve_gpu_kernel!(photons, photon_params, RHS, time_stepper, params)
    N, M, tmps, dt = params
    # global 2D thread indices over the N×M grid of cells
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if ix <= N && iy <= M
        # advance this cell by one step of the supplied time stepper
        photons[ix, iy] = time_stepper(RHS, tmps[ix, iy], photons[ix, iy], photon_params[ix, iy], dt)
    end
    return
end

function Euler(rhs, du, u, p, dt)
    du = rhs(u, p)          # rebinds the local `du`; the buffer passed in is not written to
    return u .+ du .* dt    # forward Euler step, returning a new SVector
end


function run(p,nthreads=nothing)
    metric,N,M = p
    θ0 = deg2rad(87.5)
    alims = (-9,9) ### pθ/pt impact param
    blims = (-10,10) ### -pφ/pt impact param

    pos = SVector{D,T}([0,init_r,θ0,π/2])
    inputs,ps = generate_rays_gpu(pos, metric, N, M, alims, blims)
    photons_tmp = Matrix(reduce(hcat,vec(inputs))') |> cu
    photon_params_tmp = Matrix(reduce(hcat,vec(ps))') |> cu

    photons = CuCellArray{SVector{5,Float32}}(undef,N,M)
    tmp = CuCellArray{SVector{5,Float32}}(undef,N,M)
    photon_params = CuCellArray{SVector{3,Float32}}(undef,N,M)

    photons.data .= photons_tmp
    photon_params.data .= photon_params_tmp
        
    CUDA.unsafe_free!(photons_tmp)
    photons_tmp = nothing
    CUDA.unsafe_free!(photon_params_tmp)
    photon_params_tmp = nothing

    RHS = kerr_geod
    time_stepper = Euler
    dt = 0.01f0

    if isnothing(nthreads)
        MAX_THREADS_PER_BLOCK = CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
        nthreads = (isqrt(MAX_THREADS_PER_BLOCK), isqrt(MAX_THREADS_PER_BLOCK))
    end
    nblocks = cld.((N,M),nthreads)

    params = (N,M,tmp,dt)
    t = CUDA.@elapsed CUDA.@sync @cuda(threads=nthreads,blocks=nblocks,
                        evolve_gpu_kernel!(photons,photon_params,RHS,time_stepper,params))

    return t
end

function plot_diagnostic()
    idxs = [i for i in 4:10]
    x = 2 .^ idxs
    labels = [latexstring("2^{$idx}") for idx in idxs]
    time = n -> run((kerr_metric_bl,n,n),(16,16));
    t = time.(x)

    Aeff(N) = (2*5+3)*1e-9*N^2*sizeof(Float32) ## GB
    Teff = @. Aeff(x)/t 

    plot(x,t .* 1e3,
         xlabel="N", ylabel="Wall Time [ms]",
         label=nothing, tickfontsize = 14, dpi=144,
         xticks=(x,labels),xaxis=:log2)
    scatter!(x,t .* 1e3,label=nothing)
    savefig("wall_times.png")

    plot(x,Teff,
         xlabel="N", ylabel=L"$T_{\mathrm{eff}}$ [GB/s]",
         label=nothing,
         xticks=(x,labels),xaxis=:log2)
    scatter!(x,Teff,label=nothing)
    savefig("Teff.png")
end
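For clarity, the diagnostic above computes the effective bandwidth as

$$
T_\mathrm{eff} = \frac{A_\mathrm{eff}(N)}{t}, \qquad
A_\mathrm{eff}(N) = (2 \cdot 5 + 3)\, N^2 \,\mathrm{sizeof}(\mathrm{Float32}) \times 10^{-9}\ \mathrm{GB},
$$

i.e. 13 Float32 values per cell over the N×N grid, divided by the measured wall time.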

Did you forget @inbounds? It's possible that the bounds checks generate a lot of extra code, lowering occupancy. Try running under CUDA.@profile trace=true for details (notably, comparing the launch configuration and register use of SAXPY vs. your kernel).
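Concretely, the @inbounds suggestion amounts to something like this (your kernel unchanged apart from the annotation):

function evolve_gpu_kernel!(photons, photon_params, RHS, time_stepper, params)
    N, M, tmps, dt = params
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if ix <= N && iy <= M
        # the ix/iy check above already guarantees the accesses are in bounds
        @inbounds photons[ix, iy] = time_stepper(RHS, tmps[ix, iy], photons[ix, iy],
                                                 photon_params[ix, iy], dt)
    end
    return
end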

Thank you for your reply. I get the following error when running CUDA.@profile trace=true:

[ Info: This Julia session is already being profiled; defaulting to the external profiler.
ERROR: MethodError: no method matching profile_externally(::var"##132#profiled_code"; trace::Bool)

Closest candidates are:
  profile_externally(::Any) got unsupported keyword argument "trace"
   @ CUDA ~/.julia/packages/CUDA/Tl08O/src/profile.jl:130

The server I am using does not have the external profilers shown in the CUDA.jl docs. Should I contact the admin to have them installed, or is there a way to deal with this without the external profilers?

You shouldn't have to install any external profiler for that functionality. In fact, I think you may be misinterpreting the error message: you're somehow already running Julia under a profiler, as attaching a new profiler fails (with an error that's hard to interpret differently): CUDA.jl/src/profile.jl at 4e9513b8a4e56629a236b58504d609b1775a8236 · JuliaGPU/CUDA.jl · GitHub

I'm sorry, but I am not exactly following you, as I am lacking a lot of background knowledge, probably in Julia itself too.
I am not loading any profilers, at least not explicitly; I am only loading the following:

using CellArrays, CUDA
using LinearAlgebra
using StaticArrays
using LaTeXStrings
using Images
using Plots

This error pops up when I run the kernel under CUDA.@profile trace = true.
The error is not there if I drop the trace = true part, but then I just get a CUDA.KernelState object back.

How are you starting julia?

I run julia -t 32 and I am currently using version 1.10.3.

Are you perhaps nesting CUDA.@profile invocations?

I only run

CUDA.@profile trace = true CUDA.@sync @cuda(threads=nthreads,blocks=nblocks,
                        evolve_gpu_kernel!(photons,photon_params,RHS,time_stepper,params))

That’s bizarre. What happens if you run the logic of the CUPTI detection in isolation? Does it throw the same error?

I can't seem to find a way to call it through CUDA directly, so I copied and pasted the function into the REPL after loading the packages above. The detect_cupti function as a whole yields true, but the following snippet:

cfg = CUPTI.ActivityConfig([])
CUPTI.enable!(cfg) do
    # do nothing
end

returns

ERROR: CUPTIError: CUPTI doesn't allow multiple callback subscribers. Only a single subscriber can be registered at a time. (code 39, CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED)