Rubbish values from GPU kernel

Hi,
I am trying to assemble the global stiffness matrix on the GPU using a coloring algorithm (i.e. grouping non-adjacent cells into the same color), so after coloring I launch one kernel per color, and by doing so we don’t need to use atomic addition. However, for every kernel launch the same structs get adapted again, which is a large overhead (although this implementation works really fine). To circumvent this, I adapt the structs once prior to the launches using the following code and pass them to the kernel. However, this implementation gives me rubbish values the first time the function is invoked (e.g. 30064771074), yet if I invoke the function a second time it works well. Does anybody know the source of this peculiar behaviour?

function assemble_global_gpu_color(cellvalues,dh,colors)
    K = create_sparsity_pattern(dh,Float32)
    Kgpu = CUSPARSE.CuSparseMatrixCSC(K)
    fgpu = CUDA.zeros(ndofs(dh))
    assembler = start_assemble(Kgpu, fgpu)
    n_colors = length(colors)
    # set up kernel adaption 
    dh_gpu = Adapt.adapt_structure(CUDA.KernelAdaptor(), dh)
    assembler_gpu = Adapt.adapt_structure(CUDA.KernelAdaptor(), assembler)
    cellvalues_gpu = Adapt.adapt_structure(CUDA.KernelAdaptor(), cellvalues)
    for i in 1:n_colors
        kernel = @cuda launch=false assemble_element_gpu!(assembler_gpu,cellvalues_gpu,dh_gpu,Int32(length(colors[i])),cu(colors[i]))
        #@show CUDA.registers(kernel)
        config = launch_configuration(kernel.fun)
        threads = min(length(colors[i]), config.threads)
        blocks =  cld(length(colors[i]), threads)
        kernel(assembler_gpu,cellvalues_gpu,dh_gpu,Int32(length(colors[i])),cu(colors[i]);  threads, blocks)
    end
    return Kgpu,fgpu
end
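
As a side note, here is a small host-side sanity check (just a sketch of mine, check_coloring is not a Ferrite function) that the coloring really is race-free: for this nodal interpolation two cells share dofs exactly when they share a node, so no two cells within one color may share a node. It uses the same getcells / get_node_ids accessors that appear further below.

function check_coloring(grid, colors)
    for (ci, color) in enumerate(colors)
        seen = Set{Int}()  # node ids touched so far by cells of this color
        for e in color
            for n in Ferrite.get_node_ids(Ferrite.getcells(grid, e))
                n in seen && error("color $ci has two cells sharing node $n")
                push!(seen, n)
            end
        end
    end
    return true
end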

Hi,

Could you provide a minimum working example (MWE), i.e. a small piece of code we could execute to reproduce the problem?

Based on your description, I would expect you are indeed just attempting to access incorrect memory, which would explain the garbage values. Perhaps within your kernel you later overwrite it with more sensible values. No idea why you would not have any problems without the prior adapt, though.

It’s not quite straightforward logic, but I’ll try to provide the critical pieces.
First and foremost, this is my kernel; it basically assembles the local stiffness matrices into the global matrix for a heat equation problem.

function assemble_element_gpu!(assembler,cv,dh,n_cells_colored, eles_colored)
    tx = threadIdx().x 
    bx = blockIdx().x
    bd = blockDim().x
    e_color = tx + (bx-Int32(1))*bd # element number per color
   
    e_color ≤ n_cells_colored || return nothing # e_color is the element index within the current color.
    n_basefuncs = getnbasefunctions(cv)
    e = eles_colored[e_color]
    cell_coords = getcoordinates(dh.grid, e)

    ke = MMatrix{4,4,Float32}(undef) # Note: using n_basefuncs instead of 4 will throw an error because this kind of dynamism is not supported on the GPU.
    fill!(ke, 0.0f0)
    fe = MVector{4,Float32}(undef)
    fill!(fe, 0.0f0)
     #Loop over quadrature points
     for qv in Ferrite.QuadratureValuesIterator(cv,cell_coords)
        ## Get the quadrature weight
        dΩ = getdetJdV(qv)
        ## Loop over test shape functions
        for i in 1:n_basefuncs
            δu  = shape_value(qv, i)
            ∇δu = shape_gradient(qv, i)
            ## Add contribution to fe
            fe[i] += δu * dΩ
            ## Loop over trial shape functions
            for j in 1:n_basefuncs
                ∇u = shape_gradient(qv, j)
                ## Add contribution to Ke
                ke[i,j] += (∇δu ⋅ ∇u) * dΩ
            end
        end
    end
   
    ## Assemble Ke into Kgpu ##
    assemble!(assembler, celldofs(dh,e), SMatrix(ke), SVector(fe)) # passing mutable objects throws an error

    return nothing
end

So, basically, in this kernel there are two sources of the issue:

  1. getcoordinates(dh.grid, e) and its implementation is as follows:
@inline function getcoordinates(grid::Ferrite.GPUGrid,e::Int32)
    # e is the element index.
    CT = get_coordinate_type(grid)
    cell = getcells(grid, e)
    N = nnodes(cell)
    x = MVector{N, CT}(undef) # local array to store the coordinates of the nodes of the cell.
    node_ids = get_node_ids(cell)
    
    for i in 1:length(x)
        #FIXME: Code for debugging only #[Rubbish No. 2]
        # Rubbish value for (node_ids[i])
        @cushow node_ids[i]

        x[i] = get_node_coordinate(grid, node_ids[i])
    end

    return SVector(x...)
end
  2. celldofs(dh,e) and its implementation is as follows:
function celldofs(dh::GPUDofHandler, i::Int32)
    offset = dh.cell_dofs_offset[i]
    ndofs = ndofs_per_cell(dh, i)

    #FIXME: Code for debugging only #[Rubbish No. 1]
    # ndofs gives other values instead of 4, even though it is set to 4 in the adapt_structure function
    @cushow ndofs

    return @view dh.cell_dofs[offset:(offset+ndofs-Int32(1))]
end

Other pieces of the code can be found in this PR:
Cuda heat example w quaditer by Abdelrahman912 · Pull Request #913 · Ferrite-FEM/Ferrite.jl (github.com)

Also, I’d like to add that when I call Adapt.adapt_structure(CUDA.KernelAdaptor(), dh) three times (three lines calling the same function), the problem magically vanishes.

Just a note on the side: Argument adaption is not the place to implement expensive conversions, you should use dedicated GPU objects, or a separate adaptor for that. CUDA.jl assumes that conversions with the built-in KernelAdaptor are cheap and can be done before every kernel launch, or even multiple times (e.g. when it just needs type information). This is also the reason we don’t support nested CuArrays, which would require an allocation during adaption.
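
For illustration, a minimal sketch of that pattern (MyGPUGrid, to_gpu and dummy_kernel! are made-up names, not Ferrite or CUDA.jl API): the expensive host-to-device copies happen once, in a dedicated GPU object, and the built-in KernelAdaptor then only has to rewrap the CuArrays as device arrays at every launch.

using CUDA, Adapt

struct MyGPUGrid{C <: AbstractVector, N <: AbstractVector}
    cells::C
    nodes::N
end

# Field-wise adaptation: only rewraps the arrays, no allocation involved.
Adapt.@adapt_structure MyGPUGrid

# The expensive host-to-device conversion, done once and kept alive on the host.
to_gpu(cells, nodes) = MyGPUGrid(cu(cells), cu(nodes))

function dummy_kernel!(grid)
    i = threadIdx().x + (blockIdx().x - Int32(1)) * blockDim().x
    i <= length(grid.cells) || return nothing
    @inbounds grid.cells[i] += Int32(1)
    return nothing
end

gpu_grid = to_gpu(collect(Int32, 1:64), rand(Float32, 64)) # CuArrays stay rooted in gpu_grid
@cuda threads=64 dummy_kernel!(gpu_grid)                   # KernelAdaptor only converts them to device views

Because the CuArrays stay referenced by gpu_grid for the whole assembly loop, the memory the device-side wrappers point to cannot be collected between launches.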


Adapt.adapt_structure(CUDA.KernelAdaptor(),...) also seems to be the culprit here. Changing from CUDA.KernelAdaptor() to something else and using Adapt.@adapt_structure Ferrite.GPUGrid etc seems to fix the issue. I assume that it boils down to e.g.

function Adapt.adapt_structure(to, grid::Grid)
    # map Int64 to Int32 to reduce number of registers
    cu_cells = grid.cells .|> (x -> Int32.(x.nodes)) .|> Quadrilateral |> cu
    cells = Adapt.adapt_structure(to, cu_cells)
    nodes = Adapt.adapt_structure(to, cu(grid.nodes))
    GPUGrid(cells,nodes)
end

being broken with KernelAdaptor, because the CuArrays are freed at the end of the function and hence the CuDeviceArrays refer to some dead memory? However, even if this is the case I do not really understand why calling adapt multiple times (or running the program multiple times) also fixes the issue.
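
To make that concrete, here is a minimal sketch (HostGrid and DeviceGrid are made-up stand-ins, not Ferrite’s types) of the pattern I suspect is problematic:

using CUDA, Adapt

struct HostGrid
    nodes::Vector{Float32}
end

struct DeviceGrid{V <: AbstractVector}
    nodes::V
end

function Adapt.adapt_structure(to, grid::HostGrid)
    tmp = cu(grid.nodes)       # fresh CuArray, rooted only by the local variable `tmp`
    dev = Adapt.adapt(to, tmp) # with KernelAdaptor this becomes a bare CuDeviceArray
    # once this function returns, nothing on the host references `tmp` anymore,
    # so the GC may free its buffer while `dev` still points into it
    return DeviceGrid(dev)
end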

Working example on the PR


function assemble_element_gpu!(#=assembler,cv,=#grid,n_cells_colored, eles_colored)
    tx = threadIdx().x 
    bx = blockIdx().x
    bd = blockDim().x
    e_color = tx + (bx-Int32(1))*bd # element number per color

    e_color ≤ n_cells_colored || return nothing # e_color is the element index within the current color.
    # @cushow e_color
    # n_basefuncs = getnbasefunctions(cv)
    e = eles_colored[e_color]
    # @cushow e
    cell_coords = getcoordinates(grid, e)

    # ke = MMatrix{4,4,Float32}(undef) # Note: using n_basefuncs instead of 4 will throw an error because this kind of dynamism is not supported on the GPU.
    # fill!(ke, 0.0f0)
    # fe = MVector{4,Float32}(undef)
    # fill!(fe, 0.0f0)
     #Loop over quadrature points
    #  for qv in Ferrite.QuadratureValuesIterator(cv,cell_coords)
    #     ## Get the quadrature weight
    #     dΩ = getdetJdV(qv)
    #     ## Loop over test shape functions
    #     for i in 1:n_basefuncs
    #         δu  = shape_value(qv, i)
    #         ∇δu = shape_gradient(qv, i)
    #         ## Add contribution to fe
    #         fe[i] += δu * dΩ
    #         ## Loop over trial shape functions
    #         for j in 1:n_basefuncs
    #             ∇u = shape_gradient(qv, j)
    #             ## Add contribution to Ke
    #             ke[i,j] += (∇δu ⋅ ∇u) * dΩ
    #         end
    #     end
    # end

    ## Assemble Ke into Kgpu ##
    # assemble!(assembler, celldofs(dh,e), SMatrix(ke), SVector(fe)) # passing mutable objects throws an error

    return nothing
end

Adapt.@adapt_structure Ferrite.GPUGrid

function assemble_global_gpu_color(cellvalues,dh,colors)
    K = create_sparsity_pattern(dh,Float32)
    # Kgpu = CUSPARSE.CuSparseMatrixCSC(K)
    fgpu = CUDA.zeros(ndofs(dh))
    # assembler = start_assemble(Kgpu, fgpu)
    n_colors = length(colors)
    # set up kernel adaption 
    # FIXME: The following three lines are necessary to circumvent getting rubbish values.
    # one call will make dofs[i] give 6 instead of 4. (ref: see `gpu_assembler.jl`)
    # second call will fix the first issue but will give rubbish values for node_ids[i] (ref: see `gpu_grid.jl`)
    # third call will fix everything.
    gpu_grid = Adapt.adapt_structure(CuArray, dh.grid)
    cellvalues_gpu = Adapt.adapt_structure(CUDA.KernelAdaptor(), cellvalues)
    for i in 1:n_colors
        color = colors[i]
        gpucolor = cu(color)
        @show gpucolor
        kernel = @cuda launch=false assemble_element_gpu!(gpu_grid,Int32(length(color)),gpucolor)
        #@show CUDA.registers(kernel)
        config = launch_configuration(kernel.fun)
        threads = 1#min(length(colors[i]), config.threads)
        blocks =  cld(length(color), threads)
        kernel(gpu_grid,Int32(length(color)),gpucolor;  threads, blocks)
    end
    # return Kgpu,fgpu
end

To break it, switch to:

...
gpu_grid = Adapt.adapt_structure(CUDA.KernelAdaptor(), dh.grid)
...

I cannot produce any meaningful error messages beyond

Out-of-bounds array access
Stacktrace:
 [1] throw_boundserror at /home/dogiermann/.julia/packages/CUDA/75aiI/src/device/quirks.jl:15
 [2] multiple call sites at unknown:0

on CUDA 5.4.2.


Thanks @termi-official for the tip, I tried your solution and it worked :saluting_face: