Rubbish values from GPU kernel

Hi,
I am trying to assemble the global stiffness matrix on the GPU using a coloring algorithm (i.e. grouping non-adjacent cells into the same color), so after coloring I launch one kernel per color, and by doing so we don’t need to use atomic addition. However, for every kernel launch the same structs get adapted again, which is a large overhead (although this implementation works really fine). To circumvent this, I adapt the structs once prior to the launches using the following code and pass them to the kernel. However, this implementation gives me rubbish values the first time the function is invoked (e.g. 30064771074), yet if I invoke the function a second time it works well. Does anybody know the source of this peculiar behaviour?

function assemble_global_gpu_color(cellvalues,dh,colors)
    K = create_sparsity_pattern(dh,Float32)
    Kgpu = CUSPARSE.CuSparseMatrixCSC(K)
    fgpu = CUDA.zeros(ndofs(dh))
    assembler = start_assemble(Kgpu, fgpu)
    n_colors = length(colors)
    # set up kernel adaption 
    dh_gpu = Adapt.adapt_structure(CUDA.KernelAdaptor(), dh)
    assembler_gpu = Adapt.adapt_structure(CUDA.KernelAdaptor(), assembler)
    cellvalues_gpu = Adapt.adapt_structure(CUDA.KernelAdaptor(), cellvalues)
    for i in 1:n_colors
        kernel = @cuda launch=false assemble_element_gpu!(assembler_gpu,cellvalues_gpu,dh_gpu,Int32(length(colors[i])),cu(colors[i]))
        #@show CUDA.registers(kernel)
        config = launch_configuration(kernel.fun)
        threads = min(length(colors[i]), config.threads)
        blocks =  cld(length(colors[i]), threads)
        kernel(assembler_gpu,cellvalues_gpu,dh_gpu,Int32(length(colors[i])),cu(colors[i]);  threads, blocks)
    end
    return Kgpu,fgpu
end
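
As a side note, here is a small host-side sanity check (just a sketch of mine, check_coloring is not a Ferrite function) that the coloring really is race-free: for this nodal interpolation two cells share dofs exactly when they share a node, so no two cells within one color may share a node. It uses the same getcells / get_node_ids accessors that appear further below.

function check_coloring(grid, colors)
    for (ci, color) in enumerate(colors)
        seen = Set{Int}()  # node ids touched so far by cells of this color
        for e in color
            for n in Ferrite.get_node_ids(Ferrite.getcells(grid, e))
                n in seen && error("color $ci has two cells sharing node $n")
                push!(seen, n)
            end
        end
    end
    return true
end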

Hi,

Could you provide a minimum working example (MWE), i.e. a small piece of code we could execute to reproduce the problem?

Based on your description, I would expect you are indeed just attempting to access incorrect memory, which would explain the garbage values. Perhaps within your kernel you later overwrite it with more sensible values. No idea why you would not have any problems without the prior adapt, though.

It’s not quite straightforward logic, but I’ll try to provide the critical pieces.
First and foremost, this is my kernel; it basically assembles the local stiffness matrices into the global matrix for a heat equation problem.

function assemble_element_gpu!(assembler,cv,dh,n_cells_colored, eles_colored)
    tx = threadIdx().x 
    bx = blockIdx().x
    bd = blockDim().x
    e_color = tx + (bx-Int32(1))*bd # element number per color
   
    e_color ≤ n_cells_colored || return nothing # e_color is the element index within the current color.
    n_basefuncs = getnbasefunctions(cv)
    e = eles_colored[e_color]
    cell_coords = getcoordinates(dh.grid, e)

    ke = MMatrix{4,4,Float32}(undef) # Note: using n_basefuncs instead of 4 will throw an error because this kind of dynamism is not supported on the GPU.
    fill!(ke, 0.0f0)
    fe = MVector{4,Float32}(undef)
    fill!(fe, 0.0f0)
     #Loop over quadrature points
     for qv in Ferrite.QuadratureValuesIterator(cv,cell_coords)
        ## Get the quadrature weight
        dΩ = getdetJdV(qv)
        ## Loop over test shape functions
        for i in 1:n_basefuncs
            δu  = shape_value(qv, i)
            ∇δu = shape_gradient(qv, i)
            ## Add contribution to fe
            fe[i] += δu * dΩ
            ## Loop over trial shape functions
            for j in 1:n_basefuncs
                ∇u = shape_gradient(qv, j)
                ## Add contribution to Ke
                ke[i,j] += (∇δu ⋅ ∇u) * dΩ
            end
        end
    end
   
    ## Assemble Ke into Kgpu ##
    assemble!(assembler, celldofs(dh,e), SMatrix(ke), SVector(fe)) # passing mutable objects throws an error

    return nothing
end

So, basically, in this kernel there are two sources of the issue:

  1. getcoordinates(dh.grid, e) and its implementation is as follows:
@inline function getcoordinates(grid::Ferrite.GPUGrid,e::Int32)
    # e is the element index.
    CT = get_coordinate_type(grid)
    cell = getcells(grid, e)
    N = nnodes(cell)
    x = MVector{N, CT}(undef) # local array to store the coordinates of the nodes of the cell.
    node_ids = get_node_ids(cell)
    
    for i in 1:length(x)
        #FIXME: Code for debugging only #[Rubbish No. 2]
        # Rubbish value for (node_ids[i])
        @cushow node_ids[i]

        x[i] = get_node_coordinate(grid, node_ids[i])
    end

    return SVector(x...)
end
  2. celldofs(dh,e) and its implementation is as follows:
function celldofs(dh::GPUDofHandler, i::Int32)
    offset = dh.cell_dofs_offset[i]
    ndofs = ndofs_per_cell(dh, i)

    #FIXME: Code for debugging only #[Rubbish No. 1]
    # ndofs gives other values instead of 4, even though it is set to 4 in the adapt_structure function
    @cushow ndofs

    return @view dh.cell_dofs[offset:(offset+ndofs-Int32(1))]
end

Other pieces of the code can be found in this PR:
Cuda heat example w quaditer by Abdelrahman912 · Pull Request #913 · Ferrite-FEM/Ferrite.jl (github.com)

Also, I’d like to add that when I call Adapt.adapt_structure(CUDA.KernelAdaptor(), dh) three times (three lines calling the same function), the problem magically vanishes.

Just a note on the side: Argument adaption is not the place to implement expensive conversions, you should use dedicated GPU objects, or a separate adaptor for that. CUDA.jl assumes that conversions with the built-in KernelAdaptor are cheap and can be done before every kernel launch, or even multiple times (e.g. when it just needs type information). This is also the reason we don’t support nested CuArrays, which would require an allocation during adaption.
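
For illustration, a minimal sketch of that pattern (MyGPUGrid, to_gpu and dummy_kernel! are made-up names, not Ferrite or CUDA.jl API): the expensive host-to-device copies happen once, in a dedicated GPU object, and the built-in KernelAdaptor then only has to rewrap the CuArrays as device arrays at every launch.

using CUDA, Adapt

struct MyGPUGrid{C <: AbstractVector, N <: AbstractVector}
    cells::C
    nodes::N
end

# Field-wise adaptation: only rewraps the arrays, no allocation involved.
Adapt.@adapt_structure MyGPUGrid

# The expensive host-to-device conversion, done once and kept alive on the host.
to_gpu(cells, nodes) = MyGPUGrid(cu(cells), cu(nodes))

function dummy_kernel!(grid)
    i = threadIdx().x + (blockIdx().x - Int32(1)) * blockDim().x
    i <= length(grid.cells) || return nothing
    @inbounds grid.cells[i] += Int32(1)
    return nothing
end

gpu_grid = to_gpu(collect(Int32, 1:64), rand(Float32, 64)) # CuArrays stay rooted in gpu_grid
@cuda threads=64 dummy_kernel!(gpu_grid)                   # KernelAdaptor only converts them to device views

Because the CuArrays stay referenced by gpu_grid for the whole assembly loop, the memory the device-side wrappers point to cannot be collected between launches.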


Adapt.adapt_structure(CUDA.KernelAdaptor(),...) also seems to be the culprit here. Changing from CUDA.KernelAdaptor() to something else and using Adapt.@adapt_structure Ferrite.GPUGrid etc seems to fix the issue. I assume that it boils down to e.g.

function Adapt.adapt_structure(to, grid::Grid)
    # map Int64 to Int32 to reduce number of registers
    cu_cells = grid.cells .|> (x -> Int32.(x.nodes)) .|> Quadrilateral |> cu
    cells = Adapt.adapt_structure(to, cu_cells)
    nodes = Adapt.adapt_structure(to, cu(grid.nodes))
    GPUGrid(cells,nodes)
end

being broken with KernelAdaptor, because the CuArrays are freed at the end of the function and hence the CuDeviceArrays refer to some dead memory? However, even if this is the case I do not really understand why calling adapt multiple times (or running the program multiple times) also fixes the issue.
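
To make that concrete, here is a minimal sketch (HostGrid and DeviceGrid are made-up stand-ins, not Ferrite’s types) of the pattern I suspect is problematic:

using CUDA, Adapt

struct HostGrid
    nodes::Vector{Float32}
end

struct DeviceGrid{V <: AbstractVector}
    nodes::V
end

function Adapt.adapt_structure(to, grid::HostGrid)
    tmp = cu(grid.nodes)       # fresh CuArray, rooted only by the local variable `tmp`
    dev = Adapt.adapt(to, tmp) # with KernelAdaptor this becomes a bare CuDeviceArray
    # once this function returns, nothing on the host references `tmp` anymore,
    # so the GC may free its buffer while `dev` still points into it
    return DeviceGrid(dev)
end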

Working example on the PR


function assemble_element_gpu!(#=assembler,cv,=#grid,n_cells_colored, eles_colored)
    tx = threadIdx().x 
    bx = blockIdx().x
    bd = blockDim().x
    e_color = tx + (bx-Int32(1))*bd # element number per color

    e_color ≤ n_cells_colored || return nothing # e_color is the element index within the current color.
    # @cushow e_color
    # n_basefuncs = getnbasefunctions(cv)
    e = eles_colored[e_color]
    # @cushow e
    cell_coords = getcoordinates(grid, e)

    # ke = MMatrix{4,4,Float32}(undef) # Note: using n_basefuncs instead of 4 will throw an error because this kind of dynamism is not supported on the GPU.
    # fill!(ke, 0.0f0)
    # fe = MVector{4,Float32}(undef)
    # fill!(fe, 0.0f0)
     #Loop over quadrature points
    #  for qv in Ferrite.QuadratureValuesIterator(cv,cell_coords)
    #     ## Get the quadrature weight
    #     dΩ = getdetJdV(qv)
    #     ## Loop over test shape functions
    #     for i in 1:n_basefuncs
    #         δu  = shape_value(qv, i)
    #         ∇δu = shape_gradient(qv, i)
    #         ## Add contribution to fe
    #         fe[i] += δu * dΩ
    #         ## Loop over trial shape functions
    #         for j in 1:n_basefuncs
    #             ∇u = shape_gradient(qv, j)
    #             ## Add contribution to Ke
    #             ke[i,j] += (∇δu ⋅ ∇u) * dΩ
    #         end
    #     end
    # end

    ## Assemble Ke into Kgpu ##
    # assemble!(assembler, celldofs(dh,e), SMatrix(ke), SVector(fe)) # passing mutable objects throws an error

    return nothing
end

Adapt.@adapt_structure Ferrite.GPUGrid

function assemble_global_gpu_color(cellvalues,dh,colors)
    K = create_sparsity_pattern(dh,Float32)
    # Kgpu = CUSPARSE.CuSparseMatrixCSC(K)
    fgpu = CUDA.zeros(ndofs(dh))
    # assembler = start_assemble(Kgpu, fgpu)
    n_colors = length(colors)
    # set up kernel adaption 
    # FIXME: The following three lines are necessary to circumvent getting rubbish values.
    # one call will make dofs[i] give 6 instead of 4. (ref: see `gpu_assembler.jl`)
    # second call will fix the first issue but will give rubbish values for node_ids[i] (ref: see `gpu_grid.jl`)
    # third call will fix everything.
    gpu_grid = Adapt.adapt_structure(CuArray, dh.grid)
    cellvalues_gpu = Adapt.adapt_structure(CUDA.KernelAdaptor(), cellvalues)
    for i in 1:n_colors
        color = colors[i]
        gpucolor = cu(color)
        @show gpucolor
        kernel = @cuda launch=false assemble_element_gpu!(gpu_grid,Int32(length(color)),gpucolor)
        #@show CUDA.registers(kernel)
        config = launch_configuration(kernel.fun)
        threads = 1#min(length(colors[i]), config.threads)
        blocks =  cld(length(color), threads)
        kernel(gpu_grid,Int32(length(color)),gpucolor;  threads, blocks)
    end
    # return Kgpu,fgpu
end

To break it, switch to:

...
gpu_grid = Adapt.adapt_structure(CUDA.KernelAdaptor(), dh.grid)
...

I cannot produce any meaningful error messages beyond

Out-of-bounds array access
Stacktrace:
 [1] throw_boundserror at /home/dogiermann/.julia/packages/CUDA/75aiI/src/device/quirks.jl:15
 [2] multiple call sites at unknown:0

on CUDA 5.4.2.


Thanks @termi-official for the tip, I tried your solution and it worked :saluting_face: