Ragged Tensors with generic GPU code

I am working on porting (parts of) a hydrological model to GPU to improve performance. This is largely going well thanks to the KernelAbstractions.jl and AcceleratedKernels.jl packages, which make it quite straightforward to write code that runs on both CPU and GPU.

However, there is part of the model which I am unsure of how to port to GPU properly.

The model performs computations on a graph network, which stores a mapping of source and destination nodes/edges as a ragged array.

A simplified version of the scheme I am trying to solve would be:

input = collect(Float64, 1:6)
output = zeros(3)

io_map = [
    [1],
    [2, 3],
    [4, 5, 6],
]

for i in eachindex(output)
    output[i] = sum(input[io_map[i]])
end

This structure is very similar to TensorFlow's ragged tensors, which do seem to be supported by the underlying toolchains (ROCm and CUDA, at least).

Ideally I would use these ragged tensors, since the code could keep a structure close to what it already has, but I don't see any support or documentation for them in AMDGPU.jl, CUDA.jl, or any other packages. Has anyone else encountered and/or solved this already?
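For comparison, one standard GPU-friendly encoding of ragged data (a general technique, not something any of these packages provide out-of-the-box) is to flatten the ragged map into a single flat index vector plus an offsets vector, CSR-style. A minimal CPU sketch of that idea:

```julia
# CSR-style flattening of the ragged io_map: all indices in one flat
# vector, plus offsets marking where each output's slice begins.
input = collect(Float64, 1:6)
io_map = [[1], [2, 3], [4, 5, 6]]

flat = reduce(vcat, io_map)             # [1, 2, 3, 4, 5, 6]
offsets = cumsum([1; length.(io_map)])  # [1, 2, 4, 7]

output = zeros(length(io_map))
for i in eachindex(output)
    for k in offsets[i]:(offsets[i + 1] - 1)
        output[i] += input[flat[k]]
    end
end
# output == [1.0, 5.0, 15.0]
```

Both `flat` and `offsets` are ordinary rectangular buffers that upload to the GPU directly, and the inner loop is exactly the per-thread pattern a kernel would run, with no padding or masking needed.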


The alternative would be to use matrix multiplication:

...

io_mat = Int8[
    1 0 0
    0 1 0
    0 1 0
    0 0 1
    0 0 1
    0 0 1
]

output = vec(sum(io_mat .* input; dims = 1))

But this has two downsides: 1. I would need to overhaul the code and logic a lot, and 2. I would need to get sparse matrices working on my (AMD) GPU, which ROCm doesn't seem to support out-of-the-box.
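For what it's worth, on the CPU side the sparse route is straightforward with the SparseArrays standard library; the open question is only the AMD GPU support. Note that the broadcast-and-sum above is really just a matrix–vector product, which is the form where sparsity pays off:

```julia
using SparseArrays

input = collect(Float64, 1:6)
io_mat = Int8[
    1 0 0
    0 1 0
    0 1 0
    0 0 1
    0 0 1
    0 0 1
]

# sum(io_mat .* input; dims = 1) is equivalent to a matvec with the
# transpose, and the sparse form stores only the 6 nonzero entries.
io_sparse = sparse(transpose(io_mat))   # 3×6 sparse matrix
output = io_sparse * input
# output == [1.0, 5.0, 15.0]
```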

A workaround is to make the ragged tensors here dense:

function ragged_to_dense(ragged::Vector{Vector{T}}) where {T}
    nrows = length(ragged)
    ncols = maximum(length.(ragged))
    dense = zeros(T, nrows, ncols)  # zero-padding is a safe sentinel, since valid indices start at 1

    for i in eachindex(ragged)
        for j in eachindex(ragged[i])
            dense[i, j] = ragged[i][j]
        end
    end
    return dense
end

Now, with the dense arrays, I need to mask out the zero values when computing the sum:

function sum_at(A, inds)
    v = zero(eltype(A))
    for i in inds
        if i != 0
            v += A[i]
        end
    end
    return v
end

dense = ragged_to_dense(io_map)
for i in eachindex(output)
    output[i] = sum_at(input, view(dense, i, :))  # view avoids copying each row
end

The vector output then holds the same result as in the example above, but built from rectangular arrays that can be ported to the GPU. With this layout, making the arrays sparse also becomes less of a necessity.
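To make the dense-index workaround concrete on the GPU side, here is a sketch of what the per-row sum could look like as a KernelAbstractions kernel (untested on ROCm; the kernel body simply inlines the sum_at logic, and the kernel name is my own):

```julia
using KernelAbstractions

# One thread per output element: walk that row of the dense index
# matrix and skip the zero padding, exactly like sum_at above.
@kernel function rowsum!(output, @Const(input), @Const(dense))
    i = @index(Global)
    acc = zero(eltype(output))
    for j in axes(dense, 2)
        idx = dense[i, j]
        if idx != 0
            acc += input[idx]
        end
    end
    output[i] = acc
end

# Runs on plain CPU arrays as-is; swapping in ROCArray/CuArray buffers
# should select the matching backend via get_backend.
input = collect(Float64, 1:6)
dense = [1 0 0; 2 3 0; 4 5 6]   # ragged_to_dense(io_map)
output = zeros(3)

backend = get_backend(output)
rowsum!(backend)(output, input, dense; ndrange = length(output))
KernelAbstractions.synchronize(backend)
# output == [1.0, 5.0, 15.0]
```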

If you’re okay with converting from ragged to dense, why not just convert from ragged to sparse?