Ragged Tensors with generic GPU code

I am working on porting (parts of) a hydrological model to GPU to improve performance. This is largely going well thanks to the KernelAbstractions.jl and AcceleratedKernels.jl packages, which make it quite straightforward to write code that runs on both CPU and GPU.

However, there is part of the model which I am unsure of how to port to GPU properly.

The model performs computations on a graph network, which stores a mapping of source and destination nodes/edges as a ragged array.

A simplified version of the scheme I am trying to solve would be:

input = collect(Float64, 1:6)
output = zeros(3)

io_map = [
    [1],
    [2, 3],
    [4, 5, 6],
]

for i in eachindex(output)
    output[i] = sum(input[io_map[i]])
end

This structure is very similar to TensorFlow's ragged tensors, which do seem to be supported by the underlying toolchains (ROCm and CUDA, at least).

Ideally I would use these ragged tensors, since the code could keep a structure close to what it already has, but I don't see any support or documentation for them in AMDGPU.jl, CUDA.jl, or any other packages. Has anyone else encountered and/or solved this already?
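For comparison, one standard GPU-friendly encoding of ragged data (a general technique, not something any of these packages provide out-of-the-box) is to flatten the ragged map into a single flat index vector plus an offsets vector, CSR-style. A minimal CPU sketch of that idea:

```julia
# CSR-style flattening of the ragged io_map: all indices in one flat
# vector, plus offsets marking where each output's slice begins.
input = collect(Float64, 1:6)
io_map = [[1], [2, 3], [4, 5, 6]]

flat = reduce(vcat, io_map)             # [1, 2, 3, 4, 5, 6]
offsets = cumsum([1; length.(io_map)])  # [1, 2, 4, 7]

output = zeros(length(io_map))
for i in eachindex(output)
    for k in offsets[i]:(offsets[i + 1] - 1)
        output[i] += input[flat[k]]
    end
end
# output == [1.0, 5.0, 15.0]
```

Both `flat` and `offsets` are ordinary rectangular buffers that upload to the GPU directly, and the inner loop is exactly the per-thread pattern a kernel would run, with no padding or masking needed.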


The alternative would be to use matrix multiplication:

...

io_mat = Int8[
    1 0 0
    0 1 0
    0 1 0
    0 0 1
    0 0 1
    0 0 1
]

output = vec(sum(io_mat .* input; dims = 1))

But this has two downsides: 1. I would need to overhaul the code and logic a lot, and 2. I would need to get sparse matrices working on my (AMD) GPU, which ROCm doesn't seem to support out-of-the-box.
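For what it's worth, on the CPU side the sparse route is straightforward with the SparseArrays standard library; the open question is only the AMD GPU support. Note that the broadcast-and-sum above is really just a matrix–vector product, which is the form where sparsity pays off:

```julia
using SparseArrays

input = collect(Float64, 1:6)
io_mat = Int8[
    1 0 0
    0 1 0
    0 1 0
    0 0 1
    0 0 1
    0 0 1
]

# sum(io_mat .* input; dims = 1) is equivalent to a matvec with the
# transpose, and the sparse form stores only the 6 nonzero entries.
io_sparse = sparse(transpose(io_mat))   # 3×6 sparse matrix
output = io_sparse * input
# output == [1.0, 5.0, 15.0]
```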

A workaround is to make the ragged tensors here dense:

function ragged_to_dense(ragged::Vector{Vector{T}}) where {T}
    nrows = length(ragged)
    ncols = maximum(length.(ragged))
    dense = zeros(T, nrows, ncols)  # zero-padding is a safe sentinel, since valid indices start at 1

    for i in eachindex(ragged)
        for j in eachindex(ragged[i])
            dense[i, j] = ragged[i][j]
        end
    end
    return dense
end

Now, with the dense arrays, I need to mask out the zero values when computing the sum:

function sum_at(A, inds)
    v = zero(eltype(A))
    for i in inds
        if i != 0
            v += A[i]
        end
    end
    return v
end

dense = ragged_to_dense(io_map)
for i in eachindex(output)
    output[i] = sum_at(input, view(dense, i, :))  # view avoids copying each row
end

The vector output then holds the same result as in the example above, but built from rectangular arrays that can be ported to the GPU. With this layout, making the arrays sparse also becomes less of a necessity.
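To make the dense-index workaround concrete on the GPU side, here is a sketch of what the per-row sum could look like as a KernelAbstractions kernel (untested on ROCm; the kernel body simply inlines the sum_at logic, and the kernel name is my own):

```julia
using KernelAbstractions

# One thread per output element: walk that row of the dense index
# matrix and skip the zero padding, exactly like sum_at above.
@kernel function rowsum!(output, @Const(input), @Const(dense))
    i = @index(Global)
    acc = zero(eltype(output))
    for j in axes(dense, 2)
        idx = dense[i, j]
        if idx != 0
            acc += input[idx]
        end
    end
    output[i] = acc
end

# Runs on plain CPU arrays as-is; swapping in ROCArray/CuArray buffers
# should select the matching backend via get_backend.
input = collect(Float64, 1:6)
dense = [1 0 0; 2 3 0; 4 5 6]   # ragged_to_dense(io_map)
output = zeros(3)

backend = get_backend(output)
rowsum!(backend)(output, input, dense; ndrange = length(output))
KernelAbstractions.synchronize(backend)
# output == [1.0, 5.0, 15.0]
```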

If you’re okay with converting from ragged to dense, why not just convert from ragged to sparse?