KernelAbstractions for splines

I’m trying to code a simple b-spline (and then NURBS) on the GPU using kernels to speed up a ton of simultaneous evaluations.

I’m having trouble because Cox-deBoor is recursive and throws an error when I try to use this as a kernel. Alternatively, I could use de Boor’s method but this requires allocating an array, which is also not allowed in a kernel. In the code below, I avoid the problem by “manually” unrolling the recursion in Cox-de Boor, but I’m sure there is a better way.

There is also the issue that I need the spline data on the GPU, and I think I have correctly handled this using a wrapper type, but perhaps this isn’t optimal either.

#wrapper type
struct BSplineCurve{A<:AbstractArray,V<:AbstractVector,T<:Int} <: Function
    pnts::A
    knots::V
    degree::T
end
using StaticArrays
function BSplineCurve(pnts;degree=3)
    count,T = size(pnts, 2),promote_type(eltype(pnts),Float32)
    knots = [zeros(degree); collect(range(0, count-degree) / (count-degree)); ones(degree)]
    BSplineCurve(pnts,SA{T}[knots...],degree)
end
function (l::BSplineCurve)(u::T) where T
    n = size(l.pnts, 2)
    pt = SA{T}[0,0]
    for k in 1:n
        pt += Bd(l.knots,u,k,l.degree)*l.pnts[:,k]
    end
    pt
end
# terrible unrolled recursion 
function Bd(knots, u, k, d)
    d==1 && return Bd1(knots,u,k)
    d==2 && return Bd2(knots,u,k)
    return Bd3(knots,u,k)
end
function Bd3(knots, u, k)
    d=3
    α = ifelse(knots[k]<u≤knots[k+d],(u-knots[k])/(knots[k+d]-knots[k])*Bd2(knots,u,k),0f0)
    β = ifelse(knots[k+1]≤u<knots[k+d+1],(knots[k+d+1]-u)/(knots[k+d+1]-knots[k+1])*Bd2(knots,u,k+1),0f0)
    α+β
end
function Bd2(knots, u, k)
    d=2
    α = ifelse(knots[k]<u≤knots[k+d],(u-knots[k])/(knots[k+d]-knots[k])*Bd1(knots,u,k),0f0)
    β = ifelse(knots[k+1]≤u<knots[k+d+1],(knots[k+d+1]-u)/(knots[k+d+1]-knots[k+1])*Bd1(knots,u,k+1),0f0)
    α+β
end
function Bd1(knots, u, k)
    d=1
    α = ifelse(knots[k]<u≤knots[k+d],(u-knots[k])/(knots[k+d]-knots[k]),0f0)
    β = ifelse(knots[k+1]≤u<knots[k+d+1],(knots[k+d+1]-u)/(knots[k+d+1]-knots[k+1]),0f0)
    α+β
end

# Does it work with KernelAbstractions?
using CUDA, KernelAbstractions
@kernel function _test!(a::AbstractArray,l::BSplineCurve)
    # Map index to physical space
    I = @index(Global)
    u = (I-1)/(length(a)-1)
    q = l(u)
    a[I] = q'*q # just want some scalar output
end
test!(a,l)=_test!(get_backend(a),64)(a,l,ndrange=length(a))

a = CUDA.zeros(64)
square = BSplineCurve(SA[5 5 0 -5 -5 -5  0  5 5
                         0 5 5  5  0 -5 -5 -5 0],degree=1)

test!(a,square)
a|>Array # yes!

Your pt = SA{T}[0.0] in l looks wrong; that creates an Array{SA{T}}

It’s SA{T}[0, 0] , so I think it makes a 2D static vector of the correct type without allocating an array. I was more looking for ways around the recursion issue, but if you have a better approach, I’m happy to hear it. :blush:

Who thought this was a good idea?

  SA{T}[ elements ]

  Create SArray literals using array construction syntax. The element type is
  inferred by promoting elements to a common type or set to T when T is
  provided explicitly.

  Examples:

Very easy to confuse with the normal Julia syntax…

but this requires allocating an array

If used with care MArray works on the GPU.

is recursive

There is some fundamental theorem of computing that states that every recursive algorithm can be implemented as an iterative one, but of course you still need workspace (hence MArray).

If you know that you need to unroll it a fixed amount. You can also use a Val{N} and Julia will happily inline the entirety.

So it would be something like:

function Bd1(knots, u, k, ::Val{d}) where d
    if d == 1
        factor = 1
    else
       factor = Bd(knots, u, k+1, Val(d-1))
    end

    α = ifelse(knots[k]<u≤knots[k+d],(u-knots[k])/(knots[k+d]-knots[k])*factor,0f0)
    β = ifelse(knots[k+1]≤u<knots[k+d+1],(knots[k+d+1]-u)/(knots[k+d+1]-knots[k+1])*factor,0f0)
    α+β
end

Isn’t the iterative implementation simply:

function Bd(knots, u, k, max_d)
factor = 1.0

for d in 1:max_d
   α = ifelse(knots[k]<u≤knots[k+d],(u-knots[k])/(knots[k+d]-knots[k])*factor,0f0)
   β = ifelse(knots[k+1]≤u<knots[k+d+1],(knots[k+d+1]-u)/(knots[k+d+1]-knots[k+1])*factor,0f0)
   factor = α+β
end
return factor

Thanks for the tips!

The de Boor algorithm is a stable iterative version of Cox-de Boor, but it requires an updateable array. I’ll check into MArray, I haven’t used those before. What do I need to be careful of, in order to make the GPU happy with it? Do I need to allocate it within the struct, or within the l function definition?

Edit: I just tried using ::Val{d}. This seems to work:

struct BSplineCurve{d,A<:AbstractArray,V<:AbstractVector} <: Function
    pnts::A
    knots::V
end
using StaticArrays
function BSplineCurve(pnts;degree=3)
    count,T = size(pnts, 2),promote_type(eltype(pnts),Float32)
    knots = SA{T}[[zeros(degree); collect(range(0, count-degree) / (count-degree)); ones(degree)]...]
    BSplineCurve{degree,typeof(pnts),typeof(knots)}(pnts,knots)
end
function (l::BSplineCurve{d})(u::T) where {T,d}
    pt = SA{T}[0, 0]
    for k in 1:size(l.pnts, 2)
        pt += Bd(l.knots,u,k,Val(d))*l.pnts[:,k]
    end
    pt
end
Bd(knots, u, k, ::Val{0}) = 1
function Bd(knots, u, k, ::Val{d}) where d
    B = 0
    knots[k]<u≤knots[k+d] && (B+=(u-knots[k])/(knots[k+d]-knots[k])*Bd(knots,u,k,Val(d-1)))
    knots[k+1]≤u<knots[k+d+1] && (B+=(knots[k+d+1]-u)/(knots[k+d+1]-knots[k+1])*Bd(knots,u,k+1,Val(d-1)))
    return B
end
1 Like

@vchuravy I’m Sorry to dig this up from six months ago; I am working on extending this B-Spline code to enable modification of the control points. I am using this exact code (that’s why I didn’t make a new post).

If the code is run on the CPU, specifying the input of the BSplineCurve constructor as MArrays does the trick (I can change the control points). However, this doesn’t run on the GPU as MArray is not an isbit.

It seems MArray cannot be used on the GPU (I am not sure).

Do you see a simple fix for this, or making it work on the GPU with the possibility of changing BSplineCurve.pnts with new values would require writing all this into CUDA code using Array/CuArray for storage?

Thanks in advance!