Laguerre polynomials on GPU

Hello,

I’m currently able to calculate the generalized Laguerre polynomials on the CPU

using HypergeometricFunctions
_genlaguerre(n, α, x) = binomial(n+α,n) * HypergeometricFunctions.M(-n, α+1, x)

A = rand(1000, 1000)
_genlaguerre.(20, 3, A)

The reason I’m using HypergeometricFunctions.jl instead of SpecialPolynomials.jl is that the former is way faster.

Now, I want to do the same on a GPU, but it fails

using CUDA
CUDA.allowscalar(false)
A = CUDA.rand(1000, 1000)

_genlaguerre.(20, 3, A)


GPU broadcast resulted in non-concrete element type AbstractFloat.
This probably means that the function you are broadcasting contains an error or type instability.

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] copy
   @ ~/.julia/packages/GPUArrays/TnEpb/src/host/broadcast.jl:34 [inlined]
 [3] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(_genlaguerre), Tuple{Int64, Int64, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}})
   @ Base.Broadcast ./broadcast.jl:873
 [4] top-level scope
   @ In[19]:1

How can I add fast and stable support for Laguerre polynomials on the GPU?

Try running this call to _genlaguerre through @code_typed to see where the type instabilities are; it might be that this package mixes Float64 and Float32 (the Float32 coming from your input array).
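For example, something along these lines (a minimal sketch; it assumes that inspecting the scalar call with a Float32 argument, which is what CUDA.rand produces by default, mirrors what the GPU broadcast does):

using InteractiveUtils
# look for Float64 results or non-concrete (red) types in the output
@code_warntype _genlaguerre(20, 3, 1.0f0)
@code_typed _genlaguerre(20, 3, 1.0f0)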

I believe this package depends on compiled binaries accessed via ccall… I don’t know if that can work on the GPU

As a quick workaround, maybe you can implement the recurrence relation in just a few lines of code, which should be very fast

I tried to use the following recursive function

using LinearAlgebra: BlasFloat

function gen_laguerre(n::Integer, alpha::Number, x::T)::T where T <: BlasFloat
    n == 0 && return 1
    n == 1 && return 1 + alpha - x
    return ((2*n-1+alpha-x)*gen_laguerre(n-1,alpha,x) - (n-1+alpha)*gen_laguerre(n-2,alpha,x)) / n
end

but, as discussed here, it triggers an illegal memory access error when run on the GPU. Moreover, trying it on the CPU, I noticed that it is very slow compared to the HypergeometricFunctions.jl solution (the naive recursion recomputes the lower-order polynomials over and over).

The problem seems to be that HypergeometricFunctions.M might call HypergeometricFunctions._₁F₁general which calls gamma on an integer argument, resulting in promotion from Int64 to Float64.

If you pass only Float32 arguments, it will probably work, e.g.

function _genlaguerre(n::Integer, α::Integer, x::T) where {T<:AbstractFloat} 
    binomial(n+α, n) * HypergeometricFunctions.M(T(-n), T(α+1), x)
end

Still the same error

function _genlaguerre(n::Integer, α::Integer, x::T) where {T<:AbstractFloat} 
    binomial(n+α, n) * HypergeometricFunctions.M(T(-n), T(α+1), x)
end

A = CUDA.rand(100)
_genlaguerre.(10, 3, A)

GPU broadcast resulted in non-concrete element type Any.
This probably means that the function you are broadcasting contains an error or type instability.

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] copy
   @ ~/.julia/packages/GPUArrays/TnEpb/src/host/broadcast.jl:34 [inlined]
 [3] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(_genlaguerre2), Tuple{Int64, Int64, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}})
   @ Base.Broadcast ./broadcast.jl:873
 [4] top-level scope
   @ In[26]:2

I got a different error, but yes, it still does not work.

I found another simple method which is much faster than the recursive one, but a little slower than the HypergeometricFunctions.jl solution. However, it fails to compile on the GPU.

using SpecialFunctions
using LinearAlgebra: BlasFloat
using CUDA
CUDA.allowscalar(false)

function laguerre(n::Int, alpha::Int, x::T) where {T<:BlasFloat}
    L = zero(x)
    for k = 0:n
        L += ((-1)^k * binomial(n + alpha, n - k) * x^k) / factorial(k)
    end
    return L
end

A = CUDA.rand(100)

laguerre.(5, 1, A)

InvalidIRError: compiling MethodInstance for (::GPUArrays.var"#broadcast_kernel#26")(::CUDA.CuKernelContext, ::CuDeviceVector{Float32, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(laguerre), Tuple{Int64, Int64, Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to print_to_string(xs...) @ Base strings/io.jl:133)
Stacktrace:
 [1] string
   @ ./strings/io.jl:185
 [2] factorial_lookup
   @ ./combinatorics.jl:19
 [3] factorial
   @ ./combinatorics.jl:27
 [4] laguerre
   @ ./In[13]:6
 [5] _broadcast_getindex_evalf
   @ ./broadcast.jl:683
 [6] _broadcast_getindex
   @ ./broadcast.jl:656
 [7] getindex
   @ ./broadcast.jl:610
 [8] broadcast_kernel
   @ ~/.julia/packages/GPUArrays/t0LfC/src/host/broadcast.jl:59
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
...

You should be using the 3-term recurrence, not special functions like binomial and factorial. This is what ClassicalOrthogonalPolynomials.jl does:

julia> A = rand(100);

julia> @btime _genlaguerre.(10, 3, A);
  8.339 μs (3 allocations: 960 bytes)

julia> @btime Laguerre()[A, 1:11] # Compute [L_0.(A) … L_10.(A)]
  3.369 μs (1 allocation: 8.75 KiB)

The extra allocation is because it is building all the OPs for n = 0:10. There is actually a fast way to get just the 10th OP without any extra allocations, but I’m not sure how well it would play with CUDA, so you might be better off just writing your own implementation of the 3-term recurrence for now.
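For reference, the benchmark above presumably assumes a setup roughly like this (the exact session is not shown in the thread):

using ClassicalOrthogonalPolynomials, BenchmarkTools
A = rand(100)
Laguerre()[A, 1:11]   # should give a 100×11 matrix whose columns are L_0.(A) … L_10.(A)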

You could make it a little faster by only computing the (multiplicative) ratio between consecutive terms:

function laguerre(n::Int, α::Int, x::T) where {T<:BlasFloat}
    t = binomial(n+α,n)
    L = t
    for k = 1:n
        t *= (-x) * (n-k+1) /  (k * (k+α))
        L += t
    end
    return L
end

Assuming I derived them correctly. This may even be easier on the GPU compiler? I don’t really know about that.
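For the record, writing the k-th term of the finite sum as t_k, the ratio between consecutive terms does work out to the factor used in the loop above:

t_k = (-1)^k \binom{n+\alpha}{n-k} \frac{x^k}{k!},
\qquad
\frac{t_k}{t_{k-1}} = -\frac{x}{k}\cdot\frac{\binom{n+\alpha}{n-k}}{\binom{n+\alpha}{n-k+1}} = \frac{-x\,(n-k+1)}{k\,(k+\alpha)}.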

By the way, why is α::Int in your code? Aren’t the integer-α cases equivalent to α = 0 and a different n (and thus non-generalized Laguerre polynomials, for which better recurrences are available)?

I got rid of the allocations, so I’m about to tag a release where this is the speed:

julia> @btime Laguerre()[A,10];
  2.675 μs (1 allocation: 896 bytes)

It is for sure faster than the previous implementation on the CPU. But it still fails on the GPU.

When alpha is an integer, yes, it is the same as the non-generalized case. But in general it can be a real number. In my implementation I only need integer values, so I can actually use the non-generalized version.

Thanks, the performance is similar to that of the implementation suggested by @lrnv, with slightly more allocations.

However, for now, all of these implementations fail to run on the GPU.

So maybe you should concentrate on the non-generalized case and use only factorials (which you could pre-compute and store somewhere) instead of binomial coefficients (which are potentially one of the reasons it fails on the GPU?).
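A minimal sketch of that idea, assuming small n (Int64 factorials overflow beyond 20) and the non-generalized case; the table is built once on the host, so the kernel itself never calls factorial or binomial:

# k! for k = 0:20, stored as Float32 so it can be embedded in the GPU kernel as a constant
const FACTS = ntuple(k -> Float32(factorial(k - 1)), 21)

function laguerre_table(n::Int, x::T) where {T<:AbstractFloat}
    L = zero(T)
    for k = 0:n
        # binomial(n, k) reconstructed from the precomputed factorials
        c = T(FACTS[n + 1] / (FACTS[k + 1] * FACTS[n - k + 1]))
        L += (-one(T))^k * c * x^k / T(FACTS[k + 1])
    end
    return L
end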

But you can make your own implementation in 3 lines that can easily be used on a GPU:

function laguerre(n, x)
    p0, p1 = 0.0, 1.0
    for k = 0:n-1
        p1, p0 = ((2k+1)/(k+1) - x/(k+1))*p1 - k/(k+1)*p0, p1
    end
    p1
end

It has the same performance as ClassicalOrthogonalPolynomials.jl:

julia> @btime laguerre.(10, A);
  2.425 μs (3 allocations: 960 bytes)
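And, in case it helps, a sketch of how this could then run on the GPU (assuming the CUDA setup from earlier in the thread; the Float64 literals above would make the broadcast produce Float64, so this variant uses zero/one of the input element type and keeps the arithmetic in that type):

using CUDA
CUDA.allowscalar(false)

function laguerre_gpu(n, x::T) where {T<:AbstractFloat}
    p0, p1 = zero(T), one(T)
    for k = 0:n-1
        # (k+1) L_{k+1} = (2k+1-x) L_k - k L_{k-1}
        p1, p0 = ((2k + 1 - x) * p1 - k * p0) / (k + 1), p1
    end
    return p1
end

A = CUDA.rand(1000, 1000)   # Float32 by default
laguerre_gpu.(10, A)        # a plain scalar loop, so it should compile to a single kernel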

I have another idea which might be useful if you need to compute all the (non-generalized) Laguerre polynomials L_0, ..., L_n each time: use their generating function and TaylorDiff.jl’s TaylorScalar to get all of their values at once, with as many constants as possible precomputed automatically.

From Wikipedia, the generating function of the Laguerre polynomials is:

G(t) = \sum_{n=0}^\infty t^n L_n^{(\alpha)}(x)= \frac{1}{(1-t)^{\alpha+1}} e^{-tx/(1-t)}.

And thus:

function laguerre(n::Int, alpha::Int, x::T) where {T}
    L = zero(x)
    for k = 0:n
        L += ((-1)^k * binomial(n + alpha, n - k) * x^k) / factorial(k)
    end
    return L
end

using TaylorDiff
G(t,x,α) = exp(-t*x/(1-t))/((1-t)^(α+1))
const n = 10
const facts = factorial.(0:n)
Lₙ(x,α) = G(TaylorScalar{Float64, n+1}(0., 1.),x,α).value ./ facts



x = rand()
α = 2

using BenchmarkTools
@btime laguerre(n, α, x)
@btime Lₙ(x, α)

Which gives:

julia> @btime laguerre(n, α, x)
  349.296 ns (1 allocation: 16 bytes)
44.53956475248684

julia> @btime Lₙ(x, α)
  234.384 ns (1 allocation: 144 bytes)
11-element Vector{Float64}:
  1.0
  2.8894999652795006
  5.564104989954617
  8.925300424661943
 12.880233216342091
 17.341499322835475
 22.226937067531274
 27.45942630179896
 32.96669324538088
 38.681120877334735
 44.53956475248665


@lrnv This is completely unnecessary. Computing orthogonal polynomials is a solved problem. I’m an expert at numerics and orthogonal polynomials: you should only ever be using the 3-term recurrence as in the code I just posted.


I trust you, of course; I am not an expert on the subject at all. The method I just proposed is perhaps more general, and probably, as you say, useless for this particular case. However, I like it for its simplicity 🙂

All orthogonal polynomials have 3-term recurrences, and for classical OPs they are simple rational expressions:

https://dlmf.nist.gov/18.9
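For completeness, a minimal sketch of the same loop for the generalized case, following the DLMF 18.9 recurrence (n+1) L^{(α)}_{n+1}(x) = (2n+1+α-x) L^{(α)}_n(x) - (n+α) L^{(α)}_{n-1}(x); α is passed as a plain number, so non-integer values work too:

function genlaguerre(n, α, x::T) where {T<:AbstractFloat}
    n == 0 && return one(T)
    p0, p1 = one(T), one(T) + T(α) - x   # L_0^{(α)} = 1,  L_1^{(α)} = 1 + α - x
    for k = 1:n-1
        p1, p0 = ((2k + 1 + T(α) - x) * p1 - (k + T(α)) * p0) / (k + 1), p1
    end
    return p1
end

# e.g. genlaguerre.(20, 3, CUDA.rand(1000, 1000)) should then broadcast on the GPU as well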
