# Laguerre polynomials on GPU

Hello,

I’m currently able to calculate the generalized Laguerre polynomials on the CPU

using HypergeometricFunctions
_genlaguerre(n, α, x) = binomial(n+α,n) * HypergeometricFunctions.M(-n, α+1, x)

A = rand(1000, 1000)
_genlaguerre.(20, 3, A)


The reason I’m using HypergeometricFunctions.jl instead of SpecialPolynomials.jl is that the first is way faster.

Now, I want to do the same on a GPU, but it fails

using CUDA
CUDA.allowscalar(false)
A = CUDA.rand(1000, 1000)

_genlaguerre.(20, 3, A)

GPU broadcast resulted in non-concrete element type AbstractFloat.
This probably means that the function you are broadcasting contains an error or type instability.

Stacktrace:
 error(s::String)
@ Base ./error.jl:35
 copy
 top-level scope
@ In:1


How can I add a fast and stable support for Laguerre polynomials on GPU?

Try throwing this call to _genlaguerre under @code_typed to see where the type instabilities are; it might be that this package is mixing Float64 and Float32 (Float32 is from your input array).

I believe this package depends on compiled binaries accessed via ccall… I don’t know if that can work on the GPU

As a quick workaround, maybe you can implement the recurrence relation in just a few lines of code, which should be very fast

I tried to use the following recursive function

function gen_laguerre(n::Integer, alpha::Number, x::T)::T where T <: BlasFloat
n == 0 && return 1
n == 1 && return 1 + alpha - x
return ((2*n-1+alpha-x)*gen_laguerre(n-1,alpha,x)-(n-1+alpha)*gen_laguerre(n-2,alpha,x))/n
end


but, as discussed here, it returns illegal memory indexing when using the GPU. Moreover, trying it with che CPU, I noticed that it is very slow compared to the HypergeometricFunctions.jl solution.

The problem seems to be that HypergeometricFunctions.M might call HypergeometricFunctions._₁F₁general which calls gamma on an integer argument, resulting in promotion from Int64 to Float64.

If you pass only Float32 arguments, it will probably work, e.g.

function _genlaguerre(n::Integer, α::Integer, x::T) where {T<:AbstractFloat}
binomial(n+α, n) * HypergeometricFunctions.M(T(-n), T(α+1), x)
end


Still the same error

function _genlaguerre(n::Integer, α::Integer, x::T) where {T<:AbstractFloat}
binomial(n+α, n) * HypergeometricFunctions.M(T(-n), T(α+1), x)
end

A = CUDA.rand(100)
_genlaguerre.(10, 3, A)

GPU broadcast resulted in non-concrete element type Any.
This probably means that the function you are broadcasting contains an error or type instability.

Stacktrace:
 error(s::String)
@ Base ./error.jl:35
 copy
 top-level scope
@ In:2


I got a different error, but yes, it still does not work.

I found another simple method which is much faster than the recursive one, but a little bit slower than HypergeometricFunctions solution. However is fails to compile.

using SpecialFunctions
using LinearAlgebra: BlasFloat
using CUDA
CUDA.allowscalar(false)

function laguerre(n::Int, alpha::Int, x::T) where {T<:BlasFloat}
L = zero(x)
for k = 0:n
L += ((-1)^k * binomial(n + alpha, n - k) * x^k) / factorial(k)
end
return L
end

A = CUDA.rand(100)

laguerre.(5, 1, A)

Reason: unsupported dynamic function invocation (call to print_to_string(xs...) @ Base strings/io.jl:133)
Stacktrace:
 string
@ ./strings/io.jl:185
 factorial_lookup
@ ./combinatorics.jl:19
 factorial
@ ./combinatorics.jl:27
 laguerre
@ ./In:6
 getindex
Hint: catch this exception as err and call code_typed(err; interactive = true) to introspect the erronous code with Cthulhu.jl
...


You should be using the 3-term recurrence, not special functions like binomial and factorial. This is what ClassicalOrthogonalPolynomials.jl does:

julia> A = rand(100);

julia> @btime _genlaguerre.(10, 3, A);
8.339 μs (3 allocations: 960 bytes)

julia> @btime Laguerre()[A, 1:11] # Compute [L_0.(A) … L_10.(A)]
3.369 μs (1 allocation: 8.75 KiB)


The extra allocations is because it is building the entire OPs for n = 0:10. There actually is a fast way to get just the 10th OP without any extra allocations, but I’m not sure how well it would play with CUDA, so you might be better off just writing your own implementation of the 3-term recurrence for now.

You could make it a little faster by only computing the (multiplicative) difference between the terms:

function laguerre(n::Int, α::Int, x::T) where {T<:BlasFloat}
t = binomial(n+α,n)
L = t
for k = 1:n
t *= (-x) * (n-k+1) /  (k * (k+α))
L += t
end
return L
end


Assuming I derived them correctly. This may even be easier on the GPU compiler ? I dont really know about that.

By the way why is α::Int in your code ? Are’nt the integer alpha cases equivalent to α=0 and a different n (and thus non-generalized Laguerre polynomials, for which there are better recurrences availiable) ?

Got rid of the allocations so about to tag a release where this is the speed:

julia> @btime Laguerre()[A,10];
2.675 μs (1 allocation: 896 bytes)


It is for sure faster than the previous implementation on the CPU. But it still fails on the GPU.

When alpha is integer yes, it is the same of the non-generalized case. But in general ita can be a real number. In my implementation I need it integer, so I can actually use the non-generalized version.

Thanks, the performances are similar to those of the implementation suggested by @lrnv, with a little bit of more allocations.

However, for now, all of this implementations fails to run on the GPU.

So maybe you should concentrate on the non-generalized case, to use only factorials (that you could pre-compute and store somewhere) instead of binomial coefficients (which are potentially one of the reasons it fails on gpu ?)

But you can make your own implementation in 3 lines that can easily be used on a GPU:

function laguerre(n, x)
p0,p1 = 0.0,1.0
for k = 0:n-1
p1,p0 = ((2k+1)/(k+1) - x/(k+1))*p1 - k/(k+1)*p0,p1
end
p1
end


It has the same performance as ClassicalOrthogonalPolynomials.jl:

julia> @btime laguerre.(10, A);
2.425 μs (3 allocations: 960 bytes)

1 Like

I have another idea which might be needed if you need to compute all (non-generalized) Laguerre polynomials L_0,...,L_n each time: use their generating function and TaylorDiff.jl’s TaylorScalar implementation to get at once, with a maximum of automatically precompiled constants, all the values of these polynomials.

From wikipedia, the generating function of Laguerre polynomials writes:

G(t) = \sum_{n=0}^\infty t^n L_n^{(\alpha)}(x)= \frac{1}{(1-t)^{\alpha+1}} e^{-tx/(1-t)}.

And thus:

function laguerre(n::Int, alpha::Int, x::T) where {T}
L = zero(x)
for k = 0:n
L += ((-1)^k * binomial(n + alpha, n - k) * x^k) / factorial(k)
end
return L
end

using TaylorDiff
G(t,x,α) = exp(-t*x/(1-t))/((1-t)^(α+1))
const n = 10
const facts = factorial.(0:n)
Lₙ(x,α) = G(TaylorScalar{Float64, n+1}(0., 1.),x,α).value ./ facts

x = rand()
α = 2

using BenchmarkTools
@btime laguerre(n, α, x)
@btime Lₙ(x, α)


Which gives:

julia> @btime laguerre(n, α, x)
349.296 ns (1 allocation: 16 bytes)
44.53956475248684

julia> @btime Lₙ(x, α)
234.384 ns (1 allocation: 144 bytes)
11-element Vector{Float64}:
1.0
2.8894999652795006
5.564104989954617
8.925300424661943
12.880233216342091
17.341499322835475
22.226937067531274
27.45942630179896
32.96669324538088
38.681120877334735
44.53956475248665

julia>


@lrnv This is completely unnecessary. Computing orthogonal polynomials is a solved problem. I’m an expert at numerics and orthogonal polynomials: you should only ever be using the 3-term recurrence as in the code I just posted.

3 Likes

I trust you of course, I am not an expert of the subject at all. The method i just proposed is perhaps more general, and probably as you say useless for this particular case. However i like it for it’s simplicity All orthogonal polynomials have 3-term recurrences and for classical OPs they are simple raitonal expressions:

https://dlmf.nist.gov/18.9

1 Like