I’m currently able to calculate the generalized Laguerre polynomials on the CPU

using HypergeometricFunctions
_genlaguerre(n, α, x) = binomial(n+α,n) * HypergeometricFunctions.M(-n, α+1, x)
A = rand(1000, 1000)
_genlaguerre.(20, 3, A)

The reason I’m using HypergeometricFunctions.jl instead of SpecialPolynomials.jl is that the first is way faster.

Now, I want to do the same on a GPU, but it fails

using CUDA
CUDA.allowscalar(false)
A = CUDA.rand(1000, 1000)
_genlaguerre.(20, 3, A)
GPU broadcast resulted in non-concrete element type AbstractFloat.
This probably means that the function you are broadcasting contains an error or type instability.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] copy
@ ~/.julia/packages/GPUArrays/TnEpb/src/host/broadcast.jl:34 [inlined]
[3] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(_genlaguerre), Tuple{Int64, Int64, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}})
@ Base.Broadcast ./broadcast.jl:873
[4] top-level scope
@ In[19]:1

How can I add a fast and stable support for Laguerre polynomials on GPU?

Try throwing this call to _genlaguerre under @code_typed to see where the type instabilities are; it might be that this package is mixing Float64 and Float32 (Float32 is from your input array).

function gen_laguerre(n::Integer, alpha::Number, x::T)::T where T <: BlasFloat
n == 0 && return 1
n == 1 && return 1 + alpha - x
return ((2*n-1+alpha-x)*gen_laguerre(n-1,alpha,x)-(n-1+alpha)*gen_laguerre(n-2,alpha,x))/n
end

but, as discussed here, it returns illegal memory indexing when using the GPU. Moreover, trying it with che CPU, I noticed that it is very slow compared to the HypergeometricFunctions.jl solution.

The problem seems to be that HypergeometricFunctions.M might call HypergeometricFunctions._₁F₁general which calls gamma on an integer argument, resulting in promotion from Int64 to Float64.

If you pass only Float32 arguments, it will probably work, e.g.

function _genlaguerre(n::Integer, α::Integer, x::T) where {T<:AbstractFloat}
binomial(n+α, n) * HypergeometricFunctions.M(T(-n), T(α+1), x)
end

function _genlaguerre(n::Integer, α::Integer, x::T) where {T<:AbstractFloat}
binomial(n+α, n) * HypergeometricFunctions.M(T(-n), T(α+1), x)
end
A = CUDA.rand(100)
_genlaguerre.(10, 3, A)
GPU broadcast resulted in non-concrete element type Any.
This probably means that the function you are broadcasting contains an error or type instability.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] copy
@ ~/.julia/packages/GPUArrays/TnEpb/src/host/broadcast.jl:34 [inlined]
[3] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(_genlaguerre2), Tuple{Int64, Int64, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}})
@ Base.Broadcast ./broadcast.jl:873
[4] top-level scope
@ In[26]:2

I found another simple method which is much faster than the recursive one, but a little bit slower than HypergeometricFunctions solution. However is fails to compile.

using SpecialFunctions
using LinearAlgebra: BlasFloat
using CUDA
CUDA.allowscalar(false)
function laguerre(n::Int, alpha::Int, x::T) where {T<:BlasFloat}
L = zero(x)
for k = 0:n
L += ((-1)^k * binomial(n + alpha, n - k) * x^k) / factorial(k)
end
return L
end
A = CUDA.rand(100)
laguerre.(5, 1, A)
InvalidIRError: compiling MethodInstance for (::GPUArrays.var"#broadcast_kernel#26")(::CUDA.CuKernelContext, ::CuDeviceVector{Float32, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(laguerre), Tuple{Int64, Int64, Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to print_to_string(xs...) @ Base strings/io.jl:133)
Stacktrace:
[1] string
@ ./strings/io.jl:185
[2] factorial_lookup
@ ./combinatorics.jl:19
[3] factorial
@ ./combinatorics.jl:27
[4] laguerre
@ ./In[13]:6
[5] _broadcast_getindex_evalf
@ ./broadcast.jl:683
[6] _broadcast_getindex
@ ./broadcast.jl:656
[7] getindex
@ ./broadcast.jl:610
[8] broadcast_kernel
@ ~/.julia/packages/GPUArrays/t0LfC/src/host/broadcast.jl:59
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
...

The extra allocations is because it is building the entire OPs for n = 0:10. There actually is a fast way to get just the 10th OP without any extra allocations, but I’m not sure how well it would play with CUDA, so you might be better off just writing your own implementation of the 3-term recurrence for now.

You could make it a little faster by only computing the (multiplicative) difference between the terms:

function laguerre(n::Int, α::Int, x::T) where {T<:BlasFloat}
t = binomial(n+α,n)
L = t
for k = 1:n
t *= (-x) * (n-k+1) / (k * (k+α))
L += t
end
return L
end

Assuming I derived them correctly. This may even be easier on the GPU compiler ? I dont really know about that.

By the way why is α::Int in your code ? Are’nt the integer alpha cases equivalent to α=0 and a different n (and thus non-generalized Laguerre polynomials, for which there are better recurrences availiable) ?

It is for sure faster than the previous implementation on the CPU. But it still fails on the GPU.

When alpha is integer yes, it is the same of the non-generalized case. But in general ita can be a real number. In my implementation I need it integer, so I can actually use the non-generalized version.

So maybe you should concentrate on the non-generalized case, to use only factorials (that you could pre-compute and store somewhere) instead of binomial coefficients (which are potentially one of the reasons it fails on gpu ?)

I have another idea which might be needed if you need to compute all (non-generalized) Laguerre polynomials L_0,...,L_n each time: use their generating function and TaylorDiff.jl’s TaylorScalar implementation to get at once, with a maximum of automatically precompiled constants, all the values of these polynomials.

From wikipedia, the generating function of Laguerre polynomials writes:

function laguerre(n::Int, alpha::Int, x::T) where {T}
L = zero(x)
for k = 0:n
L += ((-1)^k * binomial(n + alpha, n - k) * x^k) / factorial(k)
end
return L
end
using TaylorDiff
G(t,x,α) = exp(-t*x/(1-t))/((1-t)^(α+1))
const n = 10
const facts = factorial.(0:n)
Lₙ(x,α) = G(TaylorScalar{Float64, n+1}(0., 1.),x,α).value ./ facts
x = rand()
α = 2
using BenchmarkTools
@btime laguerre(n, α, x)
@btime Lₙ(x, α)

@lrnv This is completely unnecessary. Computing orthogonal polynomials is a solved problem. I’m an expert at numerics and orthogonal polynomials: you should only ever be using the 3-term recurrence as in the code I just posted.

I trust you of course, I am not an expert of the subject at all. The method i just proposed is perhaps more general, and probably as you say useless for this particular case. However i like it for it’s simplicity