Laguerre polynomials on GPU

Hello,

I’m currently able to calculate the generalized Laguerre polynomials on the CPU

using HypergeometricFunctions
_genlaguerre(n, α, x) = binomial(n+α,n) * HypergeometricFunctions.M(-n, α+1, x)

A = rand(1000, 1000)
_genlaguerre.(20, 3, A)

The reason I’m using HypergeometricFunctions.jl instead of SpecialPolynomials.jl is that the former is way faster.

Now, I want to do the same on a GPU, but it fails

using CUDA
CUDA.allowscalar(false)
A = CUDA.rand(1000, 1000)

_genlaguerre.(20, 3, A)


GPU broadcast resulted in non-concrete element type AbstractFloat.
This probably means that the function you are broadcasting contains an error or type instability.

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] copy
   @ ~/.julia/packages/GPUArrays/TnEpb/src/host/broadcast.jl:34 [inlined]
 [3] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(_genlaguerre), Tuple{Int64, Int64, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}})
   @ Base.Broadcast ./broadcast.jl:873
 [4] top-level scope
   @ In[19]:1

How can I add fast and stable support for Laguerre polynomials on the GPU?

Try running this call to _genlaguerre through @code_typed to see where the type instabilities are; it might be that this package mixes Float64 and Float32 (the Float32 coming from your input array).
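For example, something along these lines (a minimal sketch; it assumes that inspecting the scalar call with a Float32 argument, which is what CUDA.rand produces by default, mirrors what the GPU broadcast does):

using InteractiveUtils
# look for Float64 results or non-concrete (red) types in the output
@code_warntype _genlaguerre(20, 3, 1.0f0)
@code_typed _genlaguerre(20, 3, 1.0f0)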

I believe this package depends on compiled binaries accessed via ccall… I don’t know if that can work on the GPU

As a quick workaround, maybe you can implement the recurrence relation in just a few lines of code, which should be very fast

I tried to use the following recursive function

using LinearAlgebra: BlasFloat

function gen_laguerre(n::Integer, alpha::Number, x::T)::T where T <: BlasFloat
    n == 0 && return 1
    n == 1 && return 1 + alpha - x
    return ((2*n-1+alpha-x)*gen_laguerre(n-1,alpha,x) - (n-1+alpha)*gen_laguerre(n-2,alpha,x)) / n
end

but, as discussed here, it triggers an illegal memory access error when run on the GPU. Moreover, trying it on the CPU, I noticed that it is very slow compared to the HypergeometricFunctions.jl solution (the naive recursion recomputes the lower-order polynomials over and over).

The problem seems to be that HypergeometricFunctions.M might call HypergeometricFunctions._₁F₁general which calls gamma on an integer argument, resulting in promotion from Int64 to Float64.

If you pass only Float32 arguments, it will probably work, e.g.

function _genlaguerre(n::Integer, α::Integer, x::T) where {T<:AbstractFloat} 
    binomial(n+α, n) * HypergeometricFunctions.M(T(-n), T(α+1), x)
end

Still the same error

function _genlaguerre(n::Integer, α::Integer, x::T) where {T<:AbstractFloat} 
    binomial(n+α, n) * HypergeometricFunctions.M(T(-n), T(α+1), x)
end

A = CUDA.rand(100)
_genlaguerre.(10, 3, A)

GPU broadcast resulted in non-concrete element type Any.
This probably means that the function you are broadcasting contains an error or type instability.

Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] copy
   @ ~/.julia/packages/GPUArrays/TnEpb/src/host/broadcast.jl:34 [inlined]
 [3] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(_genlaguerre2), Tuple{Int64, Int64, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}})
   @ Base.Broadcast ./broadcast.jl:873
 [4] top-level scope
   @ In[26]:2

I got a different error, but yes, it still does not work.

I found another simple method which is much faster than the recursive one, but a little slower than the HypergeometricFunctions.jl solution. However, it fails to compile on the GPU.

using SpecialFunctions
using LinearAlgebra: BlasFloat
using CUDA
CUDA.allowscalar(false)

function laguerre(n::Int, alpha::Int, x::T) where {T<:BlasFloat}
    L = zero(x)
    for k = 0:n
        L += ((-1)^k * binomial(n + alpha, n - k) * x^k) / factorial(k)
    end
    return L
end

A = CUDA.rand(100)

laguerre.(5, 1, A)

InvalidIRError: compiling MethodInstance for (::GPUArrays.var"#broadcast_kernel#26")(::CUDA.CuKernelContext, ::CuDeviceVector{Float32, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(laguerre), Tuple{Int64, Int64, Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to print_to_string(xs...) @ Base strings/io.jl:133)
Stacktrace:
 [1] string
   @ ./strings/io.jl:185
 [2] factorial_lookup
   @ ./combinatorics.jl:19
 [3] factorial
   @ ./combinatorics.jl:27
 [4] laguerre
   @ ./In[13]:6
 [5] _broadcast_getindex_evalf
   @ ./broadcast.jl:683
 [6] _broadcast_getindex
   @ ./broadcast.jl:656
 [7] getindex
   @ ./broadcast.jl:610
 [8] broadcast_kernel
   @ ~/.julia/packages/GPUArrays/t0LfC/src/host/broadcast.jl:59
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
...

You should be using the 3-term recurrence, not special functions like binomial and factorial. This is what ClassicalOrthogonalPolynomials.jl does:

julia> A = rand(100);

julia> @btime _genlaguerre.(10, 3, A);
  8.339 μs (3 allocations: 960 bytes)

julia> @btime Laguerre()[A, 1:11] # Compute [L_0.(A) … L_10.(A)]
  3.369 μs (1 allocation: 8.75 KiB)

The extra allocation is because it is building all the OPs for n = 0:10. There is actually a fast way to get just the 10th OP without any extra allocations, but I’m not sure how well it would play with CUDA, so you might be better off just writing your own implementation of the 3-term recurrence for now.
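For reference, the benchmark above presumably assumes a setup roughly like this (the exact session is not shown in the thread):

using ClassicalOrthogonalPolynomials, BenchmarkTools
A = rand(100)
Laguerre()[A, 1:11]   # should give a 100×11 matrix whose columns are L_0.(A) … L_10.(A)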

You could make it a little faster by only computing the (multiplicative) ratio between consecutive terms:

function laguerre(n::Int, α::Int, x::T) where {T<:BlasFloat}
    t = binomial(n+α,n)
    L = t
    for k = 1:n
        t *= (-x) * (n-k+1) /  (k * (k+α))
        L += t
    end
    return L
end

Assuming I derived them correctly. This may even be easier on the GPU compiler? I don’t really know about that.
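For the record, writing the k-th term of the finite sum as t_k, the ratio between consecutive terms does work out to the factor used in the loop above:

t_k = (-1)^k \binom{n+\alpha}{n-k} \frac{x^k}{k!},
\qquad
\frac{t_k}{t_{k-1}} = -\frac{x}{k}\cdot\frac{\binom{n+\alpha}{n-k}}{\binom{n+\alpha}{n-k+1}} = \frac{-x\,(n-k+1)}{k\,(k+\alpha)}.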

By the way, why is α::Int in your code? Aren’t the integer-α cases equivalent to α = 0 and a different n (and thus non-generalized Laguerre polynomials, for which better recurrences are available)?

I got rid of the allocations, so I’m about to tag a release where this is the speed:

julia> @btime Laguerre()[A,10];
  2.675 μs (1 allocation: 896 bytes)

It is for sure faster than the previous implementation on the CPU. But it still fails on the GPU.

When alpha is an integer, yes, it is the same as the non-generalized case. But in general it can be a real number. In my implementation I only need integer values, so I can actually use the non-generalized version.

Thanks, the performance is similar to that of the implementation suggested by @lrnv, with slightly more allocations.

However, for now, all of these implementations fail to run on the GPU.

So maybe you should concentrate on the non-generalized case and use only factorials (which you could pre-compute and store somewhere) instead of binomial coefficients (which are potentially one of the reasons it fails on the GPU?).
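A minimal sketch of that idea, assuming small n (Int64 factorials overflow beyond 20) and the non-generalized case; the table is built once on the host, so the kernel itself never calls factorial or binomial:

# k! for k = 0:20, stored as Float32 so it can be embedded in the GPU kernel as a constant
const FACTS = ntuple(k -> Float32(factorial(k - 1)), 21)

function laguerre_table(n::Int, x::T) where {T<:AbstractFloat}
    L = zero(T)
    for k = 0:n
        # binomial(n, k) reconstructed from the precomputed factorials
        c = T(FACTS[n + 1] / (FACTS[k + 1] * FACTS[n - k + 1]))
        L += (-one(T))^k * c * x^k / T(FACTS[k + 1])
    end
    return L
end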

But you can make your own implementation in 3 lines that can easily be used on a GPU:

function laguerre(n, x)
    p0, p1 = 0.0, 1.0
    for k = 0:n-1
        p1, p0 = ((2k+1)/(k+1) - x/(k+1))*p1 - k/(k+1)*p0, p1
    end
    p1
end

It has the same performance as ClassicalOrthogonalPolynomials.jl:

julia> @btime laguerre.(10, A);
  2.425 μs (3 allocations: 960 bytes)
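And, in case it helps, a sketch of how this could then run on the GPU (assuming the CUDA setup from earlier in the thread; the Float64 literals above would make the broadcast produce Float64, so this variant uses zero/one of the input element type and keeps the arithmetic in that type):

using CUDA
CUDA.allowscalar(false)

function laguerre_gpu(n, x::T) where {T<:AbstractFloat}
    p0, p1 = zero(T), one(T)
    for k = 0:n-1
        # (k+1) L_{k+1} = (2k+1-x) L_k - k L_{k-1}
        p1, p0 = ((2k + 1 - x) * p1 - k * p0) / (k + 1), p1
    end
    return p1
end

A = CUDA.rand(1000, 1000)   # Float32 by default
laguerre_gpu.(10, A)        # a plain scalar loop, so it should compile to a single kernel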

I have another idea which might be useful if you need to compute all the (non-generalized) Laguerre polynomials L_0, ..., L_n each time: use their generating function and TaylorDiff.jl’s TaylorScalar to get all of their values at once, with as many constants as possible precomputed automatically.

From Wikipedia, the generating function of the Laguerre polynomials is:

G(t) = \sum_{n=0}^\infty t^n L_n^{(\alpha)}(x)= \frac{1}{(1-t)^{\alpha+1}} e^{-tx/(1-t)}.

And thus:

function laguerre(n::Int, alpha::Int, x::T) where {T}
    L = zero(x)
    for k = 0:n
        L += ((-1)^k * binomial(n + alpha, n - k) * x^k) / factorial(k)
    end
    return L
end

using TaylorDiff
G(t,x,α) = exp(-t*x/(1-t))/((1-t)^(α+1))
const n = 10
const facts = factorial.(0:n)
Lₙ(x,α) = G(TaylorScalar{Float64, n+1}(0., 1.),x,α).value ./ facts



x = rand()
α = 2

using BenchmarkTools
@btime laguerre(n, α, x)
@btime Lₙ(x, α)

Which gives:

julia> @btime laguerre(n, α, x)
  349.296 ns (1 allocation: 16 bytes)
44.53956475248684

julia> @btime Lₙ(x, α)
  234.384 ns (1 allocation: 144 bytes)
11-element Vector{Float64}:
  1.0
  2.8894999652795006
  5.564104989954617
  8.925300424661943
 12.880233216342091
 17.341499322835475
 22.226937067531274
 27.45942630179896
 32.96669324538088
 38.681120877334735
 44.53956475248665


@lrnv This is completely unnecessary. Computing orthogonal polynomials is a solved problem. I’m an expert at numerics and orthogonal polynomials: you should only ever be using the 3-term recurrence as in the code I just posted.


I trust you, of course; I am not an expert on the subject at all. The method I just proposed is perhaps more general, and probably, as you say, useless for this particular case. However, I like it for its simplicity 🙂

All orthogonal polynomials have 3-term recurrences, and for classical OPs they are simple rational expressions:

https://dlmf.nist.gov/18.9
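For completeness, a minimal sketch of the same loop for the generalized case, following the DLMF 18.9 recurrence (n+1) L^{(α)}_{n+1}(x) = (2n+1+α-x) L^{(α)}_n(x) - (n+α) L^{(α)}_{n-1}(x); α is passed as a plain number, so non-integer values work too:

function genlaguerre(n, α, x::T) where {T<:AbstractFloat}
    n == 0 && return one(T)
    p0, p1 = one(T), one(T) + T(α) - x   # L_0^{(α)} = 1,  L_1^{(α)} = 1 + α - x
    for k = 1:n-1
        p1, p0 = ((2k + 1 + T(α) - x) * p1 - (k + T(α)) * p0) / (k + 1), p1
    end
    return p1
end

# e.g. genlaguerre.(20, 3, CUDA.rand(1000, 1000)) should then broadcast on the GPU as well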
