Laguerre polynomials on GPU

This looks really efficient indeed. Thanks for the link, I learned something!

Thanks! It definitely works! It is slightly slower than the other implementation for higher polynomial degrees, but it is still quick. And it works on the GPU.

using LinearAlgebra: BlasFloat  # BlasFloat is defined in LinearAlgebra

function laguerre(n::Int, x::T) where {T <: BlasFloat}
    p0, p1 = 0.0, 1.0
    for k = 0:n-1
        p1, p0 = ((2k+1)/(k+1) - x/(k+1))*p1 - k/(k+1)*p0, p1
    end
    p1
end

function _genlaguerre(n::Int, Ξ±::Int, x::T) where {T<:BlasFloat}
    t = binomial(n+Ξ±,n)
    L = t
    for k = 1:n
        t *= (-x) * (n-k+1) /  (k * (k+Ξ±))
        L += t
    end
    return L
end
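As a quick sanity check (a standalone sketch, not from the thread, repeating the two functions above): both implementations should agree with each other and with the closed form L_2(x) = 1 - 2x + x^2/2.

```julia
using LinearAlgebra: BlasFloat  # BlasFloat is defined in LinearAlgebra

# recurrence-based implementation (as above)
function laguerre(n::Int, x::T) where {T <: BlasFloat}
    p0, p1 = 0.0, 1.0
    for k = 0:n-1
        p1, p0 = ((2k+1)/(k+1) - x/(k+1))*p1 - k/(k+1)*p0, p1
    end
    p1
end

# series-based implementation (as above)
function _genlaguerre(n::Int, Ξ±::Int, x::T) where {T<:BlasFloat}
    t = binomial(n+Ξ±, n)
    L = t
    for k = 1:n
        t *= (-x) * (n-k+1) / (k * (k+Ξ±))
        L += t
    end
    return L
end

x = 2.0
@assert laguerre(2, x) β‰ˆ 1 - 2x + x^2/2          # closed form: L_2(2) = -1
@assert _genlaguerre(2, 0, x) β‰ˆ laguerre(2, x)   # Ξ± = 0 reduces to plain Laguerre
```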

using CUDA, BenchmarkTools

A = rand(500, 500);
B = cu(A);


@benchmark laguerre.(80, $A) seconds=10

BenchmarkTools.Trial: 214 samples with 1 evaluation.
 Range (min … max):  46.647 ms …  49.016 ms  β”Š GC (min … max): 0.00% … 3.79%
 Time  (median):     46.703 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   46.839 ms Β± 414.309 ΞΌs  β”Š GC (mean Β± Οƒ):  0.17% Β± 0.77%

  β–‚β–ˆβ–„β–                                                          
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–β–β–β–β–„β–β–β–β–β–β–ˆβ–‡β–†β–†β–…β–†β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ β–…
  46.6 ms       Histogram: log(frequency) by time      48.6 ms <

 Memory estimate: 1.91 MiB, allocs estimate: 2.



@benchmark _genlaguerre.(80, 0, $A) seconds=10

BenchmarkTools.Trial: 342 samples with 1 evaluation.
 Range (min … max):  28.836 ms …  31.543 ms  β”Š GC (min … max): 0.00% … 5.92%
 Time  (median):     29.149 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   29.253 ms Β± 399.656 ΞΌs  β”Š GC (mean Β± Οƒ):  0.26% Β± 1.19%

         β–†β–ˆβ–…β–           ▁                                       
  β–„β–β–β–β–β–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–…β–β–β–β–β–β–β–β–β–β–†β–ˆβ–†β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–†β–ˆ β–†
  28.8 ms       Histogram: log(frequency) by time        31 ms <

 Memory estimate: 1.91 MiB, allocs estimate: 2.



@benchmark laguerre.(80, $B) seconds=10

BenchmarkTools.Trial: 2167 samples with 5 evaluations.
 Range (min … max):    6.682 ΞΌs …   1.078 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.019 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   922.869 ΞΌs Β± 296.625 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                                              β–ˆ  
  β–ƒβ–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–ˆ β–‚
  6.68 ΞΌs          Histogram: frequency by time         1.02 ms <

 Memory estimate: 1.28 KiB, allocs estimate: 26.

It would be better to extend it to the generalized case as well, but this is what I needed for the moment.

I have only one question. The input array B is Float32, while the output is Float64. I guess it comes from the initialization p0, p1 = 0.0, 1.0. I tried something like p0, p1 = zero(x), one(x), but it doesn’t work:

GPU broadcast resulted in non-concrete element type AbstractFloat.

You also need to make sure the rational expressions have the right types, since dividing Ints returns a Float64.
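A minimal illustration of that point (my sketch, not from the thread): with an Int k, the ratio (2k+1)/(k+1) is already Float64, and multiplying it by a Float32 promotes the whole expression to Float64.

```julia
k = 3
@assert (2k + 1) / (k + 1) isa Float64              # Int / Int gives Float64
@assert ((2k + 1) / (k + 1)) * 0.5f0 isa Float64    # Float64 * Float32 promotes to Float64
```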

I said earlier, together with @lrnv, that when alpha is a positive integer the generalized Laguerre polynomial can be seen as the alpha = 0 case with a different n. However, I don’t remember whether that is actually true.

OK, the following code works:

function laguerre(n::Int, x::T) where {T <: BlasFloat}
    p0, p1 = zero(T), one(T)
    for k = zero(T):n-1
        p1, p0 = ((2k+1)/(k+1) - x/(k+1))*p1 - k/(k+1)*p0, p1
    end
    p1
end

In my implementation I use integer alphas, which I remember can be mapped to the case of zero alpha and a different polynomial degree n. But I don’t remember the exact relation, if any.
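For reference, the relation I believe is meant here (as I recall it, Abramowitz & Stegun 22.5.17; please double-check) connects integer Ξ± to plain Laguerre polynomials of higher degree, but through derivatives rather than a simple relabeling of n:

$$L_n^{(m)}(x) = (-1)^m \frac{d^m}{dx^m} L_{n+m}(x), \qquad m = 0, 1, 2, \ldots$$

So an integer Ξ± does involve the plain Laguerre polynomial of degree n + Ξ±, but via Ξ± derivatives, not just a shifted degree.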


Adding general Ξ± is a very simple modification; see the DLMF link I sent for how to change the rationals.


Thanks! The generalized Laguerre polynomial now also works on the GPU!

function laguerre(n::Int, Ξ±::Number, x::T) where {T <: BlasFloat}
    Ξ± = convert(T, Ξ±)
    p0, p1 = one(T), -x+(Ξ±+1)
    n == 0 && return p0
    for k = one(T):n-1
        p1, p0 = ((2k+Ξ±+1)/(k+1) - x/(k+1))*p1 - (k+Ξ±)/(k+1)*p0, p1
    end
    p1
end
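A quick check of this version (a standalone sketch repeating the function above; the closed forms L_1^Ξ±(x) = Ξ± + 1 - x and L_2^Ξ±(x) = x²/2 - (Ξ±+2)x + (Ξ±+1)(Ξ±+2)/2 are standard): the values match, and the output now has the element type of the input.

```julia
using LinearAlgebra: BlasFloat

# generalized Laguerre via the three-term recurrence (as above)
function laguerre(n::Int, Ξ±::Number, x::T) where {T <: BlasFloat}
    Ξ± = convert(T, Ξ±)
    p0, p1 = one(T), -x + (Ξ± + 1)
    n == 0 && return p0
    for k = one(T):n-1
        p1, p0 = ((2k+Ξ±+1)/(k+1) - x/(k+1))*p1 - (k+Ξ±)/(k+1)*p0, p1
    end
    p1
end

Ξ±, x = 2.0, 0.5
@assert laguerre(1, Ξ±, x) β‰ˆ Ξ± + 1 - x                          # L_1^Ξ±
@assert laguerre(2, Ξ±, x) β‰ˆ x^2/2 - (Ξ±+2)*x + (Ξ±+1)*(Ξ±+2)/2   # L_2^Ξ±
@assert laguerre(10, 2, 0.5f0) isa Float32                     # Float32 in, Float32 out
```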

A = CUDA.rand(1000, 1000)


@benchmark laguerre.(10, 2, $A)

BenchmarkTools.Trial: 4153 samples with 5 evaluations.
 Range (min … max):    6.432 ΞΌs … 54.395 ms  β”Š GC (min … max): 0.00% … 1.86%
 Time  (median):     250.764 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   240.418 ΞΌs Β±  1.456 ms  β”Š GC (mean Β± Οƒ):  0.30% Β± 0.05%

                                                             β–ˆ  
  β–†β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–‚β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–‚β–ˆ β–‚
  6.43 ΞΌs         Histogram: frequency by time          251 ΞΌs <

 Memory estimate: 1.28 KiB, allocs estimate: 26.

There is no need for one(T) here, since all the numerators are of type T. And iterating over a float range can be significantly slower than over an Int range. On the CPU the difference is dramatic, but on the GPU it seems small.

Anyway, change one(T) to 1.

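Applying that suggestion, the loop can iterate over a plain Int range and stay type-stable, since every rational term contains an operand of type T (a standalone sketch of the suggested change):

```julia
using LinearAlgebra: BlasFloat

function laguerre(n::Int, Ξ±::Number, x::T) where {T <: BlasFloat}
    Ξ± = convert(T, Ξ±)
    p0, p1 = one(T), -x + (Ξ± + 1)
    n == 0 && return p0
    for k = 1:n-1  # Int range: cheaper to iterate than one(T):n-1
        p1, p0 = ((2k+Ξ±+1)/(k+1) - x/(k+1))*p1 - (k+Ξ±)/(k+1)*p0, p1
    end
    p1
end

@assert laguerre(10, 2, 0.5f0) isa Float32  # still type-stable for Float32
@assert laguerre(10, 2, 0.5) isa Float64
```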