Laguerre polynomials on GPU

This looks really efficient indeed. Thanks for the link, I learned something!

Thanks! It definitely works! It is slightly slower than the other implementation for higher polynomial degrees, but it is still quick. And it works on the GPU.

using LinearAlgebra: BlasFloat  # BlasFloat is defined in LinearAlgebra

function laguerre(n::Int, x::T) where {T <: BlasFloat}
    p0, p1 = 0.0, 1.0
    for k = 0:n-1
        p1, p0 = ((2k+1)/(k+1) - x/(k+1))*p1 - k/(k+1)*p0, p1
    end
    p1
end

function _genlaguerre(n::Int, Ξ±::Int, x::T) where {T<:BlasFloat}
    t = binomial(n+Ξ±,n)
    L = t
    for k = 1:n
        t *= (-x) * (n-k+1) /  (k * (k+Ξ±))
        L += t
    end
    return L
end
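As a quick sanity check (a standalone sketch, not from the thread, repeating the two functions above): both implementations should agree with each other and with the closed form L_2(x) = 1 - 2x + x^2/2.

```julia
using LinearAlgebra: BlasFloat  # BlasFloat is defined in LinearAlgebra

# recurrence-based implementation (as above)
function laguerre(n::Int, x::T) where {T <: BlasFloat}
    p0, p1 = 0.0, 1.0
    for k = 0:n-1
        p1, p0 = ((2k+1)/(k+1) - x/(k+1))*p1 - k/(k+1)*p0, p1
    end
    p1
end

# series-based implementation (as above)
function _genlaguerre(n::Int, Ξ±::Int, x::T) where {T<:BlasFloat}
    t = binomial(n+Ξ±, n)
    L = t
    for k = 1:n
        t *= (-x) * (n-k+1) / (k * (k+Ξ±))
        L += t
    end
    return L
end

x = 2.0
@assert laguerre(2, x) β‰ˆ 1 - 2x + x^2/2          # closed form: L_2(2) = -1
@assert _genlaguerre(2, 0, x) β‰ˆ laguerre(2, x)   # Ξ± = 0 reduces to plain Laguerre
```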

using CUDA, BenchmarkTools

A = rand(500, 500);
B = cu(A);


@benchmark laguerre.(80, $A) seconds=10

BenchmarkTools.Trial: 214 samples with 1 evaluation.
 Range (min … max):  46.647 ms …  49.016 ms  β”Š GC (min … max): 0.00% … 3.79%
 Time  (median):     46.703 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   46.839 ms Β± 414.309 ΞΌs  β”Š GC (mean Β± Οƒ):  0.17% Β± 0.77%

  β–‚β–ˆβ–„β–                                                          
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–β–β–β–β–„β–β–β–β–β–β–ˆβ–‡β–†β–†β–…β–†β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ β–…
  46.6 ms       Histogram: log(frequency) by time      48.6 ms <

 Memory estimate: 1.91 MiB, allocs estimate: 2.



@benchmark _genlaguerre.(80, 0, $A) seconds=10

BenchmarkTools.Trial: 342 samples with 1 evaluation.
 Range (min … max):  28.836 ms …  31.543 ms  β”Š GC (min … max): 0.00% … 5.92%
 Time  (median):     29.149 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   29.253 ms Β± 399.656 ΞΌs  β”Š GC (mean Β± Οƒ):  0.26% Β± 1.19%

         β–†β–ˆβ–…β–           ▁                                       
  β–„β–β–β–β–β–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–…β–β–β–β–β–β–β–β–β–β–†β–ˆβ–†β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–†β–ˆ β–†
  28.8 ms       Histogram: log(frequency) by time        31 ms <

 Memory estimate: 1.91 MiB, allocs estimate: 2.



@benchmark laguerre.(80, $B) seconds=10

BenchmarkTools.Trial: 2167 samples with 5 evaluations.
 Range (min … max):    6.682 ΞΌs …   1.078 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.019 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   922.869 ΞΌs Β± 296.625 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                                              β–ˆ  
  β–ƒβ–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–ˆ β–‚
  6.68 ΞΌs          Histogram: frequency by time         1.02 ms <

 Memory estimate: 1.28 KiB, allocs estimate: 26.

It would be better to extend it to the generalized case as well, but this is what I needed for the moment.

I have only one question. The input array B is Float32, while the output is Float64. I guess it comes from the initialization p0, p1 = 0.0, 1.0. I tried something like p0, p1 = zero(x), one(x), but it doesn’t work:

GPU broadcast resulted in non-concrete element type AbstractFloat.

You also need to make sure the rational expressions have the right types, since dividing Ints returns a Float64.
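A minimal illustration of that point (my sketch, not from the thread): with an Int k, the ratio (2k+1)/(k+1) is already Float64, and multiplying it by a Float32 promotes the whole expression to Float64.

```julia
k = 3
@assert (2k + 1) / (k + 1) isa Float64              # Int / Int gives Float64
@assert ((2k + 1) / (k + 1)) * 0.5f0 isa Float64    # Float64 * Float32 promotes to Float64
```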

I said earlier, together with @lrnv, that when alpha is a positive integer the generalized Laguerre polynomial can be seen as the alpha = 0 case with a different n. However, I don’t remember whether that is actually true.

OK, the following code works:

function laguerre(n::Int, x::T) where {T <: BlasFloat}
    p0, p1 = zero(T), one(T)
    for k = zero(T):n-1
        p1, p0 = ((2k+1)/(k+1) - x/(k+1))*p1 - k/(k+1)*p0, p1
    end
    p1
end

In my implementation I use integer alphas, which I remember can be mapped to the case of zero alpha and a different polynomial degree n. But I don’t remember the exact relation, if any.
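For reference, the relation I believe is meant here (as I recall it, Abramowitz & Stegun 22.5.17; please double-check) connects integer Ξ± to plain Laguerre polynomials of higher degree, but through derivatives rather than a simple relabeling of n:

$$L_n^{(m)}(x) = (-1)^m \frac{d^m}{dx^m} L_{n+m}(x), \qquad m = 0, 1, 2, \ldots$$

So an integer Ξ± does involve the plain Laguerre polynomial of degree n + Ξ±, but via Ξ± derivatives, not just a shifted degree.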


Adding general Ξ± is a very simple modification; see the DLMF link I sent for how to change the rationals.


Thanks! The generalized Laguerre polynomial now also works on the GPU!

function laguerre(n::Int, Ξ±::Number, x::T) where {T <: BlasFloat}
    Ξ± = convert(T, Ξ±)
    p0, p1 = one(T), -x+(Ξ±+1)
    n == 0 && return p0
    for k = one(T):n-1
        p1, p0 = ((2k+Ξ±+1)/(k+1) - x/(k+1))*p1 - (k+Ξ±)/(k+1)*p0, p1
    end
    p1
end
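A quick check of this version (a standalone sketch repeating the function above; the closed forms L_1^Ξ±(x) = Ξ± + 1 - x and L_2^Ξ±(x) = x²/2 - (Ξ±+2)x + (Ξ±+1)(Ξ±+2)/2 are standard): the values match, and the output now has the element type of the input.

```julia
using LinearAlgebra: BlasFloat

# generalized Laguerre via the three-term recurrence (as above)
function laguerre(n::Int, Ξ±::Number, x::T) where {T <: BlasFloat}
    Ξ± = convert(T, Ξ±)
    p0, p1 = one(T), -x + (Ξ± + 1)
    n == 0 && return p0
    for k = one(T):n-1
        p1, p0 = ((2k+Ξ±+1)/(k+1) - x/(k+1))*p1 - (k+Ξ±)/(k+1)*p0, p1
    end
    p1
end

Ξ±, x = 2.0, 0.5
@assert laguerre(1, Ξ±, x) β‰ˆ Ξ± + 1 - x                          # L_1^Ξ±
@assert laguerre(2, Ξ±, x) β‰ˆ x^2/2 - (Ξ±+2)*x + (Ξ±+1)*(Ξ±+2)/2   # L_2^Ξ±
@assert laguerre(10, 2, 0.5f0) isa Float32                     # Float32 in, Float32 out
```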

A = CUDA.rand(1000, 1000)


@benchmark laguerre.(10, 2, $A)

BenchmarkTools.Trial: 4153 samples with 5 evaluations.
 Range (min … max):    6.432 ΞΌs … 54.395 ms  β”Š GC (min … max): 0.00% … 1.86%
 Time  (median):     250.764 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   240.418 ΞΌs Β±  1.456 ms  β”Š GC (mean Β± Οƒ):  0.30% Β± 0.05%

                                                             β–ˆ  
  β–†β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–‚β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–‚β–ˆ β–‚
  6.43 ΞΌs         Histogram: frequency by time          251 ΞΌs <

 Memory estimate: 1.28 KiB, allocs estimate: 26.

There is no need for one(T) here, since all the numerators are of type T. And iterating over a float range can be significantly slower than over an Int range. On the CPU the difference is dramatic, but on the GPU it seems small.

Anyway, change one(T) to 1.

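Applying that suggestion, the loop can iterate over a plain Int range and stay type-stable, since every rational term contains an operand of type T (a standalone sketch of the suggested change):

```julia
using LinearAlgebra: BlasFloat

function laguerre(n::Int, Ξ±::Number, x::T) where {T <: BlasFloat}
    Ξ± = convert(T, Ξ±)
    p0, p1 = one(T), -x + (Ξ± + 1)
    n == 0 && return p0
    for k = 1:n-1  # Int range: cheaper to iterate than one(T):n-1
        p1, p0 = ((2k+Ξ±+1)/(k+1) - x/(k+1))*p1 - (k+Ξ±)/(k+1)*p0, p1
    end
    p1
end

@assert laguerre(10, 2, 0.5f0) isa Float32  # still type-stable for Float32
@assert laguerre(10, 2, 0.5) isa Float64
```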