Allocation-efficient calculation of Hermite (and Laguerre) polynomials

To un-XY this: I’m working on a project that involves a lot of simulation, for which I repeatedly need to compute basis functions of the simulated quantities. I’m writing code that lets me flexibly mix and match various types of basis functions and interactions between them, and I’m trying to do this in a way that minimises memory use and allocations.

Concrete problem:
How can I improve the function `ComputeBasis!` below to reduce the number of allocations? I currently get 12 allocations for the entire function call, whereas doing a similar thing with monomials (x, x^2, x^3, etc.) takes substantially fewer allocations.

using BenchmarkTools;

struct HermiteOperation
    init_domain::Int # What column in X to use to compute Hermite polynomials in
    range::Vector{Int} # What columns in MX to write the Hermite polynomials to
    order::Int # What order of Hermite polynomial to consider
end

function ComputeBasis!(MX::AbstractMatrix{Float64}, X::AbstractMatrix{Float64}, op::HermiteOperation)
    tmp = X[:, op.init_domain]
    MX[:, op.range[1]] .= tmp
    MX[:, op.range[2]] .= muladd.(tmp, tmp, -1)
    @views for i in 3:op.order
        tmp .= MX[:, op.range[i - 1]]
        MX[:, op.range[i]] .= muladd.(MX[:, op.init_domain], tmp , - (i-1) * MX[:, op.range[i - 2]])
    end
    return MX
end
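For context on what the function computes: this is the probabilists' Hermite three-term recurrence, He_n(x) = x·He_{n-1}(x) − (n−1)·He_{n−2}(x). A quick scalar sanity check against the explicit low-order polynomials (names here are purely illustrative):

```julia
# Verify the recurrence used above against explicit probabilists' Hermite
# polynomials of orders 1-4 at a single test point.
explicit(x) = (x, x^2 - 1, x^3 - 3x, x^4 - 6x^2 + 3)

x = 0.7
vals = zeros(4)
vals[1] = x
vals[2] = muladd(x, x, -1)
for i in 3:4
    vals[i] = muladd(x, vals[i - 1], -(i - 1) * vals[i - 2])
end
@assert all(vals .≈ collect(explicit(x)))
```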

struct PolynomialOperation
    init_domain::Int
    range::Vector{Int}
    order::Int
end

function ComputeBasis!(MX::AbstractMatrix{Float64}, X::AbstractMatrix{Float64}, op::PolynomialOperation)
    tmp = X[:, op.init_domain]
    MX[:, op.range[1]] = tmp
    @views for i in 2:op.order
        tmp .*= X[:, op.init_domain]
        MX[:, op.range[i]] .= tmp
    end
    return MX
end

function RunTest()
    X = [-2.;0;;-5;0]
    K = 10
    MX = Matrix{Float64}(undef, size(X, 1), K)
    op = HermiteOperation(1, 1:K, K)
    #op = PolynomialOperation(1, 1:K, K)
    ComputeBasis!(MX, X, op)
end

@benchmark RunTest()

Any suggestions to reduce allocations/memory use when using PolynomialOperation are also very welcome, of course!
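(An aside of my own, not from the thread: for the monomial case the remaining allocation is the `tmp = X[:, op.init_domain]` copy. One option is to let the caller pass in a preallocated buffer; the extra `buf` argument and the function name below are hypothetical, and `op` is any object with `init_domain`, `range`, and `order` fields.)

```julia
# Sketch: same monomial computation, but powers accumulate in a
# caller-supplied buffer, so the function itself allocates nothing.
function ComputeBasisBuf!(MX::AbstractMatrix{Float64}, X::AbstractMatrix{Float64},
                          op, buf::Vector{Float64})
    @views buf .= X[:, op.init_domain]   # overwrite preallocated storage
    MX[:, op.range[1]] .= buf
    @views for i in 2:op.order
        buf .*= X[:, op.init_domain]     # in-place power accumulation
        MX[:, op.range[i]] .= buf
    end
    return MX
end
```

`buf` is allocated once outside the simulation loop and reused on every call.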

Further motivation: I will be writing analogous functions that act on vectors rather than matrices (where vectors correspond in my setting to 1 simulation path), and I need to call ComputeBasis! repeatedly within an optimize call, which I need to run for many simulated paths.
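For the single-path case, a scalar-input sketch (my own, with a hypothetical name) would fill a preallocated vector in place, so repeated calls inside `optimize` allocate nothing:

```julia
# Hypothetical vector analogue: fill a preallocated vector `h` with the first
# `length(h)` probabilists' Hermite polynomials evaluated at a scalar x.
function hermite_basis!(h::AbstractVector{Float64}, x::Float64)
    order = length(h)
    order >= 1 && (h[1] = x)
    order >= 2 && (h[2] = muladd(x, x, -1))
    @inbounds for i in 3:order
        h[i] = muladd(x, h[i - 1], -(i - 1) * h[i - 2])
    end
    return h
end
```

Here `h = Vector{Float64}(undef, K)` is allocated once outside the loop; each `hermite_basis!(h, x)` then just overwrites it.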

Let me just add that I have the same problem for my implementation of Laguerre polynomials, which are given by

abstract type BasisOperation end # shared supertype for all operation structs

struct LaguerreOperation <: BasisOperation
    init_domain::Int
    range::Vector{Int}
    order::Int
    α::Float64
end

function ComputeBasis!(MX::AbstractMatrix{Float64}, X::AbstractMatrix{Float64}, op::LaguerreOperation)
    tmp = ones(size(X, 1))
    @views MX[:, op.range[1]] .= 1 + op.α .- X[:, op.init_domain]
    @views for i in 2:op.order
        MX[:, op.range[i]] .= muladd.(2 * i + op.α - 1 .- X[:, op.init_domain], MX[:, op.range[i - 1]], -(i - 1 + op.α) .* tmp) ./ i # dotted ./ and .* keep the broadcast fused
        if i < op.order
            tmp .= MX[:, op.range[i - 1]]
        end
    end
    return MX
end
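This matches the standard three-term recurrence for generalized Laguerre polynomials, i·L_i^α(x) = (2i−1+α−x)·L_{i−1}^α(x) − (i−1+α)·L_{i−2}^α(x), with L_0^α = 1 and L_1^α = 1+α−x. A scalar check of the i = 2 step against the explicit second-order form (illustrative only):

```julia
# Check one step of the generalized Laguerre recurrence against the explicit
# polynomial L_2^α(x) = x^2/2 - (α+2)x + (α+1)(α+2)/2.
α, x = 0.5, 1.3
L0 = 1.0
L1 = 1 + α - x
L2 = ((2 * 2 + α - 1 - x) * L1 - (1 + α) * L0) / 2   # the i = 2 step above
@assert L2 ≈ x^2 / 2 - (α + 2) * x + (α + 1) * (α + 2) / 2
```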

I think your code is already efficient; we can add `@inbounds` to disable bounds checking, and there is no need for the `tmp` copy inside the for loop:

function ComputeBasis11!(MX::AbstractMatrix{Float64}, X::AbstractMatrix{Float64}, op::HermiteOperation)
    @views begin
        tmp = X[:, op.init_domain]
        MX[:, op.range[1]] .= tmp
        MX[:, op.range[2]] .= muladd.(tmp, tmp, -1)
        for i in 3:op.order
            @inbounds MX[:, op.range[i]] .= muladd.(tmp, MX[:, op.range[i - 1]], -(i - 1) * MX[:, op.range[i - 2]])
        end
    end
    return MX
end

thank you! I’m guessing the same probably applies for my Laguerre polynomial code?

After some tests, we can write the code with plain loops only:

function ComputeBasis_No_Vec!(MX::AbstractMatrix{Float64}, X::AbstractMatrix{Float64}, op::HermiteOperation)
    tmp = X[:, op.init_domain]
    # Note: iterating over every column of MX assumes op.range == 1:size(MX, 2).
    @views for (i, x) in enumerate(eachcol(MX))
        for j in eachindex(x)
            if i == 1
                x[j] = tmp[j]
                continue
            end
            if i == 2
                x[j] = tmp[j] * tmp[j] - 1
                continue
            end
            x[j] = tmp[j] * MX[j, op.range[i - 1]] - (i - 1) * MX[j, op.range[i - 2]]
        end
    end
    return MX
end

which is nearly 2X faster in my benchmarks:

julia> @btime RunTest()
  2.699 μs (17 allocations: 1.33 KiB)

julia> @btime RunTest_No_Vec()
  1.287 μs (10 allocations: 800 bytes)
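One caveat (my own note, not from the thread): these counts also include the allocations `RunTest` itself performs when it builds `X`, `MX`, and `op`. To count only what the kernel allocates, warm up once and measure a call with preallocated arguments, e.g. via `@allocated` or BenchmarkTools' `$`-interpolation. A generic illustration of the pattern, using a toy `square!` kernel of my own:

```julia
# Generic pattern: measure a warmed-up, in-place kernel with preallocated
# arguments so setup allocations don't pollute the count. The same idea
# applies to ComputeBasis! via `@btime ComputeBasis!($MX, $X, $op)`.
square!(y, x) = (y .= x .* x; y)

function kernel_allocs()
    x = rand(100)
    y = similar(x)
    square!(y, x)             # warm up (forces compilation)
    @allocated square!(y, x)  # fused in-place broadcast: no allocations
end
kernel_allocs()
```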

That’s amazing, thank you! (I’ve added an `@views` at the line where `tmp` is defined, to cut out another allocation.) I ended up using the following, as it let me consistently generalise the method to other choices of basis functions:

function ComputeBasisInner!(MX::AbstractMatrix{Float64}, X::AbstractMatrix{Float64}, op::HermiteOperation)
    @views tmp = X[:, op.init_domain]
    @views for (k, i) in enumerate(op.range)
        for j in eachindex(tmp)
            if k == 1
                @inbounds MX[j, i] = tmp[j]
                continue
            end
            if k == 2 # branch on the position k within op.range, not the column index i
                @inbounds MX[j, i] = muladd(tmp[j], tmp[j], -1)
                continue
            end
            # The recurrence coefficient depends on the order k, not the column i.
            @inbounds MX[j, i] = muladd(tmp[j], MX[j, op.range[k - 1]], -(k - 1) * MX[j, op.range[k - 2]])
        end
    end
    return MX
end