Julia analog to R's poly function, orthogonal polynomials by QR decomposition

Hey guys!
I'm trying to replicate some existing R code in Julia for a project, but the R code depends on the poly function from base R's stats package, which, given a collection of points, computes orthogonal polynomials evaluated at those points.

I haven't been able to find an analogue in Julia, so I'm trying to replicate the necessary functionality of the poly function myself. Looking at its source, it uses a QR decomposition to obtain these evaluations. I'm trying to understand what it is doing, but the only similar thing I've found in books is the use of the QR decomposition to compute OLS.

Does anybody know if there is an analogous function in Julia, or know of a reference that could be useful for understanding what poly does under the hood?

2 Likes

I had to search around a bit to understand exactly what kind of orthogonality R’s poly function refers to. It seems to be orthogonality with respect to the following inner product:

\langle p, q \rangle = \sum_{i = 1}^n p(x_i) q(x_i) \, ,

where the x_i are the points provided by the user. Let’s label this family of polynomials p_k(x), where k is the degree of the polynomial.

Since you’re only interested in the values p_k(x_i), rather than the coefficients defining the polynomials, what we’re actually trying to calculate is just a matrix Q defined as

Q = \begin{bmatrix} p_0(x_1) & p_1(x_1) & \cdots & p_m(x_1) \\ \vdots & \vdots & ~ & \vdots \\ p_0(x_n) & p_1(x_n) & \cdots & p_m(x_n) \end{bmatrix}

which should satisfy the following properties:

  1. The columns should be orthogonal (this is the orthogonality of the polynomials)
  2. The column space of Q should be the vector space spanned by polynomials of degree up to m evaluated on the x_i, which is most easily expressed as the column space of the Vandermonde matrix V:
V = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ \vdots & \vdots & \vdots & ~ & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{bmatrix}
  3. The first k columns of Q should have the same span as the first k columns of V (this ensures that p_k(x) is a polynomial of degree k)

These are exactly the properties satisfied by the factor Q in the “thin” QR decomposition of the Vandermonde matrix, V = Q R. Thus, what we need to do is a) form the Vandermonde matrix, and b) compute its QR decomposition. Each column of the resulting Q contains the values of one of the polynomials p_k(x). Here’s an implementation

using LinearAlgebra

function poly(x, degree)
    V = reduce(hcat, (x.^k for k in 0:degree))  # Vandermonde matrix: columns x.^0, ..., x.^degree
    F = qr(V)
    return F.Q[:, 2:(degree + 1)]  # drop the constant first column, as R's poly does
end

Note one last detail: R’s poly doesn’t return the first column, which is just constant; to mirror this behavior, the implementation above drops the first column of Q and returns columns 2:(degree + 1).

Let’s verify that it works:

julia> poly(1:10, 3)
10×3 Matrix{Float64}:
 -0.495434    0.522233    0.453425
 -0.385337    0.174078   -0.151142
 -0.275241   -0.0870388  -0.377854
 -0.165145   -0.261116   -0.334671
 -0.0550482  -0.348155   -0.12955
  0.0550482  -0.348155    0.12955
  0.165145   -0.261116    0.334671
  0.275241   -0.0870388   0.377854
  0.385337    0.174078    0.151142
  0.495434    0.522233   -0.453425

The first two columns here match this StackOverflow answer demonstrating usage of R’s poly: https://stackoverflow.com/a/19484326, so I’d say this passes the sniff test.
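
As an extra sanity check of property 1, the returned columns should be orthonormal with respect to the discrete inner product, i.e., Q'Q should be the identity up to roundoff:

using LinearAlgebra

Q = poly(1:10, 3)
Q' * Q ≈ I   # expected to be true, up to roundoff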

14 Likes

Does anybody know the math well enough to see whether anything in this package or the other linked packages is related at all? GitHub - jverzani/SpecialPolynomials.jl: Families of polynomials

1 Like

@Benny in short: orthogonal polynomials are orthogonal with respect to some inner product. Such inner products can be defined by integrating the products of the polynomials against a distribution (see, e.g., the Hermite polynomials for a normal distribution and the Laguerre polynomials for an exponential distribution); some such polynomials are implemented in the package you mentioned. The QR-based method above instead enforces orthogonality "in sample", i.e., with respect to the inner product defined above.

Tl;dr: both create orthogonal polynomials, but they differ in whether orthogonality is with respect to an inner product defined in-sample or with respect to a given distribution.

4 Likes

@danielwe that makes total sense. Thanks, it really helps a lot!

This replicated functionality will form part of my undergrad thesis. Do you know of any reference where I can find the relationship between the QR decomposition and these polynomials?

Thanks a lot btw!

I’m not familiar with the relevant literature, unfortunately. My previous post was just pieced together in my attempt to understand the poly docstring. The StackOverflow entry I linked, as well as the Wikipedia page on the Vandermonde matrix, were helpful in the process, but are obviously not the kind of primary sources you want for your thesis.

Anyway, the idea boils down to orthogonalizing the columns of the polynomial regression design matrix (the Vandermonde matrix) because orthogonal (uncorrelated) features give more robust and interpretable coefficients. Maybe one of the references listed on the Wikipedia page on polynomial regression contains what you need?
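
As a small illustration (reusing the poly function from my first answer; Vd and Qd are just helper names for this snippet): with orthonormal columns, the fitted lower-degree coefficients don't change when you add a higher-degree term, whereas raw monomial coefficients do.

using LinearAlgebra

x = range(0, 1, length=50)
y = sin.(2π .* x)

Vd(d) = x .^ (0:d)'                                           # raw monomial design matrix
Qd(d) = hcat(ones(length(x)) ./ sqrt(length(x)), poly(x, d))  # orthonormal design matrix

@show (Vd(2) \ y)[1:3] ≈ (Vd(3) \ y)[1:3]  # false in general: monomial coefficients shift
@show (Qd(2) \ y)[1:3] ≈ (Qd(3) \ y)[1:3]  # true: orthogonal coefficients are stable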

1 Like

Did you look at the references in R’s poly docstring?

Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S. Wadsworth & Brooks/Cole.

Kennedy, W. J. Jr. and Gentle, J. E. (1980) Statistical Computing. Marcel Dekker.

2 Likes

I checked the first, but it only describes the usage of the poly function. I've also been trying to find the second, but I haven't been able to. So far, the most useful thing I've found is section 7.1.2 of the book "Linear Regression Analysis" by George Seber and Alan Lee. It covers the theoretical elements you described at first, but I haven't been able to find the connection to the QR decomposition.

The book "Plane Answers to Complex Questions" by Ronald Christensen talks a bit about the fact that the factorization is obtained using Gram-Schmidt. I guess this last point is very close to what you were mentioning before, but I haven't been able to make the connection/jump.

Update!!

As a complement to the answer given by @danielwe, I just found a bit more information in James E. Gentle's "Matrix Algebra: Theory, Computations, and Applications in Statistics". In particular, section 8.8.2 follows a similar train of thought to the highlighted solution. What could be added is that, if the knots are in the [-1, 1] interval, then the resulting orthogonal vectors correspond to Legendre polynomials; more precisely, they are discrete Legendre polynomials. What I haven't understood is whether all polynomials obtained this way are some form of discrete Legendre polynomials, though.

Here’s a point I find undercommunicated in (numerical) linear algebra: The fundamental factorizations LU and QR are simply implementing the basic procedures you learn to do by hand in your first linear algebra course. LU gives you the row echelon form (the U factor), and QR gives you the orthogonalization of the columns (the Q factor). Internally, they may use more sophisticated algorithms than Gaussian elimination (LU) or Gram-Schmidt orthogonalization (QR), but they compute the same thing.

That’s the final link from Gram-Schmidt orthogonalization to the QR decomposition. The latter is simply how we do orthogonalization on computers.
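
Here's a minimal sketch of that last point (gram_schmidt is just an illustrative helper): orthogonalizing the Vandermonde columns with modified Gram–Schmidt reproduces the thin-QR Q factor up to column signs.

using LinearAlgebra

function gram_schmidt(V)
    Q = float.(copy(V))
    for j in axes(Q, 2)
        for i in 1:j-1
            Q[:, j] .-= Q[:, i] .* dot(Q[:, i], Q[:, j])  # modified Gram–Schmidt step
        end
        Q[:, j] ./= norm(Q[:, j])
    end
    return Q
end

x = collect(1.0:10.0)
V = x .^ (0:3)'             # degree-3 Vandermonde matrix
Qgs = gram_schmidt(V)
Qqr = Matrix(qr(V).Q)       # thin Q factor from the QR decomposition
maximum(abs, abs.(Qgs) - abs.(Qqr))  # should be at roundoff level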

This is only true in the limit of infinitely many uniformly distributed knots, I think. The Legendre polynomials are defined by orthogonality under the following inner product:

\int_{-1}^1 p(x) \, q(x) \, dx \, .

I haven’t heard of discrete Legendre polynomials before, but this reference seems to define them by setting x_i = i - 1 in the discrete inner product from the original answer. Of course, the resulting polynomials can be shifted and scaled to map onto any other interval with uniformly distributed knots, but as soon as you have non-uniform knots, your polynomials are no longer a (discrete) Legendre variant.
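
As a rough numerical check of this (reusing the poly function from the first answer; legendre_samples is just a helper sketched here): with many uniformly spaced knots on [-1, 1], the columns returned by poly should approach samples of the Legendre polynomials, normalized to unit norm.

using LinearAlgebra

# sample P_0, ..., P_degree via the three-term recurrence
# k P_k(x) = (2k - 1) x P_{k-1}(x) - (k - 1) P_{k-2}(x)
function legendre_samples(x, degree)
    P = ones(length(x), degree + 1)
    degree >= 1 && (P[:, 2] .= x)
    for k in 2:degree
        P[:, k+1] .= ((2k - 1) .* x .* P[:, k] .- (k - 1) .* P[:, k-1]) ./ k
    end
    return P
end

x = range(-1, 1, length=1000)
Q = poly(x, 3)                                  # drops the constant column
L = legendre_samples(collect(x), 3)[:, 2:end]   # drop P_0 to match
L ./= [norm(c) for c in eachcol(L)]'            # normalize each column
# per-column relative difference, allowing a sign flip; this should be small
# and shrink as the number of knots grows
[min(norm(Q[:,j] - L[:,j]), norm(Q[:,j] + L[:,j])) / norm(L[:,j]) for j in 1:3]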

Beware that this is numerically unstable if you go to high degree, due to the ill-conditioning of the Vandermonde matrix.

A more accurate algorithm would probably be to apply a Lanczos/Arnoldi process to the operator A = Diagonal(x) starting with the vector of all 1’s.

2 Likes

Good point!

EDIT: This is easy to implement using Krylov.jl, which provides a hermitian_lanczos function:

using LinearAlgebra, Krylov

function poly_krylov(x, degree)
    A = Diagonal(x)
    b = ones(length(x))
    # the first output of hermitian_lanczos is the orthonormal Lanczos basis
    Q = first(hermitian_lanczos(A, b, degree; reorthogonalization=true))
    return Q[:, 2:end]  # drop the constant first column, as R's poly does
end

julia> poly_krylov(1:10, 3)
10×3 Matrix{Float64}:
 -0.495434    0.522233   -0.453425
 -0.385337    0.174078    0.151142
 -0.275241   -0.0870388   0.377854
 -0.165145   -0.261116    0.334671
 -0.0550482  -0.348155    0.12955
  0.0550482  -0.348155   -0.12955
  0.165145   -0.261116   -0.334671
  0.275241   -0.0870388  -0.377854
  0.385337    0.174078   -0.151142
  0.495434    0.522233    0.453425

Note that the last column has the opposite sign of what poly gave above, which is OK as orthogonality is invariant under sign flips.

Original answer

Good point! Here’s a Lanczos implementation using KrylovKit.jl:

using LinearAlgebra, KrylovKit

function poly_krylov(x, degree)
    A = Diagonal(x)
    b = ones(length(x))
    # iterate the Lanczos process until the basis contains degree + 1 vectors
    for (i, krylov_fact) in enumerate(LanczosIterator(A, b))
        if i == (degree + 1)
            # drop the constant first basis vector, as R's poly does
            return reduce(hcat, basis(krylov_fact)[2:end])
        end
    end
end

1 Like

In particular, I’ve attached below an implementation of the poly function using an Arnoldi algorithm, made even more accurate by the “twice is enough” trick of doing modified Gram–Schmidt twice, along with an implementation of the Vandermonde+QR approach. The two algorithms give the same answer (modulo sign choices) up to roundoff, but have very different roundoff growth.

To demonstrate this, I computed the relative error (in Float64 precision) compared to the “exact” result computed in BigFloat arithmetic versus the number of points. The results are quite dramatic: the error of the Vandermonde method blows up rapidly due to the exponentially growing condition number, and is completely wrong by n=30 points, whereas the Arnoldi method remains near machine precision:

Code
using LinearAlgebra

@views function poly_arnoldi(x, degree=length(x)-1)
    m = length(x)
    T = float(eltype(x))
    Q = zeros(T, m, degree+1)
    Q[:,1] .= 1/sqrt(m)
    A = Diagonal(x)
    v = zeros(T, m)
    for n = 1:degree
        mul!(v, A, Q[:,n])
        for j = 1:n
            q = Q[:,j]
            v .-= q .* dot(q, v) # modified Gram–Schmidt
        end
        # run modified Gram-Schmidt twice for improved accuracy
        for j = 1:n
            q = Q[:,j]
            v .-= q .* dot(q, v)
        end
        Q[:,n+1] = normalize!(v)
    end
    return Q
end

poly_vander(x, degree=length(x)-1) = Matrix(qr(x .^ (0:degree)').Q)

# compute errors vs n, comparing to BigFloat
n = 2:30
errs = map(n) do n
    x = rand(n)
    Qexact = Float64.(poly_arnoldi(big.(x)))
    Q1 = poly_arnoldi(x)
    Q2 = poly_vander(x)
    Q2 .= Q2 .* sign.(Q1[1,:]') .* sign.(Q2[1,:]')  # align the column signs of the two results
    norm(Q1 - Qexact) / norm(Qexact), norm(Q2 - Qexact) / norm(Qexact)
end
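
To plot errs, something like this should work (assuming Plots.jl is available):

using Plots

plot(n, first.(errs); yscale=:log10, xlabel="number of points", ylabel="relative error",
     label="Arnoldi", legend=:topleft)
plot!(n, last.(errs); label="Vandermonde QR")
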
8 Likes

Here, you're always computing and orthogonalizing the full square Vandermonde matrix. In the context of polynomial regression, you'll always be interested in a tall matrix (lower polynomial degree than the number of points); otherwise your regression is just interpolation. Does the QR approach fare any better under those circumstances?

Also, you're using randomly sampled x_i; could that further increase the condition number? I imagine that in most regression contexts, the sample points are, if not perfectly uniform, then at least more evenly spread out than i.i.d. pseudorandom numbers. (Keep in mind the x_i here are not the data, they are the points at which the data is sampled, i.e., the "knots" in the regression.)

Nope. The Vandermonde matrix is still exponentially ill-conditioned as you increase the number of columns. That's why it is better to use something like Chebyshev–Vandermonde regression (à la FastChebInterp.jl), which uses orthogonal polynomials rather than monomials, if you want to do regression at high degrees.

Basically, working with polynomials in the monomial basis is almost always a disaster at high degrees, regardless of what you are doing (interpolation, regression, root finding, …).
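
To put rough numbers on this, here's a quick sketch (not the same benchmark as my plots above) comparing the condition numbers of the monomial Vandermonde matrix and a Chebyshev–Vandermonde matrix on the same equispaced points:

using LinearAlgebra

x = range(0, 1, length=1000)
t = 2 .* x .- 1                         # map the points to [-1, 1] for Chebyshev
for degree in (5, 10, 20, 30)
    V = x .^ (0:degree)'                # monomial basis
    C = cos.(acos.(t) .* (0:degree)')   # Chebyshev basis, T_k(t) = cos(k acos t)
    # cond(V) should blow up exponentially with the degree; cond(C) should stay modest
    println("degree = $degree: cond(V) = $(cond(V)), cond(C) = $(cond(C))")
end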

I updated my example functions to let you pass the degree as a parameter.

Equally spaced points are also infamous for polynomials, no better than uniform random points I believe, though neither is particularly bad for regression (when # points \gg degree). For interpolation, when the number of points is one more than the degree, you should ideally use something like Chebyshev nodes, as otherwise you run into a Runge phenomenon.


For example, I ran the same experiment as in my post above, but used “tall” matrices where the number of points was 100 \times \text{degree}, and made the points equally spaced from x=0 to x=1. It made no difference: Vandermonde QR is still terrible. I also included your poly_krylov function based on KrylovKit.jl (edit: also tried your modified implementation using Krylov.jl, with virtually identical results), and it matched the accuracy of my Arnoldi code:

I find it hard to believe that R is really using the unstable Vandermonde+QR algorithm?

1 Like

Looks like it: here is the source.

I guess people only use this for very low degrees? Otherwise, how could they not have noticed by now that the results are so inaccurate? Someone should file a bug report with them: this is an easily fixed problem, there’s no reason to use an unstable algorithm here.

A simple way to see the error is to use degree=25 with x = range(0,1,length=10^4), in which case the results should be very nearly ±Legendre polynomials (just with a shift from [-1,1] to [0,1], and a scale factor). The first few columns of Q look fine, but by the 25th column it is clearly noisy and wrong with the Vandermonde method.

PS. Note that they initialize the first column to x - mean(x), equivalent to projecting orthogonal to a vector of 1’s. We could do the same thing in the code above to skip a column in the computation, but it shouldn’t make any significant difference in the accuracy.
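
A sketch of that check, reusing poly_arnoldi and poly_vander from my earlier post (the colerr name is just for this snippet):

using LinearAlgebra

x = collect(range(0, 1, length=10^4))
Qa = poly_arnoldi(x, 25)
Qv = poly_vander(x, 25)
# per-column disagreement, allowing a sign flip in each column; the first few
# entries should be near machine precision, while the last ones should be O(1)
colerr = [min(norm(Qa[:, k] - Qv[:, k]), norm(Qa[:, k] + Qv[:, k])) for k in 1:26]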

3 Likes

Note that to be a full replacement for R's poly function, you should also return the symmetric tridiagonal matrix from the Lanczos process: this is the "Jacobi matrix" of the orthogonal polynomials, analogous to the coefs output of R's poly.

This defines the coefficients of a three-term recurrence that can then be used to evaluate the polynomials at given points x, and is analogous to calling R’s poly with a coefs input. (I think the API would be cleaner if evaluating the polynomial were a separate function.)
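
Here's a minimal sketch of what that could look like (the names poly_jacobi and eval_poly are just placeholders, and this is not R's exact coefs format): a Lanczos loop on Diagonal(x) that also records the recurrence coefficients as a SymTridiagonal Jacobi matrix, plus a function that uses the three-term recurrence to evaluate the same orthonormal polynomials at new points.

using LinearAlgebra

function poly_jacobi(x, degree)
    m = length(x)
    T = float(eltype(x))
    Q = zeros(T, m, degree + 1)
    Q[:, 1] .= 1 / sqrt(T(m))                 # constant polynomial, normalized in-sample
    α = zeros(T, degree + 1)
    β = zeros(T, degree)
    v = zeros(T, m)
    for k in 1:degree
        v .= x .* Q[:, k]                     # apply A = Diagonal(x)
        α[k] = dot(Q[:, k], v)                # diagonal recurrence coefficient
        for _ in 1:2, j in 1:k                # modified Gram–Schmidt, twice
            v .-= Q[:, j] .* dot(Q[:, j], v)
        end
        β[k] = norm(v)                        # off-diagonal recurrence coefficient
        Q[:, k+1] .= v ./ β[k]
    end
    α[end] = dot(Q[:, end], x .* Q[:, end])
    return Q, SymTridiagonal(α, β)            # polynomial values and Jacobi matrix
end

# evaluate the polynomials at arbitrary points via
#   β_k p_{k+1}(x) = (x - α_k) p_k(x) - β_{k-1} p_{k-1}(x),  with p_0(x) = 1/sqrt(m)
function eval_poly(xnew, J::SymTridiagonal, m)
    α, β = J.dv, J.ev
    degree = length(β)
    P = zeros(float(eltype(xnew)), length(xnew), degree + 1)
    P[:, 1] .= 1 / sqrt(m)
    for k in 1:degree
        P[:, k+1] .= (xnew .- α[k]) .* P[:, k] ./ β[k]
        k > 1 && (P[:, k+1] .-= P[:, k-1] .* (β[k-1] / β[k]))
    end
    return P
end

With Q, J = poly_jacobi(x, degree), calling eval_poly(x, J, length(x)) should reproduce Q up to roundoff, and eval_poly(xnew, J, length(x)) evaluates the same polynomials at new points xnew.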

Not sure if this belongs in some existing package? Maybe a small new package in the JuliaApproximation organization where @dlfivefifty has put a bunch of other orthogonal-polynomial stuff?

Oh, I hadn’t noticed that the function also returns the Jacobi matrix.

Not sure if @Daniel_Johansson needs to exactly replicate R’s API for the thesis project, or if something cleaner that serves the same functionality would work?

@danielwe At the moment, I only need to replicate the outcome of this:
tmp <- cbind(rep(1, n))
tmp <- cbind(tmp, poly(x, degree))
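
In Julia, with the poly from the first answer, I guess the equivalent would be roughly (assuming n == length(x)):

tmp = hcat(ones(n), poly(x, degree))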

I think your first answer, @danielwe, covers it. I just wanted to make sure I understand what I'm programming as well as I can. I doubt I'll need anything else.

@danielwe @stevengj do either of you have a reference (introductory if possible) on this kind of topic (i.e., obtaining orthogonal polynomials given a collection of knots), or on reparametrizing a function basis via the design matrix in general? Or any tips in general?

The former would be very useful. The thing is that, for my thesis, I'm replicating the code from a statistics paper which relies on this sort of reparametrization. For example, a central part of the project relies on obtaining a low-rank approximation of a penalized spline basis (usually with a second-order difference penalty), with the basis projected onto the complement of the nullspace of its associated penalty.

PS: Thank you @stevengj as well for joining the conversation; your comments have also been very insightful.

1 Like

If you're replicating what someone did in R, you may want to use the first answer to match what they did, but if you're also building your own analysis on top of that, I strongly suggest you use something like the Krylov.jl-based code from the answer that's now marked as the solution, or frankly just @stevengj's Arnoldi-process code. As the plots show, the Vandermonde+QR approach is simply wrong except at the lowest polynomial degrees. (Also, the Krylov/Lanczos/Arnoldi approaches use modified Gram-Schmidt orthogonalization, so in that sense they may be more understandable and relatable to what you already know than the internals of the QR decomposition.)