Performance of Horner's method on Vector Polynomial vs Multiplication by Vandermonde Matrix

We all know that Horner’s method should be the fastest way to evaluate a polynomial. So it holds that should also be true for a monomial vector polynomial, the coefficients which can be stored in a matrix.

Assuming you have n polynomials of degree k (you can also assume n<=(k+1) ), you can store the coefficients of those polynomials in a matrix B which is size (k+1) x n .

You can then use a vandermonde matrix V of size m x (k+1) where m is the number of values you want to calculate along those polynomials. Calculating the matrix V takes m(k-1) multiplications.

If you work it all out calculating V then V x B should take (mn(k+1)+m(k-1)) multiplications. Whereas horner’s method on n polynomials of degree k at m values should take (nmk) multiplications.

So the ratio of horner’s method over the matrix multiplication is nk/(nk+n+k-1), which for a degree of 3 is 2/3.

So using LoopVectorization.jl I was hoping that I could make horner’s method faster than naively calculating V then V x B, but no matter how I recast the calculation of horners method, it comes out as fast or slower than the other method.

Does anyone know how to get that 30% benefit of horner’s?

Below is the basic implementation that is being used for benchmarking

using LoopVectorization
using LinearAlgebra
using BenchmarkTools
using StaticArrays

function lvgemm!(C, A, B)
    @turbo for m ∈ axes(A,1), n ∈ axes(B,2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A,2)
            Cmn += A[m,k] * B[k,n]
        C[m,n] = Cmn
function vander_lv!(V::AbstractMatrix, x::AbstractVector, k)
    m = length(x)
    @turbo for j = 1:m
        @inbounds V[j,1] = one(x[j])
        @inbounds V[j,2] = x[j]
    @turbo for i = 3:k+1, j = 1:m
        @inbounds V[j,i] = x[j] * V[j,i-1]
    return V

function naive_calcV_mulVB!(VB::AbstractMatrix,V::AbstractMatrix,B::AbstractMatrix, p::AbstractVector)
    return VB

function horners_nocalcV_mulVB!(VB::AbstractMatrix,B::AbstractMatrix, p::AbstractVector)
    @turbo for m ∈ axes(B,2), n ∈ axes(p,1)
        VB[n,m] = B_row[m]
    for j = 1:k
        @turbo for m ∈ axes(B,2), n ∈ axes(p,1)
            # VB[n,m] = VB[n,m]*p[n] + B_row[m]
            VB[n,m] = muladd(VB[n,m],p[n], B_row[m])
    return VB

Then for the test cases, I have been using

num_p = 32
k = 3
p = rand(num_p)
V = zeros(eltype(p),(num_p,k+1))
VB = zeros(eltype(p),(num_p,k+1))



Finally, the results are as follows:

julia> @benchmark naive_calcV_mulVB!(VB,V,B,p)
BenchmarkTools.Trial: 10000 samples with 955 evaluations.
 Range (min … max):  75.345 ns … 919.735 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     90.490 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   93.788 ns Β±  24.101 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‡β–…β–‚β–‚β–β–β–‚β–β–ˆβ–†β–†β–ƒβ–‚β–β–β–β–β–β– ▁   ▁                                    β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–‡β–‡β–‡β–‡β–‡β–†β–†β–†β–‡β–†β–…β–…β–†β–‡β–‡β–…β–…β–„β–… β–ˆ
  75.3 ns       Histogram: log(frequency) by time       181 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark horners_nocalcV_mulVB!(VB,B,p)
BenchmarkTools.Trial: 10000 samples with 957 evaluations.
 Range (min … max):  73.515 ns … 315.667 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     75.901 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   78.286 ns Β±  11.196 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ β–ˆ             β–„   β–ƒ                                        ▁
  β–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–…β–†β–„β–„β–†β–„β–ƒβ–…β–†β–‡β–ˆβ–‡β–†β–‡β–ˆβ–†β–†β–‡β–…β–„β–ƒβ–ƒβ–„β–„β–†β–ƒβ–ƒβ–ƒβ–ƒβ–„β–„β–†β–…β–„β–ƒβ–„β–†β–†β–†β–…β–„β–„β–‚β–„β–„β–„β–„β–„β–„β–ƒβ–ƒβ–ƒβ–…β–† β–ˆ
  73.5 ns       Histogram: log(frequency) by time       134 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

1 Like

Mostly likely the matrix-vector multiplication is as fast as Horner here because it’s done with LAPACK, which is very expertly, very carefully tuned for efficiency.

Does replacing


    B_row=@view B[k+1,:]



        B_row=@view B[k+1-j,:]

As for analysis, with these kinds of algorithms, memory accesses (reads, writes) and allocations are much more pertinent than multiplication.

This all looks like hand-written loops to me, no LAPACK.

Both slices and row-wise access is suboptimal. Though you seem to be using an MMatrix, so allocations are still zero, but change the orientation of B and slice along columns for some performance improvement. Also, B could be an SMatrix instead of an MMatrix.

But when you start using StaticArrays (M or S), it’s not exactly clear anymore if any performance gain is attributable to Horner or to StaticArrays.

Also, make sure to use $ interpolation when benchmarking, that makes a difference for me.


I noticed one thing. In the Horner implementation you have the @turbo annotation in a specific place:

for j = 1:k  # <- no @turbo here
        @turbo for m ∈ axes(B,2)  # <- @turbo here
            for n ∈ axes(p,1)
                # VB[n,m] = VB[n,m]*p[n] + B_row[m]
                VB[n,m] = muladd(VB[n,m],p[n], B_row[m])

If the @turbo annotation is moved to the outer loop, I get a 30% speedup (I also need to do some other stuff, as mentioned in my previous post). However, the result of the calculation is wrong.

So you have fewer operations with Horner, but it also introduces an order dependence, so that LoopVectorization cannot re-order the iterations as freely.

I’m not certain, but that might be a reason the 30% never materializes.

Thank you for the post this is an interesting problem (but must admit haven’t had time to work it out).

As others have mentioned the dependence of Horner’s method on each previous iteration will not allow this to vectorize well. It is important to note that Horner’s scheme is optimal in the sense that it minimizes the number of multiplications and additions, but you can beat it using Estrin’s scheme or with a higher order Horner method. I would imagine that you might need to unroll some of these operations as I doubt LoopVectorization.jl (or should) be able to make those optimizations for you. After all, these methods will give slightly different results.

Here, is a second order Horner scheme…

@inline function even_odd_poly_add(x, P)
    N = length(P)
    xx = x*x

    out = P[end]
    out2 = P[end-1]

    for i in N-2:-2:2
        out = muladd(xx, out, P[i])
        out2 = muladd(xx, out2, P[i-1])
    if iszero(rem(N, 2)) 
        return muladd(out, x, out2) 
        out = muladd(xx, out, P[1]) 
        return muladd(x, out2, out) 

And then benchmarking for different degree polynomials.

# N = 5
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 5)))
  2.113 ns (0 allocations: 0 bytes)

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 5)))
  2.346 ns (0 allocations: 0 bytes)

# N = 10
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 10)))
  3.079 ns (0 allocations: 0 bytes)

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 10)))
  2.967 ns (0 allocations: 0 bytes)

# N = 50
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 50)))
  23.274 ns (0 allocations: 0 bytes)

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 50)))
  10.625 ns (0 allocations: 0 bytes)

# N = 150
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 150)))
  89.602 ns (0 allocations: 0 bytes)

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 150)))
  41.843 ns (0 allocations: 0 bytes)

So for low degree polynomials straight Horner’s (evalpoly) will win. But after a certain degree (N>10) the vectorized version will be able to beat it until it maxes out somewhere around 2x faster for large degree. Now of course, you can use more accumulators to reduce this by 4x for large degree polynomials. This is a slightly different problem than what you gave but interesting nonetheless.


Ah! I missed the β€œCode >” button and thus didn’t see the source code. Thanks.

I had tried and noticed this as well. In writing that code originally LoopVectorization.jl documentation made me suspect that would happen, as the ordering is important. The funny thing is that it does work in the vander_lv! method. Though I would ask @elrod if this is always going to work? Is there any guarantee that if using @turbo returns a correct answer it will always do so?