Performance of Horner's method on Vector Polynomial vs Multiplication by Vandermonde Matrix

We all know that Hornerβs method should be the fastest way to evaluate a polynomial. So it holds that should also be true for a monomial vector polynomial, the coefficients which can be stored in a matrix.

Assuming you have n polynomials of degree k (you can also assume n<=(k+1) ), you can store the coefficients of those polynomials in a matrix B which is size (k+1) x n .

You can then use a vandermonde matrix V of size m x (k+1) where m is the number of values you want to calculate along those polynomials. Calculating the matrix V takes m(k-1) multiplications.

If you work it all out calculating V then V x B should take (mn(k+1)+m(k-1)) multiplications. Whereas hornerβs method on n polynomials of degree k at m values should take (nmk) multiplications.

So the ratio of hornerβs method over the matrix multiplication is nk/(nk+n+k-1), which for a degree of 3 is 2/3.

So using LoopVectorization.jl I was hoping that I could make hornerβs method faster than naively calculating V then V x B, but no matter how I recast the calculation of horners method, it comes out as fast or slower than the other method.

Does anyone know how to get that 30% benefit of hornerβs?

Below is the basic implementation that is being used for benchmarking

Code
``````using LoopVectorization
using LinearAlgebra
using BenchmarkTools
using StaticArrays

##
function lvgemm!(C, A, B)
@turbo for m β axes(A,1), n β axes(B,2)
Cmn = zero(eltype(C))
for k β axes(A,2)
Cmn += A[m,k] * B[k,n]
end
C[m,n] = Cmn
end
end
##
function vander_lv!(V::AbstractMatrix, x::AbstractVector, k)
m = length(x)
@turbo for j = 1:m
@inbounds V[j,1] = one(x[j])
@inbounds V[j,2] = x[j]
end
@turbo for i = 3:k+1, j = 1:m
@inbounds V[j,i] = x[j] * V[j,i-1]
end
return V
end

##
function naive_calcV_mulVB!(VB::AbstractMatrix,V::AbstractMatrix,B::AbstractMatrix, p::AbstractVector)
k=size(B,1)-1
vander_lv!(V,p,k)
lvgemm!(VB,V,B)
return VB
end

##
function horners_nocalcV_mulVB!(VB::AbstractMatrix,B::AbstractMatrix, p::AbstractVector)
n_p=length(p)
k=size(B,1)-1
B_row=B[k+1,:]
@turbo for m β axes(B,2), n β axes(p,1)
VB[n,m] = B_row[m]
end
for j = 1:k
B_row=B[k+1-j,:]
@turbo for m β axes(B,2), n β axes(p,1)
# VB[n,m] = VB[n,m]*p[n] + B_row[m]
end
end
return VB
end
``````

Then for the test cases, I have been using

``````num_p = 32
k = 3
p = rand(num_p)
V = zeros(eltype(p),(num_p,k+1))
V_T=V'
VB = zeros(eltype(p),(num_p,k+1))

VB_T=zeros(eltype(p),(k+1,num_p))'

B=MMatrix{k+1,k+1}(rand(k+1,k+1))
B_T=B'
``````

Finally, the results are as follows:

``````julia> @benchmark naive_calcV_mulVB!(VB,V,B,p)
BenchmarkTools.Trial: 10000 samples with 955 evaluations.
Range (min β¦ max):  75.345 ns β¦ 919.735 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     90.490 ns               β GC (median):    0.00%
Time  (mean Β± Ο):   93.788 ns Β±  24.101 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

βββββββββββββββββββ β   β                                    β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
75.3 ns       Histogram: log(frequency) by time       181 ns <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark horners_nocalcV_mulVB!(VB,B,p)
BenchmarkTools.Trial: 10000 samples with 957 evaluations.
Range (min β¦ max):  73.515 ns β¦ 315.667 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     75.901 ns               β GC (median):    0.00%
Time  (mean Β± Ο):   78.286 ns Β±  11.196 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

β β             β   β                                        β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
73.5 ns       Histogram: log(frequency) by time       134 ns <

Memory estimate: 0 bytes, allocs estimate: 0.

``````
1 Like

Mostly likely the matrix-vector multiplication is as fast as Horner here because itβs done with LAPACK, which is very expertly, very carefully tuned for efficiency.

Does replacing

with

``````    B_row=@view B[k+1,:]
``````

and

with

``````        B_row=@view B[k+1-j,:]
``````

help?
As for analysis, with these kinds of algorithms, memory accesses (reads, writes) and allocations are much more pertinent than multiplication.

This all looks like hand-written loops to me, no LAPACK.

Both slices and row-wise access is suboptimal. Though you seem to be using an `MMatrix`, so allocations are still zero, but change the orientation of `B` and slice along columns for some performance improvement. Also, `B` could be an `SMatrix` instead of an `MMatrix`.

But when you start using StaticArrays (`M` or `S`), itβs not exactly clear anymore if any performance gain is attributable to Horner or to `StaticArrays`.

Also, make sure to use `\$` interpolation when benchmarking, that makes a difference for me.

2 Likes

I noticed one thing. In the Horner implementation you have the `@turbo` annotation in a specific place:

``````for j = 1:k  # <- no @turbo here
B_row=B[k+1-j,:]
@turbo for m β axes(B,2)  # <- @turbo here
for n β axes(p,1)
# VB[n,m] = VB[n,m]*p[n] + B_row[m]
end
end
end
``````

If the `@turbo` annotation is moved to the outer loop, I get a 30% speedup (I also need to do some other stuff, as mentioned in my previous post). However, the result of the calculation is wrong.

So you have fewer operations with Horner, but it also introduces an order dependence, so that LoopVectorization cannot re-order the iterations as freely.

Iβm not certain, but that might be a reason the 30% never materializes.

Thank you for the post this is an interesting problem (but must admit havenβt had time to work it out).

As others have mentioned the dependence of Hornerβs method on each previous iteration will not allow this to vectorize well. It is important to note that Hornerβs scheme is optimal in the sense that it minimizes the number of multiplications and additions, but you can beat it using Estrinβs scheme or with a higher order Horner method. I would imagine that you might need to unroll some of these operations as I doubt LoopVectorization.jl (or should) be able to make those optimizations for you. After all, these methods will give slightly different results.

Here, is a second order Horner schemeβ¦

``````@inline function even_odd_poly_add(x, P)
N = length(P)
xx = x*x

out = P[end]
out2 = P[end-1]

for i in N-2:-2:2
end
if iszero(rem(N, 2))
else
end
end
``````

And then benchmarking for different degree polynomials.

``````# N = 5
julia> @btime evalpoly(\$(Ref(1.2))[], \$(ntuple(n -> (-1)^n / n, 5)))
2.113 ns (0 allocations: 0 bytes)
-0.8627199999999999

julia> @btime even_odd_poly_add(\$(Ref(1.2))[], \$(ntuple(n -> (-1)^n / n, 5)))
2.346 ns (0 allocations: 0 bytes)
-0.8627199999999999

# N = 10
julia> @btime evalpoly(\$(Ref(1.2))[], \$(ntuple(n -> (-1)^n / n, 10)))
3.079 ns (0 allocations: 0 bytes)
-0.38845094765714294

julia> @btime even_odd_poly_add(\$(Ref(1.2))[], \$(ntuple(n -> (-1)^n / n, 10)))
2.967 ns (0 allocations: 0 bytes)
-0.38845094765714316

# N = 50
julia> @btime evalpoly(\$(Ref(1.2))[], \$(ntuple(n -> (-1)^n / n, 50)))
23.274 ns (0 allocations: 0 bytes)
81.32088598046435

julia> @btime even_odd_poly_add(\$(Ref(1.2))[], \$(ntuple(n -> (-1)^n / n, 50)))
10.625 ns (0 allocations: 0 bytes)
81.32088598046445

# N = 150
julia> @btime evalpoly(\$(Ref(1.2))[], \$(ntuple(n -> (-1)^n / n, 150)))
89.602 ns (0 allocations: 0 bytes)
2.2769534597705355e9

julia> @btime even_odd_poly_add(\$(Ref(1.2))[], \$(ntuple(n -> (-1)^n / n, 150)))
41.843 ns (0 allocations: 0 bytes)
2.2769534597705407e9
``````

So for low degree polynomials straight Hornerβs (`evalpoly`) will win. But after a certain degree (N>10) the vectorized version will be able to beat it until it maxes out somewhere around 2x faster for large degree. Now of course, you can use more accumulators to reduce this by 4x for large degree polynomials. This is a slightly different problem than what you gave but interesting nonetheless.

2 Likes

Ah! I missed the βCode >β button and thus didnβt see the source code. Thanks.

I had tried and noticed this as well. In writing that code originally LoopVectorization.jl documentation made me suspect that would happen, as the ordering is important. The funny thing is that it does work in the `vander_lv!` method. Though I would ask @elrod if this is always going to work? Is there any guarantee that if using @turbo returns a correct answer it will always do so?