We all know that Horner's method should be the fastest way to evaluate a polynomial. So the same should also be true for a monomial-basis vector polynomial, the coefficients of which can be stored in a matrix.
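
For context, Horner's method on a single degree-k polynomial folds the coefficients from the highest degree down, using k multiplications and k additions. A minimal scalar sketch (`horner` is my own helper name; coefficients stored in ascending order):

```julia
# Evaluate c[1] + c[2]*x + ... + c[end]*x^(length(c)-1) via Horner's method.
function horner(c::AbstractVector, x)
    acc = c[end]
    for i in length(c)-1:-1:1
        acc = muladd(acc, x, c[i])  # one multiply + one add per coefficient
    end
    return acc
end

horner([1.0, 2.0, 3.0], 2.0)  # 1 + 2*2 + 3*4 = 17.0
```

(Base's `evalpoly` implements the same recurrence.)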

Assuming you have n polynomials of degree k (you can also assume n <= k+1), you can store the coefficients of those polynomials in a matrix B of size (k+1) x n.

You can then use a Vandermonde matrix V of size m x (k+1), where m is the number of points at which you want to evaluate those polynomials. Calculating the matrix V takes m(k-1) multiplications.
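
For reference, a plain (non-`@turbo`) sketch of that construction, where column 2 is a straight copy of x and only columns 3 through k+1 cost multiplications (`vander` is my own name):

```julia
# Build an m x (k+1) Vandermonde matrix column by column:
# V[:, 1] = 1, V[:, 2] = x, V[:, j] = x .* V[:, j-1] for j >= 3.
function vander(x::AbstractVector, k::Integer)
    m = length(x)
    V = ones(eltype(x), m, k + 1)
    k >= 1 && (V[:, 2] .= x)             # a copy, no multiplications
    for j in 3:k+1                       # m*(k-1) multiplications in total
        @views V[:, j] .= x .* V[:, j-1]
    end
    return V
end

vander([1.0, 2.0, 3.0], 2)  # rows are [1  x  x^2]
```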

If you work it all out, calculating V and then V x B takes mn(k+1) + m(k-1) multiplications, whereas Horner's method on n polynomials of degree k at m points takes nmk multiplications.

So the ratio of Horner's method to the matrix multiplication is nmk / (mn(k+1) + m(k-1)) = nk/(nk+n+k-1); with n = k+1 polynomials this simplifies to (k+1)/(k+3), which for degree k = 3 is 2/3.
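
To make the arithmetic concrete, a quick tally of both counts for the benchmark sizes below (helper names are my own):

```julia
# Multiplication counts for evaluating n degree-k polynomials at m points.
horner_mults(m, n, k) = n * m * k                      # Horner's method
matmul_mults(m, n, k) = m * n * (k + 1) + m * (k - 1)  # build V, then V*B

m, k = 32, 3
n = k + 1
horner_mults(m, n, k) / matmul_mults(m, n, k)  # 384 / 576 = 2/3
```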

So using LoopVectorization.jl I was hoping that I could make Horner's method faster than naively calculating V and then V x B, but no matter how I recast the calculation of Horner's method, it comes out as fast as or slower than the other approach.

Does anyone know how to realize that roughly 30% benefit of Horner's method?

Below is the basic implementation used for benchmarking:

## Code

```julia
using LoopVectorization
using LinearAlgebra
using BenchmarkTools
using StaticArrays
##
function lvgemm!(C, A, B)
    @turbo for m ∈ axes(A, 1), n ∈ axes(B, 2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
end
##
function vander_lv!(V::AbstractMatrix, x::AbstractVector, k)
    m = length(x)
    @turbo for j = 1:m
        @inbounds V[j, 1] = one(x[j])
        @inbounds V[j, 2] = x[j]
    end
    @turbo for i = 3:k+1, j = 1:m
        @inbounds V[j, i] = x[j] * V[j, i-1]
    end
    return V
end
##
function naive_calcV_mulVB!(VB::AbstractMatrix, V::AbstractMatrix, B::AbstractMatrix, p::AbstractVector)
    k = size(B, 1) - 1
    vander_lv!(V, p, k)
    lvgemm!(VB, V, B)
    return VB
end
##
function horners_nocalcV_mulVB!(VB::AbstractMatrix, B::AbstractMatrix, p::AbstractVector)
    n_p = length(p)
    k = size(B, 1) - 1
    B_row = B[k+1, :]
    @turbo for m ∈ axes(B, 2), n ∈ axes(p, 1)
        VB[n, m] = B_row[m]
    end
    for j = 1:k
        B_row = B[k+1-j, :]
        @turbo for m ∈ axes(B, 2), n ∈ axes(p, 1)
            # VB[n,m] = VB[n,m] * p[n] + B_row[m]
            VB[n, m] = muladd(VB[n, m], p[n], B_row[m])
        end
    end
    return VB
end
```

Then for the test cases, I have been using

```julia
num_p = 32
k = 3
p = rand(num_p)
V = zeros(eltype(p),(num_p,k+1))
V_T=V'
VB = zeros(eltype(p),(num_p,k+1))
VB_T=zeros(eltype(p),(k+1,num_p))'
B=MMatrix{k+1,k+1}(rand(k+1,k+1))
B_T=B'
```
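
As a sanity check (mine, not part of the benchmark), here is a plain-Julia version of both approaches, with no `@turbo` and a plain `Matrix` for `B`; they should agree up to floating-point rounding:

```julia
# Plain-Julia cross-check of the two approaches on the benchmark sizes.
num_p, k = 32, 3
p = rand(num_p)
B = rand(k + 1, k + 1)       # n = k+1 polynomials, one per column of B

V = p .^ (0:k)'              # num_p x (k+1) Vandermonde matrix
VB_naive = V * B             # naive: build V, then V*B

# Horner: start from the leading-coefficient row and fold down.
VB_horner = ones(num_p) .* B[k+1, :]'
for j in k:-1:1
    VB_horner .= VB_horner .* p .+ B[j, :]'
end

VB_naive ≈ VB_horner         # true up to rounding
```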

Finally, the results are as follows:

```
julia> @benchmark naive_calcV_mulVB!(VB,V,B,p)
BenchmarkTools.Trial: 10000 samples with 955 evaluations.
 Range (min … max):  75.345 ns … 919.735 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     90.490 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.788 ns ±  24.101 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  75.3 ns          Histogram: log(frequency) by time          181 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark horners_nocalcV_mulVB!(VB,B,p)
BenchmarkTools.Trial: 10000 samples with 957 evaluations.
 Range (min … max):  73.515 ns … 315.667 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     75.901 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   78.286 ns ±  11.196 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  73.5 ns          Histogram: log(frequency) by time          134 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```