Performance of Horner's method on Vector Polynomial vs Multiplication by Vandermonde Matrix

We all know that Horner’s method should be the fastest way to evaluate a polynomial. So it holds that should also be true for a monomial vector polynomial, the coefficients which can be stored in a matrix.

Assuming you have n polynomials of degree k (you can also assume n<=(k+1) ), you can store the coefficients of those polynomials in a matrix B which is size (k+1) x n .

You can then use a vandermonde matrix V of size m x (k+1) where m is the number of values you want to calculate along those polynomials. Calculating the matrix V takes m(k-1) multiplications.

If you work it all out calculating V then V x B should take (mn(k+1)+m(k-1)) multiplications. Whereas horner’s method on n polynomials of degree k at m values should take (nmk) multiplications.

So the ratio of horner’s method over the matrix multiplication is nk/(nk+n+k-1), which for a degree of 3 is 2/3.

So using LoopVectorization.jl I was hoping that I could make horner’s method faster than naively calculating V then V x B, but no matter how I recast the calculation of horners method, it comes out as fast or slower than the other method.

Does anyone know how to get that 30% benefit of horner’s?

Below is the basic implementation that is being used for benchmarking

Code
using LoopVectorization
using LinearAlgebra
using BenchmarkTools
using StaticArrays

##
function lvgemm!(C, A, B)
    @turbo for m ∈ axes(A,1), n ∈ axes(B,2)
        Cmn = zero(eltype(C))
        for k ∈ axes(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end
##
function vander_lv!(V::AbstractMatrix, x::AbstractVector, k)
    m = length(x)
    @turbo for j = 1:m
        @inbounds V[j,1] = one(x[j])
        @inbounds V[j,2] = x[j]
    end
    @turbo for i = 3:k+1, j = 1:m
        @inbounds V[j,i] = x[j] * V[j,i-1]
    end
    return V
end

##
function naive_calcV_mulVB!(VB::AbstractMatrix,V::AbstractMatrix,B::AbstractMatrix, p::AbstractVector)
    k=size(B,1)-1
    vander_lv!(V,p,k)
    lvgemm!(VB,V,B)
    return VB
end

##
function horners_nocalcV_mulVB!(VB::AbstractMatrix,B::AbstractMatrix, p::AbstractVector)
    n_p=length(p)
    k=size(B,1)-1
    B_row=B[k+1,:]
    @turbo for m ∈ axes(B,2), n ∈ axes(p,1)
        VB[n,m] = B_row[m]
    end
    for j = 1:k
        B_row=B[k+1-j,:]
        @turbo for m ∈ axes(B,2), n ∈ axes(p,1)
            # VB[n,m] = VB[n,m]*p[n] + B_row[m]
            VB[n,m] = muladd(VB[n,m],p[n], B_row[m])
        end
    end
    return VB
end

Then for the test cases, I have been using

num_p = 32
k = 3
p = rand(num_p)
V = zeros(eltype(p),(num_p,k+1))
V_T=V'
VB = zeros(eltype(p),(num_p,k+1))

VB_T=zeros(eltype(p),(k+1,num_p))'

B=MMatrix{k+1,k+1}(rand(k+1,k+1))
B_T=B'

Finally, the results are as follows:

julia> @benchmark naive_calcV_mulVB!(VB,V,B,p)
BenchmarkTools.Trial: 10000 samples with 955 evaluations.
 Range (min … max):  75.345 ns … 919.735 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     90.490 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   93.788 ns Β±  24.101 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‡β–…β–‚β–‚β–β–β–‚β–β–ˆβ–†β–†β–ƒβ–‚β–β–β–β–β–β– ▁   ▁                                    β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–‡β–‡β–‡β–‡β–‡β–†β–†β–†β–‡β–†β–…β–…β–†β–‡β–‡β–…β–…β–„β–… β–ˆ
  75.3 ns       Histogram: log(frequency) by time       181 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark horners_nocalcV_mulVB!(VB,B,p)
BenchmarkTools.Trial: 10000 samples with 957 evaluations.
 Range (min … max):  73.515 ns … 315.667 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     75.901 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   78.286 ns Β±  11.196 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ β–ˆ             β–„   β–ƒ                                        ▁
  β–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–…β–†β–„β–„β–†β–„β–ƒβ–…β–†β–‡β–ˆβ–‡β–†β–‡β–ˆβ–†β–†β–‡β–…β–„β–ƒβ–ƒβ–„β–„β–†β–ƒβ–ƒβ–ƒβ–ƒβ–„β–„β–†β–…β–„β–ƒβ–„β–†β–†β–†β–…β–„β–„β–‚β–„β–„β–„β–„β–„β–„β–ƒβ–ƒβ–ƒβ–…β–† β–ˆ
  73.5 ns       Histogram: log(frequency) by time       134 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

1 Like

Mostly likely the matrix-vector multiplication is as fast as Horner here because it’s done with LAPACK, which is very expertly, very carefully tuned for efficiency.

Does replacing

with

    B_row=@view B[k+1,:]

and

with

        B_row=@view B[k+1-j,:]

help?
As for analysis, with these kinds of algorithms, memory accesses (reads, writes) and allocations are much more pertinent than multiplication.

This all looks like hand-written loops to me, no LAPACK.

Both slices and row-wise access is suboptimal. Though you seem to be using an MMatrix, so allocations are still zero, but change the orientation of B and slice along columns for some performance improvement. Also, B could be an SMatrix instead of an MMatrix.

But when you start using StaticArrays (M or S), it’s not exactly clear anymore if any performance gain is attributable to Horner or to StaticArrays.

Also, make sure to use $ interpolation when benchmarking, that makes a difference for me.

2 Likes

I noticed one thing. In the Horner implementation you have the @turbo annotation in a specific place:

for j = 1:k  # <- no @turbo here
        B_row=B[k+1-j,:]
        @turbo for m ∈ axes(B,2)  # <- @turbo here
            for n ∈ axes(p,1)
                # VB[n,m] = VB[n,m]*p[n] + B_row[m]
                VB[n,m] = muladd(VB[n,m],p[n], B_row[m])
            end
        end
    end

If the @turbo annotation is moved to the outer loop, I get a 30% speedup (I also need to do some other stuff, as mentioned in my previous post). However, the result of the calculation is wrong.

So you have fewer operations with Horner, but it also introduces an order dependence, so that LoopVectorization cannot re-order the iterations as freely.

I’m not certain, but that might be a reason the 30% never materializes.

Thank you for the post this is an interesting problem (but must admit haven’t had time to work it out).

As others have mentioned the dependence of Horner’s method on each previous iteration will not allow this to vectorize well. It is important to note that Horner’s scheme is optimal in the sense that it minimizes the number of multiplications and additions, but you can beat it using Estrin’s scheme or with a higher order Horner method. I would imagine that you might need to unroll some of these operations as I doubt LoopVectorization.jl (or should) be able to make those optimizations for you. After all, these methods will give slightly different results.

Here, is a second order Horner scheme…

@inline function even_odd_poly_add(x, P)
    N = length(P)
    xx = x*x

    out = P[end]
    out2 = P[end-1]

    for i in N-2:-2:2
        out = muladd(xx, out, P[i])
        out2 = muladd(xx, out2, P[i-1])
    end
    if iszero(rem(N, 2)) 
        return muladd(out, x, out2) 
    else 
        out = muladd(xx, out, P[1]) 
        return muladd(x, out2, out) 
    end
end

And then benchmarking for different degree polynomials.

# N = 5
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 5)))
  2.113 ns (0 allocations: 0 bytes)
-0.8627199999999999

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 5)))
  2.346 ns (0 allocations: 0 bytes)
-0.8627199999999999

# N = 10
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 10)))
  3.079 ns (0 allocations: 0 bytes)
-0.38845094765714294

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 10)))
  2.967 ns (0 allocations: 0 bytes)
-0.38845094765714316

# N = 50
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 50)))
  23.274 ns (0 allocations: 0 bytes)
81.32088598046435

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 50)))
  10.625 ns (0 allocations: 0 bytes)
81.32088598046445

# N = 150
julia> @btime evalpoly($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 150)))
  89.602 ns (0 allocations: 0 bytes)
2.2769534597705355e9

julia> @btime even_odd_poly_add($(Ref(1.2))[], $(ntuple(n -> (-1)^n / n, 150)))
  41.843 ns (0 allocations: 0 bytes)
2.2769534597705407e9

So for low degree polynomials straight Horner’s (evalpoly) will win. But after a certain degree (N>10) the vectorized version will be able to beat it until it maxes out somewhere around 2x faster for large degree. Now of course, you can use more accumulators to reduce this by 4x for large degree polynomials. This is a slightly different problem than what you gave but interesting nonetheless.

2 Likes

Ah! I missed the β€œCode >” button and thus didn’t see the source code. Thanks.

I had tried and noticed this as well. In writing that code originally LoopVectorization.jl documentation made me suspect that would happen, as the ordering is important. The funny thing is that it does work in the vander_lv! method. Though I would ask @elrod if this is always going to work? Is there any guarantee that if using @turbo returns a correct answer it will always do so?