I was interested in testing this case (hoping I didn’t mess up the Clenshaw algorithm) where everything is unrolled with constant coefficients.

## Clenshaw code and constants

```
# Clenshaw recurrence for a Chebyshev series with coefficients c (constant term first)
function clenshaw_chebyshev(x, c)
    x2 = 2x
    c0 = c[end-1]
    c1 = c[end]
    for i in length(c)-2:-1:1
        c0, c1 = c[i] - c1, muladd(c1, x2, c0)
    end
    return muladd(c1, x, c0)
end
const n5 = (-1.0, 0.5, -0.3333333333333333, 0.25, -0.2)
const n10 = (-1.0, 0.5, -0.3333333333333333, 0.25, -0.2, 0.16666666666666666, -0.14285714285714285, 0.125, -0.1111111111111111, 0.1)
const n25 = (-1.0, 0.5, -0.3333333333333333, 0.25, -0.2, 0.16666666666666666, -0.14285714285714285, 0.125, -0.1111111111111111, 0.1, -0.09090909090909091, 0.08333333333333333, -0.07692307692307693, 0.07142857142857142, -0.06666666666666667, 0.0625, -0.058823529411764705, 0.05555555555555555, -0.05263157894736842, 0.05, -0.047619047619047616, 0.045454545454545456, -0.043478260869565216, 0.041666666666666664, -0.04)
const n50 = (-1.0, 0.5, -0.3333333333333333, 0.25, -0.2, 0.16666666666666666, -0.14285714285714285, 0.125, -0.1111111111111111, 0.1, -0.09090909090909091, 0.08333333333333333, -0.07692307692307693, 0.07142857142857142, -0.06666666666666667, 0.0625, -0.058823529411764705, 0.05555555555555555, -0.05263157894736842, 0.05, -0.047619047619047616, 0.045454545454545456, -0.043478260869565216, 0.041666666666666664, -0.04, 0.038461538461538464, -0.037037037037037035, 0.03571428571428571, -0.034482758620689655, 0.03333333333333333, -0.03225806451612903, 0.03125, -0.030303030303030304, 0.029411764705882353, -0.02857142857142857, 0.027777777777777776, -0.02702702702702703, 0.02631578947368421, -0.02564102564102564, 0.025, -0.024390243902439025, 0.023809523809523808, -0.023255813953488372, 0.022727272727272728, -0.022222222222222223, 0.021739130434782608, -0.02127659574468085, 0.020833333333333332, -0.02040816326530612, 0.02)
```
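Since I was worried about messing up the algorithm, here is a quick sanity check (`chebyshev_direct` is just my own helper name, nothing standard): for `|x| ≤ 1` the series can also be evaluated directly via `Tₙ(x) = cos(n·acos(x))`. The function is repeated so the snippet runs on its own:

```julia
# Repeats clenshaw_chebyshev from above so this snippet is self-contained
function clenshaw_chebyshev(x, c)
    x2 = 2x
    c0 = c[end-1]
    c1 = c[end]
    for i in length(c)-2:-1:1
        c0, c1 = c[i] - c1, muladd(c1, x2, c0)
    end
    return muladd(c1, x, c0)
end

# Direct evaluation via T_n(x) = cos(n*acos(x)), valid for |x| <= 1
chebyshev_direct(x, c) = sum(c[k] * cos((k - 1) * acos(x)) for k in eachindex(c))

c = (-1.0, 0.5, -0.3333333333333333, 0.25, -0.2)  # same as n5
clenshaw_chebyshev(0.3, c) ≈ chebyshev_direct(0.3, c)  # should hold
```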

## Benchmark summary

```
# degree 5
julia> @btime clenshaw_chebyshev(x, n5) setup=(x=rand())
2.355 ns (0 allocations: 0 bytes)
julia> @btime evalpoly(x, n5) setup=(x=rand())
2.353 ns (0 allocations: 0 bytes)
# degree 10
julia> @btime clenshaw_chebyshev(x, n10) setup=(x=rand())
3.329 ns (0 allocations: 0 bytes)
julia> @btime evalpoly(x, n10) setup=(x=rand())
3.281 ns (0 allocations: 0 bytes)
# degree 25
julia> @btime clenshaw_chebyshev(x, n25) setup=(x=rand())
10.777 ns (0 allocations: 0 bytes)
julia> @btime evalpoly(x, n25) setup=(x=rand())
8.826 ns (0 allocations: 0 bytes)
# degree 50
julia> @btime clenshaw_chebyshev(x, n50) setup=(x=rand())
28.561 ns (0 allocations: 0 bytes)
julia> @btime evalpoly(x, n50) setup=(x=rand())
23.273 ns (0 allocations: 0 bytes)
```

So for degrees < 10, Clenshaw is just a bit slower (I probably need a better benchmark), and for larger-degree polynomials the Clenshaw algorithm was **≈23% slower** than the Horner scheme using `evalpoly`, which is actually more comparable than I originally thought.

For high-degree polynomials (> 10), it might be helpful to implement the SIMD variant of the `evalpoly` function (which could be similarly employed for Clenshaw).

## SIMD 2nd degree Horner

```
@inline function horner_degree2(x, P)
    # two independent Horner chains in x^2 (even- and odd-index coefficients),
    # combined at the end depending on the parity of length(P)
    N = length(P)
    xx = x * x
    out = P[end]
    out2 = P[end-1]
    for i in N-2:-2:2
        out = muladd(xx, out, P[i])
        out2 = muladd(xx, out2, P[i-1])
    end
    if iszero(rem(N, 2))
        return muladd(out, x, out2)
    else
        out = muladd(xx, out, P[1])
        return muladd(x, out2, out)
    end
end
julia> @btime horner_degree2(x, n5) setup=(x=rand())
2.581 ns (0 allocations: 0 bytes)
julia> @btime horner_degree2(x, n10) setup=(x=rand())
2.585 ns (0 allocations: 0 bytes)
julia> @btime horner_degree2(x, n25) setup=(x=rand())
5.258 ns (0 allocations: 0 bytes)
julia> @btime horner_degree2(x, n50) setup=(x=rand())
13.268 ns (0 allocations: 0 bytes)
```
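One thing worth checking here (my addition): the two branches at the end handle even and odd coefficient counts differently, so the unrolled version should be verified against `evalpoly` for both parities. The function is repeated so the snippet runs on its own:

```julia
# Repeats horner_degree2 from above so this snippet is self-contained
@inline function horner_degree2(x, P)
    N = length(P)
    xx = x * x
    out = P[end]
    out2 = P[end-1]
    for i in N-2:-2:2
        out = muladd(xx, out, P[i])
        out2 = muladd(xx, out2, P[i-1])
    end
    if iszero(rem(N, 2))
        return muladd(out, x, out2)
    else
        out = muladd(xx, out, P[1])
        return muladd(x, out2, out)
    end
end

even = (1.0, -2.0, 3.0, -4.0)       # N = 4, even branch
odd  = (1.0, -2.0, 3.0, -4.0, 5.0)  # N = 5, odd branch
x = 0.37
horner_degree2(x, even) ≈ evalpoly(x, even)  # ≈ rather than ==, since the
horner_degree2(x, odd) ≈ evalpoly(x, odd)    # operations are grouped differently
```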

So the degree-50 polynomial using the SIMD Horner scheme (unrolling factor of 2) is **≈43% faster**, achieving close to the optimal 2x speedup.

I think it should be said that the SIMD variants will produce slightly different results compared to the regular Horner’s scheme (≈0.5 ULP), with second-order schemes producing slightly higher errors of ≈0.5–1.5 ULP. There are a couple of instances where the second-order scheme might be more accurate for alternating series, but that depends on the coefficients. Though this can be tested if the coefficients are known.
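For a known coefficient set, the ULP difference can be estimated directly against a `BigFloat` reference; a rough sketch (`ulp_error` is my own helper, and I only check the plain `evalpoly` path here):

```julia
# Rough ULP-error estimate: compare a Float64 evaluation against a
# BigFloat reference rounded back to Float64
function ulp_error(val::Float64, ref::BigFloat)
    r = Float64(ref)
    return abs(val - r) / eps(r)
end

c = (-1.0, 0.5, -0.3333333333333333, 0.25, -0.2)  # same as n5
x = 0.5
err = ulp_error(evalpoly(x, c), evalpoly(big(x), c))  # expect well under 1 ULP here
```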

I’m not sure it’s always optimal to use the SIMD implementation, as it depends on the application.

Are we evaluating the same polynomial a million times (might be best to SIMD individual evaluations)?

Or are we evaluating a high degree polynomial just a couple of times (probably best to SIMD the polynomial evaluation)?
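For the first case, a minimal sketch of what I mean (the coefficient tuple is one of the ones above): broadcasting `evalpoly` over a vector of points lets the compiler vectorize *across* evaluations rather than within a single one:

```julia
c = (-1.0, 0.5, -0.3333333333333333, 0.25, -0.2)  # same as n5
xs = range(0.0, 1.0; length = 1000)

# One evaluation per point; the loop across points is what gets vectorized.
# Ref(c) treats the coefficient tuple as a scalar under broadcasting.
ys = evalpoly.(xs, Ref(c))
```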

This also comes up in higher-order Chebyshev bases, like approximating 2D functions, where it might be best to SIMD the evaluations in the second dimension… (I should probably make sure this is happening in some of my code…)
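For the 2D case, a sketch of the structure I have in mind (`clenshaw2d` is a made-up name, and this just nests the 1D routine; the loop over rows is the candidate for SIMD across evaluations):

```julia
# Repeats the 1D routine so this snippet is self-contained
function clenshaw_chebyshev(x, c)
    x2 = 2x
    c0 = c[end-1]
    c1 = c[end]
    for i in length(c)-2:-1:1
        c0, c1 = c[i] - c1, muladd(c1, x2, c0)
    end
    return muladd(c1, x, c0)
end

# Nested Clenshaw for f(x, y) = sum(C[i, j] * T_{i-1}(x) * T_{j-1}(y)):
# collapse the y-dimension along each row, then run Clenshaw in x
# over the collapsed coefficients
function clenshaw2d(x, y, C::AbstractMatrix)
    d = [clenshaw_chebyshev(y, view(C, i, :)) for i in 1:size(C, 1)]
    return clenshaw_chebyshev(x, d)
end

# Example: C = [1 2; 3 4] means 1 + 2*T1(y) + 3*T1(x) + 4*T1(x)*T1(y)
clenshaw2d(0.5, 0.25, [1.0 2.0; 3.0 4.0])  # = 1 + 2(0.25) + 3(0.5) + 4(0.5)(0.25)
```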