I was playing around with different ways to implement the `evalpoly` function.

**TLDR**

- My best version is about twice as fast as the version currently in Base and works with custom array indices.
- In this case, while loops seem to be faster than for loops.
- The Base `evalpoly` doesn't use `@inbounds`, even though it improves performance with little risk of invalid indices. (The coefficient vector must guarantee that `firstindex`, `lastindex`, and all integer indices in between are valid.)
- Chunking Horner's method into chunks of two may improve performance further.

Here is a plot summarising performance. `evalpoly` refers to the version in base Julia. Here are the implementations of the others.
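For context, here is a minimal reference version of Horner's method, which is what all the variants below implement: `evalpoly(x, cs)` evaluates `cs[1] + cs[2]*x + cs[3]*x^2 + …` by folding in coefficients from last to first (the name `horner_ref` is mine, not from Base):

```julia
# reference Horner recurrence, matching what the variants below implement
function horner_ref(x, cs)
    out = cs[end]
    for i in lastindex(cs)-1:-1:firstindex(cs)
        out = muladd(out, x, cs[i])
    end
    return out
end

# evalpoly(x, cs) evaluates cs[1] + cs[2]*x + cs[3]*x^2 + ...
# here: 1 + 2*2 + 3*4 = 17
@assert horner_ref(2.0, [1.0, 2.0, 3.0]) == evalpoly(2.0, [1.0, 2.0, 3.0]) == 17.0
```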

## while loop

```julia
function ep_while(x, cs::AbstractVector)
    i = lastindex(cs)
    @inbounds out = cs[i]
    i -= 1
    while i >= firstindex(cs)
        @inbounds out = muladd(out, x, cs[i])
        i -= 1
    end
    return out
end
```
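Since `ep_while` only relies on `firstindex` and `lastindex`, it should handle custom axes too. A quick sanity check (assuming the third-party OffsetArrays package is installed):

```julia
using OffsetArrays  # third-party package, assumed installed

cs  = [1.0, 2.0, 3.0]
ocs = OffsetArray(cs, 0:2)  # same coefficients, but indexed 0:2

# the result should not depend on the index offset
@assert ep_while(2.0, ocs) == ep_while(2.0, cs) == 17.0
```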

## four at a time

```julia
function ep_four(x, cs::AbstractVector)
    i = lastindex(cs)
    @inbounds out = cs[i]
    i -= 1
    fi = firstindex(cs)
    @inbounds while i > fi + 3
        out = muladd(out, x, cs[i])
        i -= 1
        out = muladd(out, x, cs[i])
        i -= 1
        out = muladd(out, x, cs[i])
        i -= 1
        out = muladd(out, x, cs[i])
        i -= 1
    end
    while i >= fi
        @inbounds out = muladd(out, x, cs[i])
        i -= 1
    end
    return out
end
```

## Two at a time

```julia
function ep_two(x, cs::AbstractVector)
    i = lastindex(cs)
    out = @inbounds cs[i]
    i -= 1
    fi = firstindex(cs)
    @inbounds while i > fi
        out = muladd(out, x, cs[i])
        out = muladd(out, x, cs[i-1])
        i -= 2
    end
    return i == fi ? muladd(out, x, @inbounds(cs[fi])) : out
end
```

## for loop

```julia
function ep_for(x, cs::AbstractVector)
    i = lastindex(cs)
    @inbounds out = cs[i]
    i -= 1
    for j in i:-1:firstindex(cs)
        out = @inbounds muladd(out, x, cs[j])
    end
    return out
end
```

## instruction level parallelism

```julia
function ep_ilp(x, cs::AbstractVector)
    fi = firstindex(cs)
    i = lastindex(cs)
    x2 = x*x
    x3 = x2*x
    @inbounds out = cs[i]
    i -= 1
    @inbounds while i >= fi + 3
        # the idea here is to remove data dependencies
        # to make use of instruction level parallelism
        out = out * x3 + cs[i] * x2 + cs[i-1] * x + cs[i-2]
        i -= 3
    end
    while i >= fi
        @inbounds out = muladd(out, x, cs[i])
        i -= 1
    end
    return out
end
```
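Before benchmarking, it's worth checking that all variants agree with Base. The plain Horner variants perform the same `muladd` sequence as `evalpoly`, while `ep_ilp` reassociates the recurrence, so its result can differ by a few ulps; hence the `≈` below:

```julia
x  = randn()
cs = randn(20)
base = evalpoly(x, cs)
for f in (ep_while, ep_four, ep_two, ep_for, ep_ilp)
    # ep_ilp reorders the arithmetic, so compare approximately
    @assert f(x, cs) ≈ base
end
```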

The plot uses the following benchmark:

## plot benchmark

```julia
using BenchmarkTools

setup(n) = (randn(Float64), [randn(Float64) for _ in 1:n])
# `ns` is the range of coefficient counts shown on the x-axis of the plot
bdata_base = map(ns) do k
    b = @benchmark evalpoly(x, c) setup=((x, c) = setup($k)) seconds=0.1
    minimum(b).time
end
```

Finally, I have a benchmark which tests performance for random numbers of coefficients. This may be closer to real-world code, where the number of coefficients is often not constant. This benchmark yields:

```
evalpoly : 1.0
while loop : 0.52
four at a time : 0.65
two at a time : 0.55
for loop : 0.91
instruction level parallelism : 0.59
```

where the numbers represent the minimum time divided by the base Julia implementation’s time. Here is the benchmark code:

## variable coefficient number benchmark

```julia
using BenchmarkTools

nsamples = 5000
n = 20
function setup_bench(n, nsamples)
    xs = randn(Float64, nsamples)
    js = rand(1:n-1, nsamples)
    cs = [randn(Float64, j) for j in 2:n]
    return xs, cs, js
end
function bench!(f::F, (xs, cs, js)) where F
    s = 0.0
    for i in eachindex(js)
        @inbounds s += f(xs[i], cs[js[i]])
    end
    return s
end
@benchmark bench!(evalpoly, s) setup=(s = setup_bench(n, nsamples)) evals=1 samples=100_000
```
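The relative numbers above come from dividing each variant's minimum time by the Base result. A sketch of that calculation for one variant (assuming the functions defined earlier and BenchmarkTools):

```julia
using BenchmarkTools

data = setup_bench(20, 5000)
t_base  = minimum(@benchmark bench!(evalpoly,  $data) evals=1).time
t_while = minimum(@benchmark bench!(ep_while, $data) evals=1).time
ratio = t_while / t_base  # ≈ 0.52 on the author's machine
```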

**Questions**

- Why is a while loop so much faster?
- Might it be worth chunking into groups of two?
- Can others replicate my findings?
- Shall I make a pull request for the simple while loop implementation?

I’m on the following system:

## Version Info

```
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8
```