# I have a much faster version of evalpoly. Why is it faster?

I was playing around with different ways to implement the `evalpoly` function.

TLDR

• My best version is about twice as fast as the version currently in Base and works with custom array indices.
• In this case, `while` loops seem to be faster than `for` loops.
• The base function `evalpoly` doesn’t use `@inbounds`, even though it improves performance with little risk of invalid indices (the coefficient vector must guarantee that `firstindex`, `lastindex`, and all integer indices in between are valid).
• Chunking Horner’s method into chunks of two may improve performance further.

Here is a plot summarising performance.

`evalpoly` refers to the version in base Julia. Here are the implementations of the others.

while loop
```julia
function ep_while(x, cs::AbstractVector)
    i = lastindex(cs)

    @inbounds out = cs[i]
    i -= 1

    while i >= firstindex(cs)
        @inbounds out = muladd(out, x, cs[i])
        i -= 1
    end

    return out
end
```
four at a time
```julia
function ep_four(x, cs::AbstractVector)
    i = lastindex(cs)

    @inbounds out = cs[i]
    i -= 1

    fi = firstindex(cs)
    @inbounds while i > fi + 3
        out = muladd(out, x, cs[i])
        i -= 1
        out = muladd(out, x, cs[i])
        i -= 1
        out = muladd(out, x, cs[i])
        i -= 1
        out = muladd(out, x, cs[i])
        i -= 1
    end

    while i >= fi
        @inbounds out = muladd(out, x, cs[i])
        i -= 1
    end

    return out
end
```
Two at a time
```julia
function ep_two(x, cs::AbstractVector)
    i = lastindex(cs)

    out = @inbounds cs[i]
    i -= 1

    x2 = x * x
    fi = firstindex(cs)
    @inbounds while i > fi
        # consume two coefficients per iteration
        out = muladd(out, x2, muladd(x, cs[i], cs[i-1]))
        i -= 2
    end

    return i == fi ? muladd(out, x, @inbounds(cs[fi])) : out
end
```
for loop
```julia
function ep_for(x, cs::AbstractVector)
    i = lastindex(cs)

    @inbounds out = cs[i]
    i -= 1

    for j in i:-1:firstindex(cs)
        out = @inbounds muladd(out, x, cs[j])
    end

    return out
end
```
instruction level parallelism
```julia
function ep_ilp(x, cs::AbstractVector)
    fi = firstindex(cs)
    i  = lastindex(cs)

    x2 = x*x
    x3 = x2*x

    @inbounds out = cs[i]
    i -= 1

    @inbounds while i >= fi + 3
        # the idea here is to remove data dependencies
        # to make use of instruction level parallelism
        out = out * x3 + cs[i] * x2 + cs[i-1] * x + cs[i-2]
        i -= 3
    end

    while i >= fi
        @inbounds out = muladd(out, x, cs[i])
        i -= 1
    end

    return out
end
```

The plot uses the following benchmark

plot benchmark
```julia
setup(n) = (randn(Float64), [randn(Float64) for _ in 1:n])

# `ns` is the list of coefficient counts shown in the plot (defined elsewhere)
bdata_base = map(ns) do k
    b = @benchmark evalpoly(x, c) setup=((x, c) = setup($k)) seconds=0.1
    minimum(b).time
end
```

Finally, I have a benchmark that tests performance for a random number of coefficients. This may be closer to real-world code when the number of coefficients is not constant. This benchmark yields:

```
evalpoly                      : 1.0
while loop                    : 0.52
four at a time                : 0.65
two at a time                 : 0.55
for loop                      : 0.91
instruction level parallelism : 0.59
```

where the numbers represent the minimum time divided by the base Julia implementation’s time. Here is the benchmark code:

variable coefficient number benchmark
```julia
nsamples = 5000
n = 20

function setup_bench(n, nsamples)
    xs = randn(Float64, nsamples)
    js = rand(1:n-1, nsamples)
    cs = [randn(Float64, j) for j in 2:n]

    return xs, cs, js
end

function bench!(f::F, (xs, cs, js)) where F
    s = 0.0
    for i in 1:length(js)
        @inbounds s += f(xs[i], cs[js[i]])
    end

    return s
end

@benchmark bench!(evalpoly, s) setup=(s = setup_bench(n, nsamples)) evals=1 samples=100_000
```

Questions

• Why is a `while` loop so much faster?
• Might it be worth chunking into groups of two?
• Can others replicate my findings?
• Shall I make a pull request for the simple `while`-loop implementation?

I’m on the following system

Version Info
```
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
```

Tagging @Mason (and @jeff.bezanson) since they wrote (and committed) the base implementation.

A lot of the answer is that `evalpoly` has mostly been optimized for `Tuple` coefficients since then the whole thing gets unrolled. An `Array` is pretty much guaranteed to be at least 2x slower because it has to actually get the elements out.
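To illustrate this point, here is a quick sketch (not from the original thread; variable names are mine): with a statically known tuple of coefficients, `evalpoly` unrolls the whole computation at compile time, while the vector version has to loop and load each element.

```julia
using BenchmarkTools

cs_tuple = (1.0, -2.0, 3.0, -4.0, 5.0)
cs_vec   = collect(cs_tuple)   # same coefficients, but as a Vector
x = 0.7

# Both give the same result…
@assert evalpoly(x, cs_tuple) ≈ evalpoly(x, cs_vec)

# …but the tuple version is fully unrolled, the vector version is not:
@btime evalpoly($x, $cs_tuple)
@btime evalpoly($x, $cs_vec)
```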


This is confirmed by the fact that the non-`@generated` branch and the `evalpoly(x, cs::AbstractVector)` method both dispatch to the same `_evalpoly` implementation.

It’s still curious that a while loop is so much faster though.

It’s not surprising to me that you can get a factor of 2 compared to the simple `for` loop that we have now, and it might be worth a PR.

The Base implementation was mainly optimized for the case where the coefficients are a tuple, in which case the whole computation can be unrolled, because the main usage of `evalpoly` (formerly `@evalpoly`) is currently for things like SpecialFunctions.jl where the polynomial is known statically. (An even more optimized implementation for that case is SIMDPoly.jl, or StaticPolynomials.jl for a multivariate generalization.)

For arbitrary (dynamic length) polynomials, you’re usually better off not representing them in coefficient form (i.e. in a monomial basis), but instead using something like Chebyshev bases (e.g. as in ApproxFun.jl or FastChebInterp.jl), in which case the analogue of Horner’s rule is a Clenshaw recurrence. The famous exception, of course, is evaluating polynomials on the complex unit circle, in which case you have some type of discrete Fourier transform.
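As a rough sketch of the Clenshaw recurrence mentioned above (the function name `clenshaw` and the test values are my own, not from any package), for a Chebyshev series p(x) = Σₖ c[k+1] Tₖ(x):

```julia
function clenshaw(x, c::AbstractVector)
    # b1, b2 track the two most recent recurrence values b_{k+1}, b_{k+2}
    b1 = b2 = zero(x)
    for k in lastindex(c):-1:(firstindex(c) + 1)
        b1, b2 = muladd(2x, b1, c[k] - b2), b1
    end
    # final step uses x (not 2x) and the leading coefficient
    return muladd(x, b1, first(c) - b2)
end

# Check against the direct definition T_k(x) = cos(k * acos(x)) on [-1, 1]:
c = [0.5, -0.25, 0.125, 0.0625]
x = 0.3
direct = sum(c[k+1] * cos(k * acos(x)) for k in 0:length(c)-1)
@assert clenshaw(x, c) ≈ direct
```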


I’m completely new to Julia (coming from Python) and I was wondering why there is no vectorized implementation for the argument `x` (i.e. evaluating the polynomial at multiple points). Googling led me to this (recent) question/post.

Is it intentional that a user should loop over `x` themselves (so first create a result vector `similar` to `x` and then explicitly loop)? Or am I so much of a newbie that I’m missing some concepts of Julia… Might very well be the case. Thanks for any clarification.

Are you thinking about broadcasting? Short version: if you have a scalar function `f(x)`, the “vectorized” version is `f.(xArray)`.
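For example (a minimal sketch):

```julia
# A scalar function…
f(x) = 3x^2 + 2x + 1

# …applied elementwise with the dot syntax:
xs = [0.0, 0.5, 1.0]
f.(xs)   # == [1.0, 2.75, 6.0]
```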


You may be aware of this already, but “vectorized” in Python most of the time simply means “loop in C/C++”. Since Julia is not slow, a native Julia function can easily be applied to multiple inputs, via an explicit loop or the “broadcasting” syntax `.`.


Yep. But it doesn’t work for `evalpoly`… Or at least, I don’t know how to tell the code that the `x` argument should be broadcast and not `p`.

```julia
julia> p = rand(100); x = rand(1000); evalpoly.(x,p)
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 1000 and 100")
Stacktrace:
 _bcs1
 _bcs
 combine_axes
 instantiate
 top-level scope
 @ REPL:1
```

`evalpoly.(x, Ref(p))` or `evalpoly.(x, (p,))`


Thanks! I did not pick that up from reading the manual… Sorry.

Could you briefly explain what both options do exactly? Or point me to the right page in the manual?

Sometimes you want a container (like an array) that would normally participate in broadcasting to be “protected” from broadcast’s behavior of iterating over all of its elements. By placing it inside another container (like a single-element `Tuple`), broadcast will treat it as a single value.
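A small sketch of what this looks like for `evalpoly` (the values here are just for illustration):

```julia
# Coefficients of 1 + 2x + 3x^2:
p = [1.0, 2.0, 3.0]
x = [0.0, 1.0, 2.0]

# Ref(p) marks p as a scalar for broadcasting, so only x is iterated:
ys = evalpoly.(x, Ref(p))   # == [evalpoly(xi, p) for xi in x]
@assert ys == [1.0, 6.0, 17.0]
```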

(I noticed that it doesn’t mention `Ref`, so I just submitted a PR.)


Is there a problem with also using square brackets as the shielding container?

```julia
evalpoly.(x, [p])
```

The compiler can currently eliminate the allocation of a `Ref`, but not of an `Array`.

This should change one day, in which case creating an `Array` will also be fine from a performance perspective.

However, a `Vector` is a 1-dimensional container, while a `Ref` is a scalar.
The length of a vector is not known at compile time, and that information is currently not const-propagated from the constructor, either.
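A rough way to see the difference (a sketch, and the exact numbers will vary by Julia version):

```julia
p = randn(100); x = randn(1000)

bcast_ref(x, p) = evalpoly.(x, Ref(p))
bcast_vec(x, p) = evalpoly.(x, [p])

# Compile both first, then measure allocations:
bcast_ref(x, p); bcast_vec(x, p)
@show @allocated bcast_ref(x, p)
@show @allocated bcast_vec(x, p)   # typically shows the extra [p] allocation
```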
