I have a much faster version of evalpoly. Why is it faster?

I was playing around with different ways to implement the evalpoly function.

TLDR

  • My best version is about twice as fast as the version currently in base and works with custom array indices.
  • In this case while loops seem to be faster than for loops.
  • the base function evalpoly doesn’t use @inbounds, eventhough it improves performance with little risk of invalid indices. (The coefficient vector must guarantee that firstindex, lastindex and all integer indices in between are valid)
  • chunking Horner’s method into chunks of two may improve performance further.

Here is a plot summarising performance.

evalpoly refers to the version in base Julia. Here are the implementations of the others.

while loop
function ep_while(x, cs::AbstractVector)
	i = lastindex(cs)

	@inbounds out = cs[i]
	i -= 1

	while i >= firstindex(cs)
		@inbounds out = muladd(out, x, cs[i])
		i -= 1
	end

	return out
end
four at a time
function ep_four(x, cs::AbstractVector)
	i = lastindex(cs)

	@inbounds out = cs[i]
	i -= 1

	fi = firstindex(cs)
	@inbounds while i > fi + 3
		out = muladd(out, x, cs[i])
		i -= 1
		out = muladd(out, x, cs[i])
		i -= 1
		out = muladd(out, x, cs[i])
		i -= 1
		out = muladd(out, x, cs[i])
		i -= 1
	end

	while i >= fi
		@inbounds out = muladd(out, x, cs[i])
		i -= 1
	end

	return out
end
Two at a time
function ep_two(x, cs::AbstractVector)
	i = lastindex(cs)

	out = @inbounds cs[i]
	i -= 1

	fi = firstindex(cs)
	@inbounds while i > fi
		out = muladd(out, x, cs[i])
		out = muladd(out, x, cs[i-1])

		i -= 2
	end

	return i == fi ? muladd(out, x, @inbounds(cs[fi])) : out
end
for loop
function ep_for(x, cs::AbstractVector)
	i = lastindex(cs)
	
	@inbounds out = cs[i]
	i -= 1
	
	for j in i:-1:firstindex(cs)
		out = @inbounds muladd(out, x, cs[j])
	end

	return out
end
instruction level parallelism
function ep_ilp(x, cs::AbstractVector)
	fi = firstindex(cs)
	i  = lastindex(cs)
	
	x2 = x*x
	x3 = x2*x
	
	@inbounds out = cs[i]
	i -= 1

	@inbounds while i >= fi + 3
		# the idea here is to remove data dependencies
		# to make use of instruction level parallelism
		out = out * x3 + cs[i] * x2 + cs[i-1] * x + cs[i-2]
		i -= 3
	end

	while i >= fi
		@inbounds out = muladd(out, x, cs[i])
		i -= 1
	end

	return out
end

The plot uses the following benchmark

plot benchmark
setup(n) = (randn(Float64), [randn(Float64) for _ in 1:n])

bdata_base = map(ns) do k
	b = @benchmark evalpoly(x, c) setup=((x, c) = setup($k)) seconds=0.1
	minimum(b).time
end

Finally, I have a benchmark which tests performance for random numbers of coefficients. This may be closer to real world code if the number of coefficients is not constant. This benchmark yields:

evalpoly                      : 1.0
while loop                    : 0.52
four at a time                : 0.65
two at a time                 : 0.55
for loop                      : 0.91
instruction level parallelism : 0.59

where the numbers represent the minimum time divided by the base Julia implementation’s time. Here is the benchmark code:

variable coefficient number benchmark
nsamples = 5000
n = 20

function setup_bench(n, nsamples)
	xs = randn(Float64, nsamples)
	js = rand(1:n-1, nsamples)
	cs = [randn(Float64, j) for j in 2:n]

	return xs, cs, js
end

function bench!(f::F, (xs, cs, js)) where F
	s = 0.0
	for i in 1:length(js)
		@inbounds s += f(xs[i], cs[js[i]])
	end

	return s
end

@benchmark bench!(evalpoly, s) setup=(s = setup_bench(n, nsamples)) evals=1 samples=100_000

Questions

  • Why is a while loop so much faster?
  • Might it be worth chunking into groups of two?
  • Can others replicate my findings?
  • Shall I make a pull request for the simple while loop implementation?

I’m on the following system

Version Info
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8
1 Like

Tagging @Mason (and @jeff.bezanson) since they wrote (and commited) the base implementation.

A lot of the answer is that evalpoly has mostly been optimized for Tuple coefficients since then the whole thing gets unrolled. An Array is pretty much guaranteed to be at least 2x slower because it has to actually get the elements out.

4 Likes

This is confirmed by the fact that the not @generated branch and the evalpoly(x, cs::AbstractVector) both go to the same _evalpoly implementation.

It’s still curious that a while loop is so much faster though.

It’s not surprising to me that you can get a factor of 2 compared to the simple for loop that we have now, and it might be worth a PR.

The Base implementation was mainly optimized for the case where the coefficients are a tuple, in which case the whole computation can be unrolled, because the main usage of evalpoly (formerly @evalpoly) is currently for things like SpecialFunctions.jl where the polynomial is known statically. (An even more optimized implementation for that case is SIMDPoly.jl, or StaticPolynomials.jl for a multivariate generalization.)

For arbitrary (dynamic length) polynomials, you’re usually better off not representing them in coefficient form (i.e. in a monomial basis), but instead using something like Chebyshev bases (e.g. as in ApproxFun.jl or FastChebInterp.jl), in which case the analogue of Horner’s rule is a Clenshaw recurrence. The famous exception, of course, is evaluating polynomials on the complex unit circle, in which case you have some type of discrete Fourier transform.

11 Likes

I made a PR.

2 Likes

I’m completely new to Julia (coming from Python) and I was wondering why there is no vectorized implementation for the argument x (i.e. evaluating the polynomial in multiple points). Google’ing led me to this (recent) question/post.

Is it intentional that a user should loop over x themselves (so first create a result vector similar to x and then explicitly loop? Or am I so newbie that I’m missing out on some concepts of Julia… Might be very well the case. Thanks for any clarification.

Are you thinking about broadcasting? Short version, if you have a scalar function f(x), the “vectorized” version is f.(xArray).

5 Likes

you may be aware of this already, but “vectorized” in Python most of the times simply means “loop in C/C++”, since Julia is not-slow, a native Julia function can be easily applied to multiple inputs, via explicit loop, or “broadcasting” syntax .

4 Likes

Yep. But it doesn’t work for evalpoly… Or at least, I don’t know how to tell the code that the x argument should be broadcasted and not the p.

julia> p = rand(100); x = rand(1000); evalpoly.(x,p)
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 1000 and 100")Stacktrace:
 [1] _bcs1
   @ .\broadcast.jl:516 [inlined]
 [2] _bcs
   @ .\broadcast.jl:510 [inlined]
 [3] broadcast_shape
   @ .\broadcast.jl:504 [inlined]
 [4] combine_axes
   @ .\broadcast.jl:499 [inlined]
 [5] instantiate
   @ .\broadcast.jl:281 [inlined]
 [6] materialize(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(evalpoly), Tuple{Vector{Float64}, Vector{Float64}}})
   @ Base.Broadcast .\broadcast.jl:860
 [7] top-level scope
   @ REPL[4]:1

evalpoly.(x,Ref(p)) or evalpoly.(x,(p,))

See also How to broadcast over only certain function arguments?

6 Likes

Thanks! I did not pick that up from reading the manual… Sorry.

Could you briefly explain what both options do exactly? Or point me to the right page in the manual?

See Broadcasting in the manual:

Sometimes, you want a container (like an array) that would normally participate in broadcast to be “protected” from broadcast’s behavior of iterating over all of its elements. By placing it inside another container (like a single element Tuple ) broadcast will treat it as a single value.

(I noticed that it doesn’t mention Ref, so I just submitted a PR.)

3 Likes

Is there a problem with using also square brackets as a shielding container?

evalpoly.(x, [p]) 

The compiler can currently eliminate the allocation of a Ref, but not an Array.

This should change one day, in which case creating an Array will be fine from a performance-perspective.

However, a Vector is also a 1-dimensional container, while Refs are scalars.
The length of a vector is not known at compile time, and that information also currently is also not const-propped from the constructor, either.

8 Likes