I have three questions and I was curious if someone can provide some clarity on the results I’m seeing with a couple variations of a simple polynomial evaluation computation.
I wrote a Horner’s method implementation which is performing well, but I was curious about the for loop. Is there additional overhead by including a length(X)
in the range? Is it re-computed with each iteration, or is it only computed once? I am not seeing any noticeable difference with the benchmark between horner
and horner2
below so I’m assuming it isn’t.
Also, I included the @btime values below in the code snippet and I’m seeing a huge improvement in the execution time between wrapping the poly call in a function compared to inlining the expression. Why exactly is that? It looks like there’s a huge benefit in wrapping most expressions in a function to maybe benefit from the optimizer or something?
Finally, by using @fastmath, I was able to just about halve the execution time of the inlined function. Where is @fastmath not appropriate to use? I figured in this case with a very simple computation I wouldn’t have to worry much about error creeping into the approximation, but I am not sure how this would fare if I were manipulating floats rather than integers.
Thanks in advance!
function horner(poly, x)
result = poly[1]
for i in 1:length(poly)-1
result = result * x + poly[i+1]
end
return result
end
function horner2(poly, x)
result = poly[1]
n = length(poly)
for i in 1:n-1
result = result * x + poly[i+1]
end
return result
end
polywithpows(x) = 2x^4 - 2x^3 + 15x^2 + 12x - 6
polyexpanded(x) = 2*x*x*x*x - 2*x*x*x + 15*x*x + 12*x - 6
poly = [2, -2, 15, 12, -6];
x = 4;
@btime horner(poly, x)
#> 19.962 ns (1 allocation: 16 bytes)
@btime horner2(poly, x)
#> 19.597 ns (1 allocation: 16 bytes)
@btime foldl((acc, el) -> acc*x + el, poly[2:end], init=poly[1])
#> 2.161 μs (6 allocations: 208 bytes)
@btime polywithpows(x)
#> 18.328 ns (1 allocation: 16 bytes)
@btime polyexpanded(x)
#> 18.260 ns (1 allocation: 16 bytes)
@btime @fastmath 2x^4 - 2x^3 + 15x^2 + 12x - 6
#> 88.610 ns (3 allocations: 48 bytes)
@btime 2x^4 - 2x^3 + 15x^2 + 12x - 6
#> 142.256 ns (3 allocations: 48 bytes)