For evaluating polynomials by Horner’s rule, I’ve found it’s much better to inline and unroll the whole calculation so that there is no loop, just a sequence of fma’s. I’ve gotten big speedups (2-3x) in this way for the evaluation of special functions in Julia compared to traditional Fortran implementations (CEPHES etc) that used loops over arrays of coefficients.
Julia’s evalpoly
function does this unrolling for you given a tuple of coefficients. It even uses a fancier non-Horner recurrence for complex arguments that saves a factor of ~2 in the number of multiplications (and is significantly faster than Horner in practice).
And there is even a package that generates inlined SIMD-based polynomial evaluation, which beats the scalar algorithm for degrees > 16 or so: GitHub - augustt198/SIMDPoly.jl