# Improving Computational Performance

I have three questions, and I was curious whether someone could provide some clarity on the results I’m seeing with a couple of variations of a simple polynomial evaluation computation.

I wrote a Horner’s method implementation which is performing well, but I was curious about the for loop. Is there additional overhead by including a `length(X)` in the range? Is it re-computed with each iteration, or is it only computed once? I am not seeing any noticeable difference with the benchmark between `horner` and `horner2` below so I’m assuming it isn’t.

Also, I included the `@btime` values below in the code snippet, and I’m seeing a huge improvement in execution time when wrapping the poly call in a function compared to inlining the expression. Why exactly is that? It looks like there’s a huge benefit in wrapping most expressions in a function to maybe benefit from the optimizer or something?

Finally, by using `@fastmath`, I was able to just about halve the execution time of the inlined function. Where is `@fastmath` not appropriate to use? I figured that in this case, with a very simple computation, I wouldn’t have to worry much about error creeping into the approximation, but I’m not sure how this would fare if I were manipulating floats rather than integers.

```julia
using BenchmarkTools

function horner(poly, x)
    result = poly[1]
    for i in 1:length(poly)-1
        result = result * x + poly[i+1]
    end
    return result
end

function horner2(poly, x)
    result = poly[1]
    n = length(poly)
    for i in 1:n-1
        result = result * x + poly[i+1]
    end
    return result
end

polywithpows(x) = 2x^4 - 2x^3 + 15x^2 + 12x - 6
polyexpanded(x) = 2*x*x*x*x - 2*x*x*x + 15*x*x + 12*x - 6
poly = [2, -2, 15, 12, -6];
x = 4;

@btime horner(poly, x)
#> 19.962 ns (1 allocation: 16 bytes)

@btime horner2(poly, x)
#> 19.597 ns (1 allocation: 16 bytes)

@btime foldl((acc, el) -> acc*x + el, poly[2:end], init=poly[1])
#> 2.161 μs (6 allocations: 208 bytes)

@btime polywithpows(x)
#> 18.328 ns (1 allocation: 16 bytes)

@btime polyexpanded(x)
#> 18.260 ns (1 allocation: 16 bytes)

@btime @fastmath 2x^4 - 2x^3 + 15x^2 + 12x - 6
#> 88.610 ns (3 allocations: 48 bytes)

@btime 2x^4 - 2x^3 + 15x^2 + 12x - 6
#> 142.256 ns (3 allocations: 48 bytes)
```

You should also try `Base.evalpoly`.
If your polynomial is constant, I’d also recommend making it a `tuple` instead of a `Vector`.

```julia
julia> polyt = (2, -2, 15, 12, -6);

julia> @btime horner($poly, $(Ref(x))[])
  2.159 ns (0 allocations: 0 bytes)
666

julia> @btime horner($polyt, $(Ref(x))[])
  0.868 ns (0 allocations: 0 bytes)
666

julia> @btime evalpoly($(Ref(x))[], $(reverse(polyt)))
  0.868 ns (0 allocations: 0 bytes)
666

julia> @btime polyexpanded($(Ref(x))[])
  0.867 ns (0 allocations: 0 bytes)
666

julia> @btime polywithpows($(Ref(x))[])
  2.790 ns (0 allocations: 0 bytes)
666
```

> Is there additional overhead by including a `length(X)` in the range? Is it re-computed with each iteration, or is it only computed once?

No, no, yes.
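A quick way to convince yourself (a hypothetical instrumented sketch, not from the thread: the `calls` counter and `counted_length` wrapper are my own names): route the length computation through a counting wrapper. The range in the `for` header is constructed once, before iteration begins.

```julia
# Hypothetical instrumentation: count how often the length computation runs.
calls = Ref(0)
counted_length(v) = (calls[] += 1; length(v))

function horner_counted(poly, x)
    result = poly[1]
    for i in 1:counted_length(poly)-1  # the range is built once, up front
        result = result * x + poly[i+1]
    end
    return result
end

horner_counted([2, -2, 15, 12, -6], 4)  # 666, same as horner
calls[]                                 # 1: not once per iteration
```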

> Why exactly is that? It looks like there’s a huge benefit in wrapping most expressions in a function to maybe benefit from the optimizer or something?

Note that by interpolating arguments with `$`, you can make the expression behave as though they’re known at compile time.
Another issue is that slicing in Julia creates a copy of that section of the array. You can create `view`s instead to avoid allocating a temporary array:

```julia
julia> @btime foldl((acc, el) -> acc*x + el, poly[2:end], init=poly[1])
  1.226 μs (6 allocations: 208 bytes)
666

julia> @btime foldl((acc, el) -> acc*$x + el, @view($poly[2:end]), init=$poly[1])
  4.510 ns (0 allocations: 0 bytes)
666
```

> Finally, by using `@fastmath`, I was able to just about halve the execution time of the inlined function. Where is `@fastmath` not appropriate to use?

`@fastmath` shouldn’t normally improve performance on integer code.

```julia
julia> polywithpows_fm(x) = @fastmath 2x^4 - 2x^3 + 15x^2 + 12x - 6
polywithpows_fm (generic function with 1 method)

julia> @btime polywithpows_fm($(Ref(x))[])
  2.580 ns (0 allocations: 0 bytes)
666

julia> @btime polywithpows($(Ref(x))[])
  2.784 ns (0 allocations: 0 bytes)
666
```

`@fastmath` allows the optimizer to assume floating point arithmetic is associative. It isn’t, so `@fastmath` will often change the rounding of your code.
For naively written code, this will more often make results more accurate than less accurate, because it allows the use of fused multiply-add (FMA) instructions that multiply, add, and then round once. Done as separate operations, you’d have to multiply, round, add, round.
If you write a sum loop with `@fastmath` or `@simd`, and another without, the former will generally be more accurate for floating point numbers, because the compiler can use many separate accumulators in parallel to accelerate your sum. But this changes the rounding and isn’t what you wrote, so the compiler needs your permission to do it.
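As a small illustration (my sketch, not from the thread): summing many `Float32` values with a plain single-accumulator loop versus with `@simd`, compared against a `Float64` reference. On hardware where the `@simd` loop vectorizes, the multiple accumulators typically keep its result closer to the reference.

```julia
function sum_naive(v)
    s = zero(eltype(v))
    for x in v
        s += x               # one accumulator: rounding error compounds
    end
    return s
end

function sum_simd(v)
    s = zero(eltype(v))
    @simd for i in eachindex(v)
        s += v[i]            # compiler may split this across several accumulators
    end
    return s
end

v = fill(0.1f0, 1_000_000)
ref = sum(Float64.(v))       # higher-precision reference
abs(Float64(sum_naive(v)) - ref)
abs(Float64(sum_simd(v)) - ref)
```

The exact error of each version is hardware- and compiler-dependent, which is precisely why the reordering is opt-in.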

The compiler needs your permission because many algorithms are not written naively with respect to floating point rounding. The most common examples are special functions, which are normally written with great care to hit their target accuracy. Sometimes enabling fast math can completely break such an algorithm. For example, from the command line:

```
> julia -E 'sinpi.((1.2, 1.35, 1.5, 1.65, 1.8, 1.95, 2.1))'
(-0.587785252292473, -0.891006524188368, -1.0, -0.891006524188368, -0.587785252292473, -0.156434465040231, 0.3090169943749477)

> julia --math-mode=fast -E 'sinpi.((1.2, 1.35, 1.5, 1.65, 1.8, 1.95, 2.1))'
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
```

Fast math also does more than this: it allows the compiler to assume there are no NaNs and no Infs in the code.
It also allows swapping `inv(sqrt(x))` for an approximation plus Newton-Raphson iterations (which should achieve comparable accuracy).
Here is a description of the various optimizations fast-math enables.

Integer math actually is associative, so no such assumptions are needed to optimize it.
More specifically, Julia won’t change integer code at all under `@fastmath`.


Do you know whether there is a (simple) way to enable only some of these LLVM flags in Julia? For example, it can often be helpful to allow FMAs (MuladdMacro.jl could be used for that) plus a few additional flags.

`llvmcall` is probably the simplest, followed by `LLVM.jl`.

```julia
@inline function add_contract(x::Float64, y::Float64)
    Base.llvmcall("%res = fadd contract double %0, %1\nret double %res",
                  Float64, Tuple{Float64,Float64}, x, y)
end
@inline function mul_contract(x::Float64, y::Float64)
    Base.llvmcall("%res = fmul contract double %0, %1\nret double %res",
                  Float64, Tuple{Float64,Float64}, x, y)
end
```

Note that `llvmcall` is considered expensive by the inliner: it doesn’t know what your `llvmcall` is doing, so it assigns an arbitrary (high) cost.
If you use this, many functions that were inlined before will no longer be inlined.
I get:

```julia
julia> add_contract(1.2, 2.3)
3.5
```

```llvm
define double @julia_add_contract_12886(double %0, double %1) {
top:
  %res.i = fadd contract double %0, %1
  ret double %res.i
}
```

```julia
julia> mymuladd(a, b, c) = add_contract(mul_contract(a, b), c)
mymuladd (generic function with 1 method)
```

```asm
        .text
        vfmadd213sd     xmm0, xmm1, xmm2        # xmm0 = (xmm1 * xmm0) + xmm2
        ret
        nop     word ptr cs:[rax + rax]
```
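For the multiply-add case specifically, it's worth noting that Base already provides `muladd`, which likewise permits (but does not require) contraction into a single FMA instruction, with none of the `llvmcall` inlining-cost downside (`mymuladd_base` is my own illustrative name):

```julia
# muladd(a, b, c) computes a*b + c, and the compiler is free to fuse it
# into one FMA instruction where that is profitable.
mymuladd_base(a, b, c) = muladd(a, b, c)

mymuladd_base(2.0, 3.0, 1.0)  # 7.0
```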

Thanks!


This is great information! Thank you!

I’m curious whether there is a preferred way to convert a tuple to an array for those cases where I need to do so. I’ve been using the following approach when the data types are consistent, but I’m not sure if there’s a better way:

```julia
six_primes = (2, 3, 5, 7, 11, 13)
Int[six_primes...]
```

Splatting small tuples is fine.
A couple alternatives include

```julia
julia> using StaticArrays

julia> six_primes = (2, 3, 5, 7, 11, 13);

julia> SVector(six_primes) # allocation free, but immutable
6-element SVector{6, Int64} with indices SOneTo(6):
  2
  3
  5
  7
 11
 13

julia> MVector(six_primes) # if you want to be able to mutate it
6-element MVector{6, Int64} with indices SOneTo(6):
  2
  3
  5
  7
 11
 13
```
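Beyond splatting, `collect` is the simplest Base-only route to a plain `Vector` (a small sketch of my own, not from the replies above):

```julia
six_primes = (2, 3, 5, 7, 11, 13)

# collect infers the element type from the tuple, giving a Vector{Int64} here.
v = collect(six_primes)
```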

There’s also `StrideArrays.jl`, which is much faster for some things, but it’s less well tested and (compared to `StaticArrays`) a lot of methods are missing.
