# Improving Computational Performance

I have three questions, and I was curious whether someone could provide some clarity on the results I’m seeing with a couple of variations of a simple polynomial evaluation computation.

I wrote a Horner’s method implementation which is performing well, but I was curious about the for loop. Is there additional overhead by including a `length(X)` in the range? Is it re-computed with each iteration, or is it only computed once? I am not seeing any noticeable difference with the benchmark between `horner` and `horner2` below so I’m assuming it isn’t.

Also, I included the `@btime` values below in the code snippet, and I’m seeing a huge improvement in execution time when wrapping the poly call in a function compared to inlining the expression. Why exactly is that? It looks like there’s a huge benefit in wrapping most expressions in a function to maybe benefit from the optimizer or something?

Finally, by using `@fastmath`, I was able to just about halve the execution time of the inlined function. Where is `@fastmath` not appropriate to use? I figured that in this case, with a very simple computation, I wouldn’t have to worry much about error creeping into the approximation, but I’m not sure how this would fare if I were manipulating floats rather than integers.

```julia
using BenchmarkTools

function horner(poly, x)
    result = poly[1]
    for i in 1:length(poly)-1
        result = result * x + poly[i+1]
    end
    return result
end

function horner2(poly, x)
    result = poly[1]
    n = length(poly)
    for i in 1:n-1
        result = result * x + poly[i+1]
    end
    return result
end

polywithpows(x) = 2x^4 - 2x^3 + 15x^2 + 12x - 6
polyexpanded(x) = 2*x*x*x*x - 2*x*x*x + 15*x*x + 12*x - 6
poly = [2, -2, 15, 12, -6];
x = 4;

@btime horner(poly, x)
#> 19.962 ns (1 allocation: 16 bytes)

@btime horner2(poly, x)
#> 19.597 ns (1 allocation: 16 bytes)

@btime foldl((acc, el) -> acc*x + el, poly[2:end], init=poly[1])
#> 2.161 μs (6 allocations: 208 bytes)

@btime polywithpows(x)
#> 18.328 ns (1 allocation: 16 bytes)

@btime polyexpanded(x)
#> 18.260 ns (1 allocation: 16 bytes)

@btime @fastmath 2x^4 - 2x^3 + 15x^2 + 12x - 6
#> 88.610 ns (3 allocations: 48 bytes)

@btime 2x^4 - 2x^3 + 15x^2 + 12x - 6
#> 142.256 ns (3 allocations: 48 bytes)
```

You should also try `Base.evalpoly`.
If your polynomial is constant, I’d also recommend making it a `tuple` instead of a `Vector`.

```julia
julia> polyt = (2, -2, 15, 12, -6);

julia> @btime horner($poly, $(Ref(x))[])
  2.159 ns (0 allocations: 0 bytes)
666

julia> @btime horner($polyt, $(Ref(x))[])
  0.868 ns (0 allocations: 0 bytes)
666

julia> @btime evalpoly($(Ref(x))[], $(reverse(polyt)))
  0.868 ns (0 allocations: 0 bytes)
666

julia> @btime polyexpanded($(Ref(x))[])
  0.867 ns (0 allocations: 0 bytes)
666

julia> @btime polywithpows($(Ref(x))[])
  2.790 ns (0 allocations: 0 bytes)
666
```

> Is there additional overhead by including a `length(X)` in the range? Is it re-computed with each iteration, or is it only computed once?

No, no, yes.
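A quick way to convince yourself (a hypothetical instrumented sketch, not from the thread: the `calls` counter and `counted_length` wrapper are my own names): route the length computation through a counting wrapper. The range in the `for` header is constructed once, before iteration begins.

```julia
# Hypothetical instrumentation: count how often the length computation runs.
calls = Ref(0)
counted_length(v) = (calls[] += 1; length(v))

function horner_counted(poly, x)
    result = poly[1]
    for i in 1:counted_length(poly)-1  # the range is built once, up front
        result = result * x + poly[i+1]
    end
    return result
end

horner_counted([2, -2, 15, 12, -6], 4)  # 666, same as horner
calls[]                                 # 1: not once per iteration
```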

> Why exactly is that? It looks like there’s a huge benefit in wrapping most expressions in a function to maybe benefit from the optimizer or something?

Note that by interpolating arguments with `$`, you can make the expression behave as though they’re known at compile time.
Another issue is that slicing in Julia creates a copy of that section of the array. You can create `view`s instead to avoid allocating a temporary array:

```julia
julia> @btime foldl((acc, el) -> acc*x + el, poly[2:end], init=poly[1])
  1.226 μs (6 allocations: 208 bytes)
666

julia> @btime foldl((acc, el) -> acc*$x + el, @view($poly[2:end]), init=$poly[1])
  4.510 ns (0 allocations: 0 bytes)
666
```

> Finally, by using `@fastmath`, I was able to just about halve the execution time of the inlined function. Where is `@fastmath` not appropriate to use?

`@fastmath` shouldn’t normally improve performance on integer code.

```julia
julia> polywithpows_fm(x) = @fastmath 2x^4 - 2x^3 + 15x^2 + 12x - 6
polywithpows_fm (generic function with 1 method)

julia> @btime polywithpows_fm($(Ref(x))[])
  2.580 ns (0 allocations: 0 bytes)
666

julia> @btime polywithpows($(Ref(x))[])
  2.784 ns (0 allocations: 0 bytes)
666
```

`@fastmath` allows the optimizer to assume floating point arithmetic is associative. It isn’t, so `@fastmath` will often change the rounding of your code.
For naively written code, this will more often make results more accurate than less accurate, because it allows the use of fused multiply-add (FMA) instructions that multiply, add, and then round once. Done as separate operations, you’d have to multiply, round, add, round.
If you write a sum loop with `@fastmath` or `@simd`, and another without, the former will generally be more accurate for floating point numbers, because the compiler can use many separate accumulators in parallel to accelerate your sum. But this changes the rounding and isn’t what you wrote, so the compiler needs your permission to do it.
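As a small illustration (my sketch, not from the thread): summing many `Float32` values with a plain single-accumulator loop versus with `@simd`, compared against a `Float64` reference. On hardware where the `@simd` loop vectorizes, the multiple accumulators typically keep its result closer to the reference.

```julia
function sum_naive(v)
    s = zero(eltype(v))
    for x in v
        s += x               # one accumulator: rounding error compounds
    end
    return s
end

function sum_simd(v)
    s = zero(eltype(v))
    @simd for i in eachindex(v)
        s += v[i]            # compiler may split this across several accumulators
    end
    return s
end

v = fill(0.1f0, 1_000_000)
ref = sum(Float64.(v))       # higher-precision reference
abs(Float64(sum_naive(v)) - ref)
abs(Float64(sum_simd(v)) - ref)
```

The exact error of each version is hardware- and compiler-dependent, which is precisely why the reordering is opt-in.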

The compiler needs your permission because many algorithms are not written naively with respect to floating point rounding. The most common examples are special functions, which are normally written with great care to hit their target accuracy. Sometimes enabling fast math can completely break such an algorithm. For example, from the command line:

```
> julia -E 'sinpi.((1.2, 1.35, 1.5, 1.65, 1.8, 1.95, 2.1))'
(-0.587785252292473, -0.891006524188368, -1.0, -0.891006524188368, -0.587785252292473, -0.156434465040231, 0.3090169943749477)

> julia --math-mode=fast -E 'sinpi.((1.2, 1.35, 1.5, 1.65, 1.8, 1.95, 2.1))'
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
```

Fast math also does more than this: it allows the compiler to assume there are no NaNs and no Infs in the code.
It also allows swapping `inv(sqrt(x))` for an approximation plus Newton-Raphson iterations (which should achieve comparable accuracy).
Here is a description of the various optimizations fast-math enables.

Integer math actually is associative, so no such assumptions are needed to optimize it.
More specifically, Julia won’t change integer code at all under `@fastmath`.


Do you know whether there is a (simple) way to enable only some of these LLVM flags in Julia? For example, it can often be helpful to allow FMAs (MuladdMacro.jl could be used for that) plus a few additional flags.

`llvmcall` is probably the simplest, followed by `LLVM.jl`.

```julia
@inline function add_contract(x::Float64, y::Float64)
    Base.llvmcall("%res = fadd contract double %0, %1\nret double %res",
                  Float64, Tuple{Float64,Float64}, x, y)
end
@inline function mul_contract(x::Float64, y::Float64)
    Base.llvmcall("%res = fmul contract double %0, %1\nret double %res",
                  Float64, Tuple{Float64,Float64}, x, y)
end
```

Note that `llvmcall` is considered expensive by the inliner: it doesn’t know what your `llvmcall` is doing, so it assigns an arbitrary (high) cost.
If you use this, many functions that were inlined before will no longer be inlined.
I get:

```julia
julia> add_contract(1.2, 2.3)
3.5
```

```llvm
define double @julia_add_contract_12886(double %0, double %1) {
top:
  %res.i = fadd contract double %0, %1
  ret double %res.i
}
```

```julia
julia> mymuladd(a, b, c) = add_contract(mul_contract(a, b), c)
mymuladd (generic function with 1 method)
```

```asm
        .text
        vfmadd213sd     xmm0, xmm1, xmm2        # xmm0 = (xmm1 * xmm0) + xmm2
        ret
        nop     word ptr cs:[rax + rax]
```
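For the multiply-add case specifically, it's worth noting that Base already provides `muladd`, which likewise permits (but does not require) contraction into a single FMA instruction, with none of the `llvmcall` inlining-cost downside (`mymuladd_base` is my own illustrative name):

```julia
# muladd(a, b, c) computes a*b + c, and the compiler is free to fuse it
# into one FMA instruction where that is profitable.
mymuladd_base(a, b, c) = muladd(a, b, c)

mymuladd_base(2.0, 3.0, 1.0)  # 7.0
```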

Thanks!


This is great information! Thank you!

I’m curious whether there is a preferred way to convert a tuple to an array for those cases where I need to do so. I’ve been using the following approach when the data types are consistent, but I’m not sure if there’s a better way:

```julia
six_primes = (2, 3, 5, 7, 11, 13)
Int[six_primes...]
```

Splatting small tuples is fine.
A couple alternatives include

```julia
julia> using StaticArrays

julia> six_primes = (2, 3, 5, 7, 11, 13);

julia> SVector(six_primes) # allocation free, but immutable
6-element SVector{6, Int64} with indices SOneTo(6):
  2
  3
  5
  7
 11
 13

julia> MVector(six_primes) # if you want to be able to mutate it
6-element MVector{6, Int64} with indices SOneTo(6):
  2
  3
  5
  7
 11
 13
```
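Beyond splatting, `collect` is the simplest Base-only route to a plain `Vector` (a small sketch of my own, not from the replies above):

```julia
six_primes = (2, 3, 5, 7, 11, 13)

# collect infers the element type from the tuple, giving a Vector{Int64} here.
v = collect(six_primes)
```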

There’s also `StrideArrays.jl`, which is much faster for some things, but it’s less well tested and (compared to `StaticArrays`) a lot of methods are missing.
