Speeding up trig-heavy code

I tried this, but did not see much of a speedup. @fastmath also breaks correctness, unfortunately.

I’m not sure what your application looks like or the best way to apply it, but if you’re bottlenecked by sincos, it should help a lot.

julia> using LoopVectorization, BenchmarkTools

julia> function f!(f, a, b, x)
           @inbounds for i in eachindex(a,b,x)
               a[i], b[i] = f(x[i])
           end
       end
f! (generic function with 1 method)

julia> function fturbo!(f, a, b, x)
           @turbo for i in eachindex(a,b,x)
               a[i], b[i] = f(x[i])
           end
       end
fturbo! (generic function with 1 method)

julia> x = rand(1000); a = similar(x); b = similar(x);

julia> mysincos(x) = (sin(x),cos(x))
mysincos (generic function with 1 method)

julia> @btime f!(sincos, $a, $b, $x);
  7.032 μs (0 allocations: 0 bytes)

julia> @btime f!(mysincos, $a, $b, $x);
  8.959 μs (0 allocations: 0 bytes)

julia> @btime fturbo!(sincos, $a, $b, $x);
  1.032 μs (0 allocations: 0 bytes)

julia> @btime fturbo!(mysincos, $a, $b, $x);
  1.699 μs (0 allocations: 0 bytes)

This was on a 10980XE with AVX512.

What do you mean by “breaks correctness” here? Do you need 1ULP accuracy? There are more accurate SIMD implementations you can use, but LV will use 3ULP implementations by default (same as fastmath).
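If the question is how much accuracy the fast implementations actually give up, a quick way to check is to compare against a BigFloat reference. The `ulp_error` helper below is my own sketch, not a package function:

```julia
# Accuracy sketch: error of a fast sin implementation versus a BigFloat
# reference, measured in ULPs of the correctly-rounded Float64 result.
function ulp_error(fast, ref, x)
    exact = Float64(ref(big(x)))
    return abs(fast(x) - exact) / eps(exact)
end

xs = range(-2π, 2π; length=10_000)  # grid chosen to avoid x == 0 exactly
maxulp = maximum(x -> ulp_error(Base.FastMath.sin_fast, sin, x), xs)
```

The same check applied to a SIMD implementation (e.g. `SLEEFPirates.sin_fast`) would show whether the 3ULP default is acceptable for your use case.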


I think it will be easier to just make a PR and look at the benchmarks and tests to see what happens. I will try to do that soon.

I likewise have some trig-heavy code, so I thought I would experiment with an approach similar to @Elrod's.

From this I conclude that the fast functions in the SLEEF library are marginally faster than the built-in functions (including the @fastmath versions) and are not sped up by LoopVectorization. The examples with the sinc function and the SLEEF library show that LoopVectorization does not speed these functions up.

I am curious as to why this is the case.

julia> function f!(f, y, x)
           @inbounds for i in eachindex(y,x)
                y[i] = f(x[i])
           end
       end
f! (generic function with 2 methods)

julia> function ff!(f, y, x)
           @inbounds @fastmath for i in eachindex(y,x)
                y[i] = f(x[i])
           end
       end
ff! (generic function with 2 methods)

julia> function ft!(f, y, x)
           @turbo for i in eachindex(y,x)
               y[i] = f(x[i])
           end
       end
ft! (generic function with 1 method)

julia> @btime f!(sin, $ys, $xs)
  130.200 μs (0 allocations: 0 bytes)

julia> @btime ff!(sin, $ys, $xs)
  131.700 μs (0 allocations: 0 bytes)

julia> @btime f!(SLEEF.sin, $ys, $xs)
  206.400 μs (0 allocations: 0 bytes)

julia> @btime f!(SLEEF.sin_fast, $ys, $xs)
  125.500 μs (0 allocations: 0 bytes)

julia> @btime ft!(sin, $ys, $xs)
  16.400 μs (0 allocations: 0 bytes)

julia> @btime ft!(SLEEF.sin, $ys, $xs)
┌ Warning: #= REPL[241]:2 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ Main C:\Users\jakez\.julia\packages\LoopVectorization\GKxH5\src\condense_loopset.jl:1166
  203.800 μs (0 allocations: 0 bytes)

julia> @btime ft!(SLEEF.sin_fast, $ys, $xs)
  122.600 μs (0 allocations: 0 bytes)

julia> @btime f!(SLEEFPirates.sin, $ys, $xs)
  47.200 μs (0 allocations: 0 bytes)

julia> @btime f!(SLEEFPirates.sin_fast, $ys, $xs)
  21.800 μs (0 allocations: 0 bytes)

EDIT: Added the SLEEFPirates timings above for comparison, based on the comments in the post below.


You can also try callsite @inline; on a laptop:

julia> using BenchmarkTools, OhMyREPL

julia> using SLEEF

julia> function finline!(f, y, x)
           @inbounds for i in eachindex(y,x)
                y[i] = @inline f(x[i])
           end
       end
finline! (generic function with 1 method)

julia> function f!(f, y, x)
           @inbounds for i in eachindex(y,x)
                y[i] = f(x[i])
           end
       end
f! (generic function with 1 method)

julia> xs=rand(1000);ys=similar(xs);

julia> @btime f!(sin, $ys, $xs)
  3.083 μs (0 allocations: 0 bytes)

julia> @btime f!(Base.FastMath.sin_fast, $ys, $xs)
  3.202 μs (0 allocations: 0 bytes)

julia> @btime f!(SLEEF.sin, $ys, $xs)
  18.633 μs (0 allocations: 0 bytes)

julia> @btime f!(SLEEF.sin_fast, $ys, $xs)
  7.367 μs (0 allocations: 0 bytes)

julia> @btime finline!(sin, $ys, $xs)
  2.406 μs (0 allocations: 0 bytes)

julia> @btime finline!(Base.FastMath.sin_fast, $ys, $xs)
  3.052 μs (0 allocations: 0 bytes)

julia> @btime finline!(SLEEF.sin, $ys, $xs)
  3.136 μs (0 allocations: 0 bytes)

julia> @btime finline!(SLEEF.sin_fast, $ys, $xs)
  913.054 ns (0 allocations: 0 bytes)

This is necessary to get decent/good performance from SLEEF.jl.
SLEEFPirates.jl is similar to SLEEF.jl, but adds @inline to the function definitions directly.
This is what LoopVectorization.jl uses, so performance may be similar:

julia> using LoopVectorization

julia> function ft!(f, y, x)
           @turbo for i in eachindex(y,x)
               y[i] = f(x[i])
           end
       end
ft! (generic function with 1 method)

julia> @btime ft!(sin, $ys, $xs)
  848.103 ns (0 allocations: 0 bytes)

julia> using SLEEFPirates

julia> @btime f!(SLEEFPirates.sin_fast, $ys, $xs)
  976.812 ns (0 allocations: 0 bytes)

Now, trying to extend this performance gain to sinc, I get the following results.

julia> sleefsinc(x) = x == 0 ? 1.0 : SLEEFPirates.sin(π*x)/(π*x)
sleefsinc (generic function with 1 method)

julia> sleefsincfast(x) = x == 0 ? 1.0 : SLEEFPirates.sin_fast(π*x)/(π*x)
sleefsincfast (generic function with 1 method)

julia> @btime f!(sinc, $ys, $xs)
  133.800 μs (0 allocations: 0 bytes)

julia> @btime ff!(sinc, $ys, $xs)  # fastmath version
  133.000 μs (0 allocations: 0 bytes)

julia> @btime ft!(sinc, $ys, $xs)  # @turbo version
  145.100 μs (0 allocations: 0 bytes)

julia> @btime f!(sleefsinc, $ys, $xs)
  210.500 μs (0 allocations: 0 bytes)

julia> @btime f!(sleefsincfast, $ys, $xs)
  127.200 μs (0 allocations: 0 bytes)

I am assuming the test for x == 0 is killing the performance here?
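One way to test that hypothesis is to make the zero check branchless with `ifelse`, which evaluates both arms and selects, so there is no branch to block SIMD. This is my own sketch (assuming `Float64` arrays), not code from any package:

```julia
using LoopVectorization

# Branchless sinc: replace `x == 0 ? 1.0 : sin(πx)/(πx)` with `ifelse`.
# The inner ifelse keeps the division well-defined at t == 0; the outer
# one then overrides the result with sinc(0) == 1.
function sinc_turbo!(y, x)  # assumes Float64 arrays
    @turbo for i in eachindex(y, x)
        t = π * x[i]
        s = sin(t) / ifelse(t == 0.0, 1.0, t)  # avoid 0/0 at t == 0
        y[i] = ifelse(t == 0.0, 1.0, s)        # sinc(0) == 1
    end
end
```

Since both arms are always evaluated, this trades a little extra arithmetic for a vectorizable loop body.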

I would be cautious about the relevance of benchmarking a single function acting on an array in a loop by itself. Ideally, you should be combining as many calculations as possible into a single pass over your data, and that can change the performance tradeoffs considerably.
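For example (a hypothetical kernel, not from this thread), instead of filling `sin`/`cos` arrays in one loop and combining them in a second, you can fuse everything into a single `@turbo` pass:

```julia
using LoopVectorization

# Hypothetical fused kernel: apply a per-element rotation in one pass,
# rather than materializing sin.(θ) and cos.(θ) and combining them later.
function rotate!(xout, yout, x, y, θ)
    @turbo for i in eachindex(xout, yout, x, y, θ)
        s, c = sincos(θ[i])
        xout[i] = c * x[i] - s * y[i]
        yout[i] = s * x[i] + c * y[i]
    end
end
```

Fusing keeps the data in registers between the `sincos` call and the arithmetic that consumes it, so the relative cost of the trig call can look quite different from a trig-only microbenchmark.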
