Any suggestions for speeding this simple scalar function up?

Ahmed_Salih · August 20, 2021, 8:19pm

Hi!

I’ve been playing around calculating this expression:

P = b * ((rho/rho0)^gamma - 1)

Where values for b are shown how to be calculated in the code (see bottom spoiler). Running the different variants I have of the code, the minimum times I get:

Minimum Time EOS1:
Trial(62.999 μs)
Minimum Time EOS2:
Trial(68.600 μs)
Minimum Time EOS3:
Trial(69.401 μs)
Minimum Time EOS1AVX:
Trial(61.799 μs)
Minimum Time EOS1SIMILAR:
Trial(63.399 μs)

Where EOS1 which is written in the most “naive” way, seems to perform the best. For fun and curiosity, would anyone know how to further speed this up?

Thanks for reading.

Code

using BenchmarkTools
using LoopVectorization
rho   = rand(990:rand():1010,10^3)

function EOS1(rho;c0=100,gamma=7,rho0=1000)
    b=(c0^2*rho0)/gamma

    P     = b*((rho/rho0).^gamma .- 1)
end

function EOS1SIMILAR(rho;c0=100,gamma=7,rho0=1000)
    b=(c0^2*rho0)/gamma

    P     = similar(rho)

    P     .= b*((rho/rho0).^gamma .- 1)
end

function EOS2(rho;c0=100,gamma=7,rho0=1000)
    b=(c0^2*rho0)/gamma

    rho0_denominator = 1/(rho0^gamma);

    P     = b*(rho.^gamma * rho0_denominator .- 1)
end

function EOS3(rho;c0=100,gamma=7,rho0=1000)
    b=(c0^2*rho0)/gamma

    rho0_denominator = 1/(rho0^gamma);

    P     = rho.^gamma * rho0_denominator * b .- b
end

bEOS1  = @benchmark EOS1($rho)
println("Minimum Time EOS1:\n",bEOS1)
bEOS2  = @benchmark EOS2($rho)
println("Minimum Time EOS2:\n",bEOS2)
bEOS3  = @benchmark EOS3($rho)
println("Minimum Time EOS3:\n",bEOS3)

bEOS1AVX  = @benchmark @avx EOS1($rho)
println("Minimum Time EOS1AVX:\n",bEOS1AVX)
bEOS1SIMILAR  = @benchmark EOS1SIMILAR($rho)
println("Minimum Time EOS1SIMILAR:\n",bEOS1SIMILAR)

Sukera · August 20, 2021, 8:24pm

Use more dots. In all functions, only rho seems to be a vector, while all other objects are scalars. Dots in the same expression are only fused into a single loop if all operations that are touched are also dotted. If they’re seperated, the broadcast will have to materialize an array first by allocating and creating more loops than necessary.

Another way would be to use @..

E.g.

function EOS4(rho; c0=100, gamma=7, rho0=1000)
    b=(c0^2*rho0)/gamma

    rho0_denominator = 1/(rho0^gamma);

    return rho .^ gamma .* rho0_denominator .* b .- b
end

Ahmed_Salih · August 20, 2021, 8:31pm

Thanks for the suggestion.

When I try on my system with your function:

bEOS4  = @benchmark EOS4($rho)
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     67.399 μs (0.00% GC)
  median time:      69.799 μs (0.00% GC)
  mean time:        70.641 μs (0.00% GC)
  maximum time:     130.999 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Which is still ~10% slower than EOS1. I tried adding more “.” in EOS1 too, but did not affect performance.

Kind regards

lmiq · August 20, 2021, 9:04pm

You seem to have tried some avx variant, but is it there? Here @turbo makes a huge difference:

julia> using LoopVectorization

julia> function EOS1(rho;c0=100,gamma=7,rho0=1000)
           b=(c0^2*rho0)/gamma
           P = b*((rho/rho0).^gamma .- 1)
       end
EOS1 (generic function with 1 method)

julia> function EOS1_turbo(rho;c0=100,gamma=7,rho0=1000)
           b=(c0^2*rho0)/gamma
           P = @turbo b*((rho/rho0).^gamma .- 1)
       end
EOS1_turbo (generic function with 1 method)

julia> EOS1(rho) ≈ EOS1_turbo(rho)
true

julia> @btime EOS1($rho);
  68.043 μs (3 allocations: 23.81 KiB)

julia> @btime EOS1_turbo($rho);
  1.809 μs (3 allocations: 23.81 KiB)

If this will be called from within a hot loop, you probably want to preallocate b.

edit:

If you use @. to have better chances of not forgetting some loop fusion, that gets even better:

julia> function EOS1_turbo(rho;c0=100,gamma=7,rho0=1000)
           b=(c0^2*rho0)/gamma
           P = @turbo @. b*((rho/rho0)^gamma - 1)
       end
EOS1_turbo (generic function with 1 method)

julia> @btime EOS1_turbo($rho);
  701.667 ns (1 allocation: 7.94 KiB)

(the reason one would notice that that was needed is that there were 3 allocations in the previous version, where only P should be a new allocation there. Thus, some . is missing and an intermediate vector is being generated in the previous version, which should be: P = @turbo b .* ((rho ./ rho0).^gamma .- 1).

Ahmed_Salih · August 20, 2021, 9:17pm

Wow, you are indeed right, it has a huge impact! I also agree on using @., a lot less thinking of putting dots and gives better looking equation.

But some questions:

What is a hot loop?

A simple explanation of what @turbo does to make this much faster?

If you have time and want to.

Thanks!

stillyslalom · August 20, 2021, 9:20pm

@turbo has a lot of tricks up its sleeve, but faster integer exponentiation makes the biggest difference:

julia> pow(x, n) = x^n
pow (generic function with 1 method)

julia> @btime Base.map(x -> pow(x, 7), x) setup=(x=rand(1000));
  30.900 μs (1 allocation: 7.94 KiB)

julia> @btime LoopVectorization.vmap(x -> pow(x, 7), x) setup=(x=rand(1000));
  897.872 ns (1 allocation: 7.94 KiB)

lmiq · August 20, 2021, 9:22pm

If you have to call that same function thousand of times inside a loop, you will like to avoid creating a new P every time. Thus, create a variant of the function that receives a preallocated P and updates it.

@turbo Loop-vectorizes the operation, meaning that uses the capacity of the processors to perform more than one operation at once (search for SIMD or loop vectorization in google). When the operations are amenable to that, it is very good (sometimes that happens automatically, even without the @turbo flag, and sometimes you get a good speedup with @simd).

Now, how exactly in this specific case the speedup is that great, I think some expert should kick in. Normally I would expect something of the order of 4x to 8x speedup.

lmiq · August 20, 2021, 9:23pm

Ah, well noticed. Then you should probably take a look at: GitHub - JuliaMath/FastPow.jl: optimal addition-chain exponentiation for Julia

(for some reason I don’t know, it is not affecting the result here:)

julia> function EOS1_fastpow(rho;c0=100,gamma=7,rho0=1000)
           b=(c0^2*rho0)/gamma
           @fastpow P = @. b*((rho/rho0)^gamma - 1)
       end
EOS1_fastpow (generic function with 1 method)

julia> @btime EOS1_fastpow($rho);
  67.417 μs (1 allocation: 7.94 KiB)

(tried moving the macros to see if the result changes, with no success)

Ahmed_Salih · August 20, 2021, 9:37pm

Thanks for the observation @stillyslalom !

I also tried using FastPow and pow(x,n), but unfortunately the first one I could not get to work as you mentioned Leandro and the other one made it slower with @turbo @.

Thanks for the explanation of hot loop too !

Kind regards

CRquantum · August 21, 2021, 8:18am

Sorry for interrupting a stupid question, if I do not use @ turbo trick for those .+ .- .* ./ .^2 etc stuff, will the speed be slow down a lot?
if I do not use these tricks, by default, what is Julia’s optimization level? like equivalent to Fortran’s O2 or something?

Sukera · August 21, 2021, 9:12am

Julia and LLVM both do loop unrolling and SIMD vectorization to a certain extent, but @turbo by default takes more knowledge about the current CPU into account. It’s also much more aggressive and (with wrong use) can be unsafe. How much @turbo will help in that regard is not really possible to pin down to a specific number (it depends on the loop in question) - it’s just that through its aggressive nature, you can usually see a large speedup compared to base julia and its LLVM passes (as far as I know, mostly because @turbo requires independent loop iterations, whereas both LLVM and julia would have to prove that on their own and can’t just assume it).

While julia itself does have “optimization levels” as a command line flag, they pretty much don’t do anything nowadays as far as I’m aware.

DNF · August 21, 2021, 9:13am

You can shave off a bit of extra time (5%-15%) by removing one division:

function EOS2_turbo(ρ; c0=100, γ=7, ρ0=1000)
    b1 = (c0^2 * ρ0) / γ
    b2 = 1 / ρ0^γ
    P = @turbo b2 .* ρ.^γ .- b1
    return P
end

jl> @btime EOS1_turbo($ρ);
  638.608 ns (1 allocation: 7.94 KiB)

jl> @btime EOS2_turbo($ρ);
  601.775 ns (1 allocation: 7.94 KiB)

(Sorry about changing your symbols, but I can no longer stand seeing variable names like gamma )

Edit: There’s an error in EOS2_turbo, but I can’t find what it is. The values are extremely wrong, though.

Update: My suggestion is apparently numerically abhorrent:

jl> ρ = 990.1
990.1

jl> ρ^γ / ρ0^γ
240.65219205640526

jl> (ρ/ρ0)^γ
0.9327245837531053

So don’t use it!

giordano · August 21, 2021, 9:42am

What if gamma and rho0 are floats instead of integers that can overflow?

Ahmed_Salih · August 21, 2021, 9:44am

Haha no problem - I also like to use unicode symbols, just not for posting questions on Discourse generally

Interesting speed up, but weird behaviour. It seems like the division and ^ takes the same precendence, so it goes from left to right:

julia> 990.1^7 / 1000^7
240.65219205640526

julia> (990.1/1000)^7
0.9327245837531053

julia> (990.1^7) / (1000^7)
240.65219205640526

So the error is that it is calculating:

(ρ^γ / ρ0)^γ

But that gives:

(990.1^7/1000)^7
6.14149305499489e125

So I am a bit my confused self now too

lmiq · August 21, 2021, 9:45am

In this particular case I would bet that turbo is much more agressive than any Fortran compiler. You can try it. AFAIK you can get more agressive loop optimizations with --march=native in Fortran. (you have asked that in the Fortran forum, by the way)

Ahmed_Salih · August 21, 2021, 9:56am

rho0 can be a float value, I have changed it to 1000.0

gamma can too in theory, but most often in my cases it would just be “7”. I found using “7” instead of “7.0” sped up the power calculation.

Kind regards

DNF · August 21, 2021, 10:32am

It’s not about precedence, but a numerical issue. It works ok up to 6, and fails for 7.

DNF · August 21, 2021, 10:35am

Probably integer overflow for 1000^7

Sukera · August 21, 2021, 10:39am

Indeed:

julia> Base.Checked.checked_mul(1000^6, 1000)                             
ERROR: OverflowError: 1000000000000000000 * 1000 overflowed for type Int64
Stacktrace:                                                               
 [1] throw_overflowerr_binaryop(op::Symbol, x::Int64, y::Int64)           
   @ Base.Checked ./checked.jl:154                                        
 [2] checked_mul(x::Int64, y::Int64)                                      
   @ Base.Checked ./checked.jl:288                                        
 [3] top-level scope                                                      
   @ REPL[14]:1

Ahmed_Salih · August 21, 2021, 10:41am

And @DNF ,does that mean I should always use floats for this? I.e. 1000.0^7?

Or how to interpret this result.

EDIT: Actually it is not as such a problem for my equation, since it just does (rho/rho0)^7, so no overflow. But it means I cannot get the speed up from calculating 1/rho0^7 first I suppose.

Kind regards