Optimising superbee! function

alemelis · April 5, 2024, 11:56am

Hello, I’m profiling my simulation with profview and this is the flame graph

where I’m pointing to the four function calls that seem to be among the most expensive. Those are calling the same function – superbee! – in different parts of the simulation. The function reads

maxmod(a::Float64, b::Float64) = 0.5*(sign(a) + sign(b))*max(abs(a), abs(b))
minmod(a::Float64, b::Float64) = 0.5*(sign(a) + sign(b))*min(abs(a), abs(b))

function superbee!(s::T, a::T, b::T, h::Float64) where {T>:Vector{Float64}}
    for i=eachindex(s)
        @inbounds s[i] = maxmod(minmod(a[i], 2.0*b[i]), minmod(2.0*a[i],b[i]))*h
    end
end

where all the input vectors are pre-allocated and usually of size ~100 (the size is know only at runtime; it can get larger, up to 500). Benchmarking this I get

julia> n=100;s=rand(n);a=rand(n);b=rand(n);h=rand();
julia> @benchmark superbee!($s, $a, $b, $h)
BenchmarkTools.Trial: 10000 samples with 451 evaluations.
 Range (min … max):  227.541 ns … 716.091 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     229.091 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   242.514 ns ±  37.856 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂▁  ▂ ▁                                                      ▁
  ████▇█▇██▇█▇▇███▇▇▆▆▅▆▆▇▇▇▆▆▆▇▆▆▆▇▆▇▇▇▇▇▇▇▇▆▇▆▅▆▅▆▄▅▅▅▅▅▃▄▄▅▄ █
  228 ns        Histogram: log(frequency) by time        403 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Nice, no allocs!

(Btw, I also tried to simply use broadcasting)

function superbeeb!(s::T, a::T, b::T, h::Float64) where {T>:Vector{Float64}}
    @. s = maxmod(minmod(a, 2.0 *b), minmod(2.0 *a, b)) * h
end

but this is slower than the for-loop one

julia> @benchmark superbeeb!($s, $a, $b, $h)
BenchmarkTools.Trial: 10000 samples with 290 evaluations.
 Range (min … max):  282.717 ns … 877.279 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     284.676 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   294.799 ns ±  35.806 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▂▁  ▂  ▁                                                    ▁
  ███████▇▇█▆▆▇▇▇▇▇▇▇▇▅▆▆▆▆▅▆▅▅▆▅▆▆▅▅▅▅▄▅▆▆▅▅▅▄▄▅▅▅▅▅▄▅▅▅▄▅▄▄▄▄ █
  283 ns        Histogram: log(frequency) by time        464 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Then I noticed that most of the time is spent calling the sign function as we have minmod twice.
Screenshot 2024-04-05 at 12.45.40

However sign doesn’t care about the difference between a and 2.0*a, so I refactored as

function superbee_refactor!(s::T, a::T, b::T, h::Float64) where {T>:Vector{Float64}}
    signab = 0.0
    absa = 0.0
    absb =  0.0
    minmoda2b = 0.0
    minmod2ab = 0.0
    for i = eachindex(s)
        @inbounds signab = 0.5 * (sign(a[i]) + sign(b[i]))

        @inbounds absa = abs(a[i])
        @inbounds absb = abs(b[i])
        minmoda2b = signab * min(absa, 2.0*absb)
        minmod2ab = signab * min(2.0*absa, absb)

        @inbounds s[i] = maxmod(signab*min(absa,2.0*absb),signab*min(2.0*absa,absb))*h
    end
end

resulting in

julia> @benchmark superbee_refactor!($s, $a, $b, $h)
BenchmarkTools.Trial: 10000 samples with 661 evaluations.
 Range (min … max):  182.625 ns … 529.581 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     185.082 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   191.468 ns ±  22.264 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▃▁   ▁  ▁                                                   ▁
  █████████▇███▇███▇▇▇▇▆▆▆▆▆▆▆▅▆▆▅▅▅▆▆▆▅▆▅▅▅▆▄▄▄▆▅▅▅▄▄▃▅▅▅▅▄▄▃▅ █
  183 ns        Histogram: log(frequency) by time        301 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

This required a lot of pre-allocations but it seems to be worth the effort.

Where to go now? I tried @simd but no luck. Multi-threading as well is not helpful with such a small number of elements in the vectors.

Would StaticArrays help here? Even though I may have >100 elements at times?

JM_Beckers · April 5, 2024, 12:28pm

function superbee_refactor!(s::T, a::T, b::T, h::Float64) where {T>:Vector{Float64}}
    signab = 0.0
    absa = 0.0
    absb =  0.0
    minmoda2b = 0.0
    minmod2ab = 0.0
    for i = eachindex(s)
        @inbounds s[i]=0
        
        if a[i]>0
            if b[i]>0
               s[i]=max(min(a[i],2.0*b[i]),min(2.0*a[i],b[i]))*h
            end
        else
            if b[i]<0
               s[i]=-max(min(-a[i],-2.0*b[i]),min(-2.0*a[i],-b[i]))*h  
            end
        end
        
    end
end

Brings
Range (min … max): 197.931 ns … 493.793 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 212.241 ns ┊ GC (median): 0.00%
Time (mean ± σ): 228.319 ns ± 50.704 ns ┊ GC (mean ± σ): 0.00% ± 0.00%

▄▂█▄▇▅▃▅▁ ▁ ▂▂▁▁▁▁▁ ▁
███████████▇▇▇▇▇▇▇▇▆▆▆▅▆▅▅▄▄▆▄▆▅▄▄▄▅▅▄▂▃▄▅▄▄▅▆▇▅███▇███████▇█ █
198 ns Histogram: log(frequency) by time 394 ns <

down from the original superbee! function

BenchmarkTools.Trial: 10000 samples with 346 evaluations.
Range (min … max): 268.208 ns … 1.861 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 295.087 ns ┊ GC (median): 0.00%
Time (mean ± σ): 348.997 ns ± 84.694 ns ┊ GC (mean ± σ): 0.00% ± 0.00%

▅ ██▅▇▆▅▄▄▃▂▂▁ ▂ ▁▁ ▁▅▇▅▄▃▁▄▅▅▃▃ ▃
█▄████████████████▇▇▇█▇▇▇▇▇▇▆█▇█████▇█▆▆▇████████████████▇▆▇ █
268 ns Histogram: log(frequency) by time 498 ns <

Ahmed_Salih · April 5, 2024, 1:07pm

I had the same misconception, so I just wanted to clarify this for you. Using StaticArrays is completely fine for this! You are right that you have 100 elements but you do not have an element with 100 fields.

Therefore you can (and in my opinion should) use StaticArrays, since as far as I can see you would have to use an SVector{Dimensions,FloatSomething}

Kind regards

gdalle · April 5, 2024, 1:18pm

I don’t think StaticArrays will help here because the bottleneck are mathematical operations

maxfreu · April 5, 2024, 1:21pm

minmoda2b and minmod2ab are unused in your refactoring, if I see that correctly. So leaving them out would speed things up, if the compiler didn’t do so already. Also you can do @inbounds for ... and get rid of the individual macro calls. But that’s just convenience. Plus I think @inbounds should be inferable by the compiler, as you’re using eachindex. Apart from that… you don’t need to zero signab etc, because they are defined in the loop. You don’t have to “preallocate” single numbers. And you can add @inline in front of maxmod, maybe that helps just a little.

abraemer · April 5, 2024, 1:22pm

Here is a faster version based on @JM_Beckers version. I removed the branches and put @fastmath to enable SIMD, put all the arrays into eachindex such that @inbounds is inferred automatically and took care that nothing is read twice. This give approx. a factor 3 in speed on my machine

function superbee_refactor2!(s::T, a::T, b::T, h::Float64) where {T>:Vector{Float64}}
    @fastmath for i = eachindex(s,a,b)
        # @inbounds is inferred automatically - yay for safety AND speed!
        ai = a[i]
        bi = b[i]
        t1 = max(min(ai,2.0*bi),min(2.0*ai,bi))
        t2 = -max(min(-ai,-2.0*bi),min(-2.0*ai,-bi))
        s[i] = ifelse(ai>0 && bi>0, t1, ifelse(bi<0, t2, zero(eltype(T))))*h
    end
end

Timings:

julia> @benchmark superbee_refactor!($s,$a,$b,$h)
BenchmarkTools.Trial: 10000 samples with 870 evaluations.
 Range (min … max):  135.189 ns … 215.948 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     137.115 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   138.965 ns ±   5.731 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂▆█▇▅▃▁▂▂▁▂▃▂▁ ▁▁                                            ▂
  ▇███████████████████████████▇▇▇▆▇▇▇▇▇▇▆▇▆▆▆▆▆▄▅▅▅▅▄▅▅▄▄▅▃▄▃▂▅ █
  135 ns        Histogram: log(frequency) by time        164 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark superbee_refactor2!($s,$a,$b,$h)
BenchmarkTools.Trial: 10000 samples with 990 evaluations.
 Range (min … max):  42.823 ns … 114.853 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.516 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.762 ns ±   5.145 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▂▆█▇▅▃▁                                   ▁▁               ▁
  ▄█████████▇▇▇▇██▇▅▅▅▃▂▃▂▂▂▄▄▅▅▅▄▅▅▅▅▃▅▅▅▆▇▇███▇█▇▅▆▆▆▄▆▆▅▅▆▅ █
  42.8 ns       Histogram: log(frequency) by time      63.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

EDIT: Moved around the multiplication with h to gain another 2ns speedup

Ahmed_Salih · April 5, 2024, 1:34pm

Out of my own curiosity, I notice that 2.0*ai is calculated twice with opposite signs. Does Julia automatically realize this too, or would calculate it once, then switching signs be faster?

Oscar_Smith · April 5, 2024, 1:38pm

it probably would figure this out (since multiplication is simple), but I wouldn’t count on it.

abraemer · April 5, 2024, 1:40pm

I don’t think Julia notices this typically. This is called “common subexpression elimination” (CSE) and per this thread Julia typically does not perform it. In this case maybe because if you check the assembly then everything is fully inlined and LLVM does CSE so there is a chance. I am not able to decipher the assembly well enough to see whether the multiplications are performed twice.

abraemer · April 5, 2024, 1:42pm

Actually one can optimize this a bit further by using the identity -max(a,b) == min(-a,-b):

function superbee_refactor3!(s::T, a::T, b::T, h::Float64) where {T>:Vector{Float64}}
    @fastmath for i = eachindex(s,a,b)
        # @inbounds is inferred automatically - yay for safety AND speed!
        ai = a[i]
        bi = b[i]
        t1 = max(min(ai,2bi),min(2ai,bi))
        #t2 = -max(min(-ai,-2bi),min(-2ai,-bi))
        t2 = min(max(ai,2bi),max(2ai,bi))
        s[i] = ifelse(ai>0 && bi>0, t1, ifelse(bi<0, t2, zero(eltype(T))))*h
    end
end

This shaves off another ~8ns (~20%) for me:

julia> @benchmark superbee_refactor3!($s,$a,$b,$h)
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  34.887 ns … 55.143 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.574 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.939 ns ±  1.519 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁ ▂▅▄█▇▅▇▁▂▁ ▁ ▁▂ ▁  ▂ ▂  ▁                              ▂
  ▅▅▄█▇██████████▅█▇██▅███████▇█▆▆▆▃▆▆▅▆▄▆▆▃▅▃▅▅▁▃▃▄▃▄▁▃▅▄▄▅▄ █
  34.9 ns      Histogram: log(frequency) by time      45.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

EDIT: I also tried to use the minmax function but this breaks something and it is a lot slower. ~~I am not sure what the difference is as the assembly looks very similar superficially.~~ This can for some reason not use AVX instructions (256-bit) and only uses SSE instructions (128-bit).

function superbee_refactor4!(s::T, a::T, b::T, h::Float64) where {T>:Vector{Float64}}
    @fastmath @simd for i = eachindex(s,a,b)
        # @inbounds is inferred automatically - yay for safety AND speed!
        ai = a[i]
        bi = b[i]
        # t1 = max(min(ai,2.0*bi),min(2.0*ai,bi))*h
        # t2 = min(max(ai,2.0*bi),max(2.0*ai,bi))*h
        m1,m2 = minmax(ai,2.0*bi)
        m3,m4 = minmax(2.0*ai,2.0)
        t1 = max(m1,m3)
        t2 = min(m2,m4)
        s[i] = ifelse(ai>0 && bi>0, t1, ifelse(bi<0, t2, zero(eltype(T))))*h
    end
end

julia> @benchmark superbee_refactor4!($s,$a,$b,$h)
BenchmarkTools.Trial: 10000 samples with 803 evaluations.
 Range (min … max):  154.908 ns … 518.042 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     156.822 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   158.300 ns ±  11.322 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂▂▆▇█▆▅▃▁▂▁▁▁ ▂▂▃▃▁                                          ▂
  █████████████████████▆▆▇▆▆▆▄▆▄▅▅▅▇▆▅▅▆▇▆▇▆▇▇▆▇▇▆▇▆▆▇▆▇▆▄▆▆▆▁▆ █
  155 ns        Histogram: log(frequency) by time        177 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

abraemer · April 5, 2024, 2:01pm

I stared at the assembly some more and I think it does the multiplication twice in this case - but in different ways funnily enough ^^
The 2ai become essentially ai+ai, while the -2ai are actually executed as multiplication (-2)*ai.

In the newest version superbee_refactor3!, the 2ai and 2bi are computed just once. So CSE eliminated that at least

alemelis · April 5, 2024, 2:13pm

wow!

one thing tho, not sure this is correct

s[i] = ifelse(ai>0 && bi>0, t1, ifelse(bi<0, t2, zero(eltype(T))))*h

shouldn’t be

s[i] = ifelse(ai>0, ifelse(bi>0, t1, zero(eltype(T))), ifelse(bi<0, t2, zero(eltype(T))))*h

instead?

Perhaps

s[i] = ifelse(ai>0<bi, t1, ifelse(ai<=0>bi, t2, zero(eltype(T))))*h

abraemer · April 5, 2024, 2:44pm

You are right there is a difference and your second version is more correct.

Now that I think about it: Can it happen that one of ai or bi is exactly zero? Because the sign function returns 0 in that case. We should check that the result is correct in these cases as well.

alemelis · April 5, 2024, 2:45pm

Also from the @fastmath docs

This allows
the fastest possible operation, but results are undefined – be careful when doing this, as it may change numerical results.

can you explain what undefined means? Is the calculation going to be non-deterministic somehow?

alemelis · April 5, 2024, 2:47pm

yes, it can happen as a and b are numerical derivatives of functions with at least one local maximum

abraemer · April 5, 2024, 2:48pm

You can see what @fastmath does if you follow the link in the docstring:
https://llvm.org/docs/LangRef.html#fast-math-flags

So in general one should use it with care, but I think in this case we should be fine as we are not relying on fragile floating cancellations or unstable calculations and the like.

abraemer · April 5, 2024, 2:57pm

I think if either of ai or bi is zero then then the result should be zero in your original code and this is actually not the case in my code. But it is easy to fix, because if ai ==0 then t1 is min(0,bi) which is zero if bi>0. The case bi==0 does take neither branch and returns always 0 and is thus fine. The following should hopefully be correct:

s[i] = ifelse(ai>=0, ifelse(bi>0, t1, zero(eltype(T))), ifelse(bi<0, t2, zero(eltype(T))))*h

Dan · April 5, 2024, 3:00pm

This looks strange. Is this a typo?

abraemer · April 5, 2024, 3:07pm

Interesting ^^ I never typed this and just copied it from @JM_Beckers who copied it from OP.
It means T needs to be a supertype of Vector{Float64} which does not seem very useful. Essentially the only concrete type matching is Vector{Float64}. Personally I would just scrap all the type annotation here.

@alemelis I cross checked with your original implementation and it seems consistent now:

julia> res_base = let s = zeros(15), a = [0.0,0,0,1,1,1,-1,-1,-1,2,2,2,1,2,3], b = [0.0,-1,1,0,-1,1,0,-1,1,1,2,3,2,2,2]
       superbee!(s,a,b,1.0)
       s
       end
15-element Vector{Float64}:
  0.0
 -0.0
  0.0
  0.0
  0.0
  1.0
 -0.0
 -1.0
  0.0
  2.0
  2.0
  3.0
  2.0
  2.0
  3.0

julia> res3 = let s = zeros(15), a = [0.0,0,0,1,1,1,-1,-1,-1,2,2,2,1,2,3], b = [0.0,-1,1,0,-1,1,0,-1,1,1,2,3,2,2,2]
       superbee!(s,a,b,1.0)
       s
       end;

julia> res_base == res3
true

alemelis · April 5, 2024, 3:27pm

nice, we squeezed those 4 peaks

before

after
Screenshot 2024-04-05 at 16.16.32

now the bulk of them is a call to simdloop.jl, macro expansion: line 77. But I’m happy with it, there are different parts to tackle now

Topic		Replies	Views
Maximum!(v, A) is slower than own implementation? Performance	2	650	March 5, 2018
Rolling maxima Performance	24	1635	February 2, 2020
Large performance regression of vectorized functions on master General Usage	4	609	July 25, 2018
optimising for loop performance Performance	14	3005	July 8, 2018
A problem about performance Performance	30	710	October 3, 2022

Optimising superbee! function

Related topics