Multiplying by booleans faster than if/else

I have been using @btime to check the speed of various little functions I’m working on that involve conditionals. I’ve found that multiplying by booleans instead of using if/else statements seems to be much faster. But I’m puzzled as to why that’s the case and I’m hoping somebody might be able to explain it. Multiplying booleans works okay for simple situations (e.g., at most two nested if/else) but for more complex situations, readability really suffers so I’d be interested to learn a way to make if/else as fast if there is one.

Below is a test case to show my findings:

function f1(x::Array{Float64,1},B::Array{Bool,1})
    n=length(x)
    y=0
    for i=1:n
        if B[i]==true
            y = x[i]^2
        else
            y = 0
        end
    end
    return y
end

function f2(x::Array{Float64,1},B::Array{Bool,1})
    n=length(x)
    y=0
    for i=1:n
        y = B[i]*x[i]^2 + (1 - B[i])*0
    end      
    return y
end

function f3(x::Array{Float64,1},B::Array{Bool,1})
    n=length(x)
    y=0
    for i=1:n
        y = B[i] == true ? x[i]^2 : 0
    end   
    return y
end

x = rand(10000)
B = rand(Bool,10000)

@btime f1($x, $B)
@btime f2($x, $B)
@btime f3($x, $B)

results:

  31.100 μs (0 allocations: 0 bytes)
  4.757 μs (0 allocations: 0 bytes)
  29.500 μs (0 allocations: 0 bytes)

A slightly strange benchmark, but if you use ifelse you can remove the branch and get the same speed as multiplying by the bools: ifelse(B[i], x[i]^2, 0).

1 Like

Avoiding the branch allows the compiler to use SIMD intrinsics that process multiple elements of the input arrays at the same time. The conventional syntax is something like y = ifelse(B[i], x[i]^2, 0.0), which will evaluate both of the branches simultaneously. For straight-line code like this, it’s also worth throwing an @inbounds on the loop or trying LoopVectorization:

function f4(x::Array{T,1},B) where {T}
  n=length(x)
  y=zero(T)
  @avxt for i=1:n
      y = ifelse(B[i], x[i]^2, zero(T))
  end      
  return y
end
julia> @btime f4($x, $B)
  369.307 ns (0 allocations: 0 bytes)
0.0
1 Like

Note that all of these functions only return the last iteration. It seems that this affects the benchmarks, as replacing y = with y += changes the times (and the results!). Initialising y = 0.0 to be type-stable also affects things (3rd line of each):

julia> @btime f1($x, $B)
  7.083 μs (0 allocations: 0 bytes)  # as above
0
  48.167 μs (0 allocations: 0 bytes)  # with y+=... (f1 could drop  else y += 0  clause)
1638.8987305127796
  9.250 μs (0 allocations: 0 bytes)  # with y = 0.0 init, still y += 0
1638.8987305127796

julia> @btime f2($x, $B)  # (f2 could drop  + (1 - B[i])*0  term)
  4.708 μs (0 allocations: 0 bytes)
0.0
  56.125 μs (0 allocations: 0 bytes)
1638.8987305127796
  9.250 μs (0 allocations: 0 bytes)  # with y = 0.0 init
1638.8987305127796

julia> @btime f3($x, $B)
  7.625 μs (0 allocations: 0 bytes)
0
  40.459 μs (0 allocations: 0 bytes)
1638.8987305127796
  10.875 μs (0 allocations: 0 bytes)  # with y = 0.0 init, still else 0
1638.8987305127796

julia> @btime f4($x, $B)  # LoopVectorization version
  404.790 ns (0 allocations: 0 bytes)
0.0
  2.648 μs (0 allocations: 0 bytes)
1638.8987305127803

julia> @btime @tullio y := ifelse($B[i], $x[i]^2, 0.0)  avx=false threads=false
  4.649 μs (0 allocations: 0 bytes)
1638.898730512781

julia> @btime @tullio y := $B[i] * $x[i]^2  # avx=true
  2.662 μs (1 allocation: 16 bytes)
1638.8987305127807
1 Like

No idea if this matters for speed, but it leaps out at me because my students keep doing it[^1]: since B[i] is already a boolean, you don’t need ==true, you can just do if B[i]

[^1]: also because in one of my only PRs to the Julia language, I did basically the same thing and was gently educated by Jeff… Still embarrassed about that

6 Likes

Thanks for all the replies. It’s good to understand that the source of the speed differences essentially boils down to parallelization versus sequential computing.

I think this would mean that if one branch of the if/else statement is particularly computationally burdensome and seldom used then if/else might be faster than ifelse. Is that correct?

Also appreciate the tips on how to parallelize loops using LoopVectorization. That will come in handy for sure.

I read in another post that it is better to implement parallelization in top level functions rather than functions that may be several layers down the stack. So I didn’t make any attempts at gaining speed through parallelization with the functions I’m working on now because they will be helpers.

I would appreciate any thoughts people have about the question of where parallelization should be implemented in a complex project if not everywhere.

Thanks for pointing that out!

1 Like

Yes, that’s right (or if the branches are otherwise complicated or beefy enough to make SIMD difficult).

It’s hard to make any blanket statements about the appropriate level at which to apply parallelism - it’s very computation-dependent (although Julia’s thread-launch overhead does favor applying multithreading closer to the top level). In almost every case, you’re best served by writing straightforward single-threaded code, figuring out where the hotspots are with something like ProfileView.jl or TimerOutputs.jl, then going back to optimize hot spots once you know where they are.

The type of parallelization happening here is simd (single instruction, multiple data) and has to happen on a ‘low level’.

This is different from multithreading, which should generally happen ‘higher up’. Note, however, that Julia’s threading model is composable, so you can multithread on multiple levels, and it all works together. (And note, again, that LoopVectorization’s multithreading does not yet compose with Julia native threads).

1 Like

After squinting at this for a while: what’s going on here? Why not just

y = B[i]*x[i]^2

?

After squinting at this for a while: what’s going on here? Why not just

y = B[i]*x[i]^2

Answer: because x[i]^2 can take a long time to compute and you only want to compute it if B[i] is equal to true.

If that’s the case, you want branching. 0*slowfunction() will take as long as 1*slowfunction()

And either way, B[i]*0 does nothing.

In this case, x^2 should lower to x * x, which takes very little time, and can be easily vectorized. At that point, the calculation is almost entirely dependent on memory bandwidth/throughput – the fastest solutions are already chewing through 2-4 elements per nanosecond.

No, I’m talking about this part:

(1 - B[i])*0

which doesn’t seem to have any purpose.

1 Like

To be clear, the code I shared is not actually a piece of work I’m trying to accomplish. It was meant only to illustrate the speed differences. So for sure the (1-B[i])*0 is totally unnecessary but I put it in there because without it, the code is not really doing the same thing as the branching if/else version. i.e., it’s not doing an allocation when the boolean evaluates to false.

I guess it probably isn’t good form to post code that isn’t “real” in the sense of something I’m actually trying to do. But I just wanted to understand the theory behind the speed differences a bit better so I could hopefully learn about some generally applicable principles.

For reference this is called branchless programming.

1 Like

Note that the if/else version also has an unnecessary step, in that adding zero to y in an else clause could just be dropped. Assuming always that you meant to add – otherwise doing just the last iteration will be much faster, of course:

julia> f5(x::Array{T,1},B) where {T} = ifelse(B[end], x[end]^2, zero(T))

julia> @btime f5($x, $B)
  1.333 ns (0 allocations: 0 bytes)
0.0

Also, note that the compiler is fairly often able to turn if else into branchless ifelse if the branches are simple enough. That seems to be what happens here – see my times, once you solve type stability f1 and f2 give identical times, f3 probably within noise too.

1 Like

This doesn’t make it any clearer. Neither version allocates anything. I would say they are more comparable if you remove that unnecessary bit.

1 Like

So I went down a bit of an internet rabbit hole last night but there are some interesting insights I learned about that I wanted to share. I’m new to both Julia and coding, so please correct me if I have misunderstood anything.

First, my original code did not include the @inbounds macro, so it seems that none of the functions would have been able to make use of SIMD due to the branch associated with bounds checking. This, I think, means that the speed differences cannot be attributed to use (or not) of SIMD between the different functions.

I found several articles however (IMHO one of the best ones is here) noting that branching is inherently costly on modern CPUs due to pipelining. As I understand it, even without using SIMD, pipelined CPUs will compute different stages of multiple instructions during a single clock cycle. When there’s a branching command, a CPU with branch prediction (as in my core i7) will guess which branch is likely to be taken in order to keep the pipeline full and begin working on instructions associated with the predicted branch. If it guesses wrong, there’s a speed cost because the instruction pipeline needs to be flushed and reloaded.

By avoiding the branching, multiplying by booleans avoids the potential for branch prediction errors in the CPU. In addition, my code probably made branch prediction errors more likely by randomly generating the booleans.

Altogether, I think I’ll implement the following practices going forward:

  1. For branches that involve simple instructions, it may be more costly to avoid executing the “wrong” instructions through branching if/else statements because of a) precluding SIMD; and b) increasing the risk of CPU branch prediction errors. So, for these situations, use ifelse() and @inbounds to allow SIMD to work.

  2. For more complex instructions, test both ways to see which is actually faster, since avoiding costly branches could overcome the the tradeoff associated with no SIMD and branch prediction errors.

  3. Freely implement parallelism using @threads or @spawn at any level on account of Julia’s composable parallelism.

Thanks everyone for the input. This has been a very insightful discussion. Would of course appreciate any other helpful practices to add to my list.

note that the compiler is fairly often able to turn if else into branchless ifelse if the branches are simple enough.

This is an important point too. I guess that would be revealed only by inspecting the low-level code right? How does the compiler decide whether branchless is more efficient than branched?