I have been using @btime to check the speed of various little functions I’m working on that involve conditionals. I’ve found that multiplying by booleans instead of using if/else statements seems to be much faster, but I’m puzzled as to why, and I’m hoping somebody might be able to explain it. Multiplying booleans works okay for simple situations (e.g., at most two nested if/else), but for more complex situations readability really suffers, so I’d be interested to learn a way to make if/else as fast, if there is one.
Below is a test case to show my findings:
function f1(x::Array{Float64,1}, B::Array{Bool,1})
    n = length(x)
    y = 0
    for i = 1:n
        if B[i] == true
            y = x[i]^2
        else
            y = 0
        end
    end
    return y
end
function f2(x::Array{Float64,1}, B::Array{Bool,1})
    n = length(x)
    y = 0
    for i = 1:n
        y = B[i]*x[i]^2 + (1 - B[i])*0
    end
    return y
end
function f3(x::Array{Float64,1}, B::Array{Bool,1})
    n = length(x)
    y = 0
    for i = 1:n
        y = B[i] == true ? x[i]^2 : 0
    end
    return y
end
using BenchmarkTools  # provides @btime

x = rand(10000)
B = rand(Bool, 10000)
@btime f1($x, $B)
@btime f2($x, $B)
@btime f3($x, $B)
A slightly strange benchmark, but if you use ifelse you can remove the branch and get the same speed as multiplying by the bools: ifelse(B[i], x[i]^2, 0).
Avoiding the branch allows the compiler to use SIMD instructions that process multiple elements of the input arrays at the same time. The conventional syntax is something like y = ifelse(B[i], x[i]^2, 0.0); unlike ?, ifelse is an ordinary function, so it evaluates both arguments unconditionally and there is no branch left to predict. For straight-line code like this, it’s also worth throwing an @inbounds on the loop or trying LoopVectorization:
using LoopVectorization  # provides @avxt (renamed @tturbo in newer versions)

function f4(x::Array{T,1}, B) where {T}
    n = length(x)
    y = zero(T)
    @avxt for i = 1:n
        y = ifelse(B[i], x[i]^2, zero(T))
    end
    return y
end
Note that all of these functions only return the result of the last iteration. This seems to affect the benchmarks: replacing y = with y += changes the times (and the results!). Initialising y = 0.0 so the accumulator is type-stable also matters (3rd line of each):
julia> @btime f1($x, $B)
7.083 μs (0 allocations: 0 bytes) # as above
0
48.167 μs (0 allocations: 0 bytes) # with y+=... (f1 could drop else y += 0 clause)
1638.8987305127796
9.250 μs (0 allocations: 0 bytes) # with y = 0.0 init, still y += 0
1638.8987305127796
julia> @btime f2($x, $B) # (f2 could drop + (1 - B[i])*0 term)
4.708 μs (0 allocations: 0 bytes)
0.0
56.125 μs (0 allocations: 0 bytes)
1638.8987305127796
9.250 μs (0 allocations: 0 bytes) # with y = 0.0 init
1638.8987305127796
julia> @btime f3($x, $B)
7.625 μs (0 allocations: 0 bytes)
0
40.459 μs (0 allocations: 0 bytes)
1638.8987305127796
10.875 μs (0 allocations: 0 bytes) # with y = 0.0 init, still else 0
1638.8987305127796
julia> @btime f4($x, $B) # LoopVectorization version
404.790 ns (0 allocations: 0 bytes)
0.0
2.648 μs (0 allocations: 0 bytes)
1638.8987305127803
julia> @btime @tullio y := ifelse($B[i], $x[i]^2, 0.0) avx=false threads=false
4.649 μs (0 allocations: 0 bytes)
1638.898730512781
julia> @btime @tullio y := $B[i] * $x[i]^2 # avx=true
2.662 μs (1 allocation: 16 bytes)
1638.8987305127807
No idea if this matters for speed, but it leaps out at me because my students keep doing it[1]: since B[i] is already a boolean, you don’t need ==true, you can just do if B[i]
[1] also because in one of my only PRs to the Julia language, I did basically the same thing and was gently educated by Jeff… Still embarrassed about that
Thanks for all the replies. It’s good to understand that the source of the speed differences essentially boils down to parallelization versus sequential computing.
I think this would mean that if one branch of the if/else statement is particularly computationally burdensome and seldom used then if/else might be faster than ifelse. Is that correct?
Also appreciate the tips on how to parallelize loops using LoopVectorization. That will come in handy for sure.
I read in another post that it is better to implement parallelization in top level functions rather than functions that may be several layers down the stack. So I didn’t make any attempts at gaining speed through parallelization with the functions I’m working on now because they will be helpers.
I would appreciate any thoughts people have about the question of where parallelization should be implemented in a complex project if not everywhere.
Yes, that’s right (or if the branches are otherwise complicated or beefy enough to make SIMD difficult).
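The key semantic difference to keep in mind is that ifelse is an ordinary function, so both value arguments are computed regardless of which one is selected. A toy sketch (expensive is just a made-up stand-in for costly work):

expensive(x) = sum(sin, x .* (1:1000))            # made-up stand-in for a costly branch

with_branch(b, x) = b ? x^2 : expensive(x)        # skips expensive(x) entirely when b is true
with_ifelse(b, x) = ifelse(b, x^2, expensive(x))  # pays for expensive(x) on every call

If b is almost always true, with_branch wins easily despite the branch.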
It’s hard to make any blanket statements about the appropriate level at which to apply parallelism; it’s very computation-dependent (although Julia’s thread-launch overhead does favor applying multithreading closer to the top level). In almost every case, you’re best served by writing straightforward single-threaded code, figuring out where the hot spots are with something like ProfileView.jl or TimerOutputs.jl, and then going back to optimize them once you know where they are.
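For example, a minimal TimerOutputs.jl sketch (run_analysis and the section labels are made up for illustration):

using TimerOutputs

const to = TimerOutput()

function run_analysis(x, B)               # hypothetical top-level function
    y = @timeit to "masked sum" f4(x, B)  # label each candidate hot spot
    z = @timeit to "sorting" sort(x)
    return y, z
end

run_analysis(x, B)
show(to)   # prints a table of time and allocations per labelled section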
The type of parallelization happening here is SIMD (single instruction, multiple data), and it has to happen at a ‘low level’.
This is different from multithreading, which should generally happen ‘higher up’. Note, however, that Julia’s threading model is composable, so you can multithread on multiple levels, and it all works together. (And note, again, that LoopVectorization’s multithreading does not yet compose with Julia native threads).
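A small sketch of what that composability looks like (function names are made up):

using Base.Threads

inner(chunk) = fetch(@spawn sum(abs2, chunk))  # spawns a task from inside threaded code

function outer(chunks)
    acc = zeros(length(chunks))
    @threads for i in eachindex(chunks)        # multithread at the top level
        acc[i] = inner(chunks[i])              # nested tasks are fine: the scheduler
    end                                        # balances them all together
    return sum(acc)
end

outer([rand(1000) for _ in 1:16])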
In this case, x^2 should lower to x * x, which takes very little time, and can be easily vectorized. At that point, the calculation is almost entirely dependent on memory bandwidth/throughput – the fastest solutions are already chewing through 2-4 elements per nanosecond.
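You can check that lowering yourself; for a Float64 argument the LLVM body boils down to a single multiply (a sketch, output trimmed):

julia> square(x) = x^2;

julia> @code_llvm debuginfo=:none square(2.0)
;  essentially:  %1 = fmul double %0, %0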
To be clear, the code I shared is not actually a piece of work I’m trying to accomplish; it was meant only to illustrate the speed differences. So for sure the (1 - B[i])*0 is totally unnecessary, but I put it in there because without it the code is not really doing the same thing as the branching if/else version, i.e., it wouldn’t mirror the assignment that happens when the boolean evaluates to false.
I guess it probably isn’t good form to post code that isn’t “real” in the sense of something I’m actually trying to do. But I just wanted to understand the theory behind the speed differences a bit better so I could hopefully learn about some generally applicable principles.
Note that the if/else version also has an unnecessary step: adding zero to y in the else clause could just be dropped. That assumes you meant to add; otherwise computing only the last iteration will be much faster, of course.
Also, note that the compiler is fairly often able to turn an if/else into a branchless ifelse when the branches are simple enough. That seems to be what happens here: in my timings, once you fix the type stability, f1 and f2 are identical, and f3 is probably within noise too.
So I went down a bit of an internet rabbit hole last night but there are some interesting insights I learned about that I wanted to share. I’m new to both Julia and coding, so please correct me if I have misunderstood anything.
First, my original code did not include the @inbounds macro, so it seems that none of the functions would have been able to use SIMD, due to the branches associated with bounds checking. This, I think, means that the speed differences cannot be attributed to some functions using SIMD and others not.
I found several articles, however (IMHO one of the best ones is here), noting that branching is inherently costly on modern CPUs due to pipelining. As I understand it, even without SIMD, a pipelined CPU computes different stages of multiple instructions during a single clock cycle. When it reaches a branching instruction, a CPU with branch prediction (as in my Core i7) guesses which branch is likely to be taken in order to keep the pipeline full, and begins working on the instructions of the predicted branch. If it guesses wrong, there’s a speed cost because the instruction pipeline needs to be flushed and refilled.
By avoiding the branching, multiplying by booleans avoids the potential for branch prediction errors in the CPU. In addition, my code probably made branch prediction errors more likely by randomly generating the booleans.
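One way to see the predictor at work is to benchmark the branching f1 on random versus sorted booleans; the arithmetic is identical and only the predictability of the branch changes (an untested sketch reusing the arrays from my original post):

B_random = rand(Bool, 10000)   # branch direction is a coin flip each iteration
B_sorted = sort(B_random)      # all falses, then all trues: trivially predictable
@btime f1($x, $B_random)
@btime f1($x, $B_sorted)       # typically faster: far fewer mispredictions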
Altogether, I think I’ll implement the following practices going forward:
For branches that involve simple instructions, avoiding the “wrong” instructions via a branching if/else may actually cost more than just computing both, because branching a) precludes SIMD and b) risks CPU branch-prediction errors. In these situations, use ifelse() and @inbounds so SIMD can work (see the sketch after this list).
For more complex instructions, test both ways to see which is actually faster, since skipping a costly branch might outweigh what is lost to forgoing SIMD and to branch-prediction errors.
Freely implement parallelism using @threads or @spawn at any level on account of Julia’s composable parallelism.
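For the first practice, here’s a minimal sketch of the branch-free pattern, combining the suggestions from this thread (f5 is just my name for it):

function f5(x::Vector{Float64}, B::Vector{Bool})
    y = 0.0                                 # type-stable accumulator
    @inbounds @simd for i in eachindex(x, B)
        y += ifelse(B[i], x[i]^2, 0.0)      # branch-free, so SIMD can apply
    end
    return y
end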
Thanks everyone for the input. This has been a very insightful discussion. Would of course appreciate any other helpful practices to add to my list.
note that the compiler is fairly often able to turn if else into branchless ifelse if the branches are simple enough.
This is an important point too. I guess that would be revealed only by inspecting the low-level code, right? How does the compiler decide whether branchless is more efficient than a branch?
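On the inspection point, I’d guess something like this would show it (a branchless compile has a select, or a conditional move in native code, where a branch would have a jump):

g(b::Bool, x::Float64) = b ? x^2 : 0.0

@code_llvm debuginfo=:none g(true, 2.0)   # branchless compile: look for `select`
@code_native g(true, 2.0)                 # on x86, often a conditional move (cmov)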