Ifelse vs ternary with tuples

I believe I have the basic idea of when to use ifelse instead of the ternary operator. Basically, ifelse avoids an expensive branch, but computes the two possibilities β€” which may be more expensive. (Also ifelse may be friendler to autodiff.) But now throw tuples into the mix, and I have questions. Specifically, looking at these lines in the Julia source:

max(x::T, y::T) where {T<:Real} = ifelse(y < x, x, y)
min(x::T, y::T) where {T<:Real} = ifelse(y < x, y, x)
minmax(x::T, y::T) where {T<:Real} = y < x ? (y, x) : (x, y)

For min and max, there’s nothing to compute between x and y, so just use ifelse. But for minmax, would those tuples actually get created? I have this very fuzzy notion that tuples are basically free, and the compiler knows how to deal with them very well. So, is this code saying they’re not that free?

For example, if x and y are BigFloats (which do require allocation), would max and min get away without allocating somehow? Would minmax have to allocate four BigFloats to fill the two tuples, and then just return one of the tuples?

Things get a little confused because the compiler itself can β€” and does β€” do transformations from ?: to ifelse (edit: and vice-versa!) when it can prove it’s advantageous (and ok) to do so. And it’s even more confused because branches aren’t necessarily bad (if they’re predictable and don’t mess up SIMD).

Those lines of code are pretty old, too β€” 10+ years β€” so I’m not sure if they’re necessarily instructive or optimal either.

In fact, Julia v1.11 (but not v1.10) compiles that minmax to the branchless select.

6 Likes

I think the best guidance these days is to just use ?: branches and let the compiler handle it unless you’re hand-crafting SIMD kernels or wanting the convenience (and, e.g., broadcast-ability) of a function.

4 Likes

The above poster beat me to most of this answer, but I think this still adds something so I’m posting it:

The compiler gets its hands on any of these and does what it wants (within the range it’s allowed). It’s common that ternary expressions get compiled to be branchless and also that a nontrivial ifelse uses a branch.

There is a semantic difference in that ifelse must always execute both arguments and ?: must only execute the correct side, but this distinction only matters when either has side-effects. For example, ifelse(true, 1, error()) must error while true ? 1 : error() must not.

In general, it tends to be slightly easier for the compiler to make branch-free code using ifelse rather than ?:. This is why it is desirable when one is attempting to write vectorizable code for hot loops. However, ?: technically allows the compiler more freedom in most situations where ifelse is relevant because it is allowed to not evaluate one side entirely (including side effects).

Since forming a tuple from two objects does not have side effects, it’s likely that either a ?: or ifelse implementation of the functions in the original question result in the same compiled code. code_native is useful here.

julia> code_native((x,y) -> y < x ? (y, x) : (x, y), NTuple{2,Float64}; debuginfo=:none)

julia> code_native((x,y) -> ifelse(y < x, (y, x), (x, y)), NTuple{2,Float64}; debuginfo=:none)

On my x86 running Julia v1.10, I find the assembly of the ifelse more attractive in this case (the ternary does emit a branch while the ifelse does not). However, changing the comparison to y <= x emits branches for both versions and I slightly prefer the ternary.

Keep in mind that a function this small is likely to be inlined, which means that it may interact with other code in significant ways. This means that this is probably not the end of the story. For example, in a loop the compiler may eliminate the branch in order to enable vectorization.

6 Likes

Thanks for the explanations. I guess, as usual, the way to go is to just write code that reads well, and don’t worry about details like this until a profiler says to β€” then experiment to see what the compiler actually does.

2 Likes

Note that the compiler tends to prefer using branches, under the assumption that well predicted code.

One example where this backfires is in binary searches, which are generally impossible to predict.
I get

julia> using Random, BenchmarkTools

julia> x = rand(Int, 2048); s = sort(x);

julia> perm = randperm(length(x));

julia> function findbench(f, x, vals)
           @inbounds for i = eachindex(x, vals)
               v = vals[i]
               f(x[v], x) == v || throw("oops")
           end
       end
findbench (generic function with 1 method)

julia> @benchmark findbench((x,v)->searchsortedfirst(v, x), $s, $perm)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  94.526 ΞΌs … 117.188 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     95.964 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   96.209 ΞΌs Β±   1.087 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

            β–β–‚β–„β–„β–†β–‡β–ˆβ–‡β–†β–†β–„β–‚                                        
  β–‚β–‚β–‚β–‚β–‚β–ƒβ–„β–„β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–„β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚ β–„
  94.5 ΞΌs         Histogram: frequency by time         99.6 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

However, if I start Julia via JULIA_LLVM_ARGS="-x86-cmov-converter=false" julia, which disables the cmov->branch conversion (so our code is actually branchless), I get

julia> @benchmark findbench((x,v)->searchsortedfirst(v, x), $s, $perm)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  40.656 ΞΌs … 62.645 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     40.919 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   41.682 ΞΌs Β±  2.223 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–„β–ˆβ–†                 ▁▁▁               ▂▂▁          ▁▂▁      ▁
  β–ˆβ–ˆβ–ˆβ–‡β–ƒβ–β–β–ƒβ–β–β–ƒβ–β–β–β–β–ƒβ–„β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–„β–ƒβ–β–„β–„β–„β–„β–β–„β–ƒβ–†β–ˆβ–ˆβ–ˆβ–†β–β–β–β–ƒβ–β–β–β–β–…β–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–…β–† β–ˆ
  40.7 ΞΌs      Histogram: log(frequency) by time      49.9 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Would be nice if we had a way to tell the compiler β€œno really, I do want this particular code to be branchless!” without using a hammer that impacts all code like that ENV variable.

7 Likes

I didn’t know we could do this! But yeah it would be really nice if we could tell which parts do branchless/branchful.

I did implement one with @asmcall when I was trying to learn more about asm, the performance is comparable. Although, you start to get worse performance if your array does not fit L1/L2 cache.