Multiple Outputs (Tuple) from IfElse with VectorizationBase

Hi there,

I am trying to experiment a bit with VectorizationBase to create functions that can be used inside the @turbo macro from LoopVectorization.

I have a use case where the computation can be significantly simplified when one of my inputs is 0. The shortcut is not only faster: the full routine would produce NaN values at points where the input is 0. So I was thinking of writing two subfunctions that accept Vec inputs from VectorizationBase and selecting between them with IfElse.ifelse.

Since I am dealing with Complex outputs and LoopVectorization does not currently support complex inputs, I would need to return 2 values from the subfunctions (or two sets of Vec, one for the real and one for the imaginary part of the result).

Unfortunately it seems that IfElse.ifelse fails as soon as I want to produce more than one output from my subfunctions.

Here is a MWE of the problem I am facing.

using IfElse, VectorizationBase

vx = Vec(ntuple(_ -> rand(0:3), VectorizationBase.pick_vector_width(Int64))...)

_simple(x) = (x,x)

_full(x) = (10x,5x)

IfElse.ifelse(vx == 0,_simple(vx)[1],_full(vx)[1])

The above example works, but if one removes the indexing inside the ifelse as follows:
IfElse.ifelse(vx == 0,_simple(vx),_full(vx))

I get the following error:

I could use two ifelse statements to gather both outputs separately, but I guess that would re-perform all the computations inside the subfunctions twice.

Is getting two (or more) outputs out of the ifelse possible or is it not supported yet?

Tagging @Elrod as he might be able to shed light on this :slight_smile:

I could add a method.
And you wouldn’t have to recompute:

s = _simple(vx)
f = _full(vx)
x = IfElse.ifelse(vx == 0, s[1], f[1])
y = IfElse.ifelse(vx == 0, s[2], f[2])

Alternatively, this should already work: …(IfElse.ifelse(vx == 0, VecUnroll(_simple(vx)), VecUnroll(_full(vx))))

Thanks a lot,

Looking at your first proposed solution, does this imply that whenever you have an ifelse branch, both functions are evaluated for each input before the selection is done with the mask?

So there would be no actual time saving here by doing a branch inside the @turbo macro since both the simple and full version are evaluated for each input?

Much of LoopVectorization’s performance benefit comes from SIMD, which stands for “Single Instruction Multiple Data”.

Basically, it applies each instruction to multiple data points. Trying to do this while handling branches is tricky.
Referring to each iteration of a loop / element being processed as a “lane”, the simplest approach is to have each lane take both sides of the branch and then combine the results afterwards based on which side of the branch each particular lane would’ve taken.
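To make the "take both sides, then combine" idea concrete, here is a minimal scalar emulation in plain Julia. It uses ordinary tuples as stand-ins for SIMD lanes and Base's `ifelse`, so it is only an illustration of the selection logic, not what LoopVectorization actually generates; the input values and the `simple_side`/`full_side` names are made up for the example.

```julia
# Four "lanes" emulated with a plain tuple (illustrative values).
x = (0, 2, 0, 3)

# Every lane evaluates BOTH sides of the branch.
simple_side = map(v -> v, x)     # side taken where x == 0
full_side   = map(v -> 10v, x)   # side taken where x != 0

# The mask records which side each lane would have taken.
mask = map(iszero, x)

# Combine afterwards: pick per lane according to the mask.
result = map(ifelse, mask, simple_side, full_side)
```

Note that `full_side` is computed even for the lanes where `x == 0`; the mask only decides which of the two already-computed values is kept.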

If you have a rarely taken path that is also very slow, e.g. you have a function that needs lots of special handling over a certain rarely encountered range of the input, you could insert an actual branch and check if VectorizationBase.vany(condition) to only evaluate that branch if it is actually needed.
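A rough sketch of that idea, written against a standalone `Vec` rather than inside `@turbo`. The function names `_common`, `_rare`, and `both_outputs` are hypothetical stand-ins, and the sketch assumes that `VectorizationBase.vany` applied to the mask from `vx == 0` returns a plain `Bool`:

```julia
using IfElse, VectorizationBase

_common(x) = (10x, 5x)  # hypothetical path that most lanes take
_rare(x)   = (x, x)     # hypothetical stand-in for a slow, rarely needed path

function both_outputs(vx)
    m = vx == 0                      # per-lane mask
    a, b = _common(vx)               # always evaluated
    if VectorizationBase.vany(m)     # real branch: taken only if some lane needs it
        ra, rb = _rare(vx)
        a = IfElse.ifelse(m, ra, a)  # blend per lane, one ifelse per output
        b = IfElse.ifelse(m, rb, b)
    end
    return a, b
end

vx = Vec(ntuple(_ -> rand(0:3), VectorizationBase.pick_vector_width(Int64))...)
out1, out2 = both_outputs(vx)
```

When no lane satisfies the condition, `_rare` is never called, so its cost is only paid for the vectors that actually contain such a lane. The subfunctions are still evaluated once each, not once per output.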
