I’m not a LoopVectorization user, so others will chime in, but did you look at the package’s README, in particular this part:
> We expect that any time you use the `@turbo` macro with a given block of code that you:
>
> 3. Are not relying on a specific execution order. `@turbo` can and will re-order operations and loops inside its scope, so the correctness cannot depend on a particular order. You cannot implement `cumsum` with `@turbo`.
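To make the README’s `cumsum` point concrete, here is a plain Julia running sum with no `@turbo`. `my_cumsum!` is a hypothetical name used only for illustration, and the sketch assumes ordinary 1-based `Vector{Float64}` inputs. Each iteration reads the value the previous iteration just wrote, so the iterations form a chain that cannot be reordered or run side by side.

```julia
# Hypothetical helper (not from the post): a hand-written running sum.
# Iteration i reads out[i-1], which iteration i-1 just wrote, so the
# iterations form a chain and cannot be reordered or run side by side.
function my_cumsum!(out::Vector{Float64}, x::Vector{Float64})
    length(out) == length(x) || throw(DimensionMismatch("out and x must have the same length"))
    isempty(x) && return out
    out[1] = x[1]
    for i in 2:length(x)
        out[i] = out[i-1] + x[i]  # loop-carried dependency on iteration i-1
    end
    return out
end

x = [1.0, 2.0, 3.0, 4.0]
my_cumsum!(similar(x), x)  # returns [1.0, 3.0, 6.0, 10.0], same as cumsum(x)
```

This cross-iteration dependency is a different thing from the within-iteration ordering discussed below, where each iteration only touches its own element.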
I admit I’m still a little confused here. Clearly, this example does not rely on the loop execution order: each iteration operates on only a single element. But it does rely on the instruction order within a single loop iteration. Almost every program does, for that matter.
I haven’t used it before, but I had assumed LoopVectorization would not normally reorder dependent instructions. It’s hard to write a correct program if the tool will occasionally transform `(a + b) * c` into `a * c`, as appears to have happened here.
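For reference, here is what an in-order execution of a loop body with repeated `tbuf[i]` accesses should produce. The body is hypothetical: it reuses the constants from the rewritten loop later in the post (`1.0`, `+= 1.0`, `* -1.0`) and runs as a plain Julia loop without `@turbo`, where statements execute in program order.

```julia
# Hypothetical loop body with repeated reads/writes of tbuf[i], run as a
# plain Julia loop (no @turbo), where statements execute in program order.
function repeated_access!(tbuf::Vector{Float64})
    for i in eachindex(tbuf)
        tbuf[i] = 1.0             # a
        tbuf[i] += 1.0            # a + b
        tbuf[i] = tbuf[i] * -1.0  # (a + b) * c
    end
    return tbuf
end

repeated_access!(zeros(4))  # returns [-2.0, -2.0, -2.0, -2.0]
```

If the `+= 1.0` step is lost, i.e. `(a + b) * c` effectively becomes `a * c`, every element comes out as `-1.0` instead, which is an easy signature to look for.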
Why did this fail? Would it have worked if it were instead written as follows? This version appears to work correctly:
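```julia
using LoopVectorization

tbuf = zeros(8)  # example buffer; size chosen only for illustration
@turbo for i = 1:length(tbuf)
    z = 1.0
    z += 1.0
    tbuf[i] = z * -1.0  # expected: -2.0 in every element
end
```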
So the issue appears to be related to the repeated accesses to `tbuf[i]` in the original. Can someone clarify which patterns are and aren’t safe?
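Until the rules are clearer, one cheap way to catch this kind of wrong result is to run the fast kernel and a plain reference kernel on identical inputs and compare. This is a minimal sketch; `agrees_with_reference`, `plain_kernel!` and `dropped_add_kernel!` are hypothetical names, and the second kernel simply mimics the `(a + b) * c` to `a * c` symptom rather than reproducing what `@turbo` actually generated.

```julia
# Run a fast kernel (e.g. one whose loop is wrapped in @turbo) and a plain
# reference kernel on identical copies of the same input; report agreement.
function agrees_with_reference(fast!, ref!, n::Integer)
    a = collect(1.0:n)  # arbitrary but deterministic input
    b = copy(a)
    fast!(a)
    ref!(b)
    return isapprox(a, b)
end

# Hypothetical kernels standing in for the real ones.
function plain_kernel!(v)
    for i in eachindex(v)
        z = 1.0
        z += 1.0
        v[i] = z * -1.0
    end
    return v
end

function dropped_add_kernel!(v)  # mimics (a + b) * c -> a * c
    for i in eachindex(v)
        v[i] = 1.0 * -1.0
    end
    return v
end

agrees_with_reference(plain_kernel!, plain_kernel!, 16)        # true
agrees_with_reference(dropped_add_kernel!, plain_kernel!, 16)  # false
```

With LoopVectorization loaded, `fast!` would be the same loop written with `@turbo`; that pairing is not shown here because it needs the package.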