So I am not getting the same output when I use @turbo. Why is it not appropriate to use it in this case? It was not obvious to me from reading LoopVectorization.jl limitations, so I suspect I may have misunderstood something. Thanks.
Here, the loop iterations are not fully independent: they can be executed in any order, but different iterations modify the same y1 element, which means they cannot be executed in parallel (SIMD is instruction-level parallelism). And you can see that it is precisely for these elements (which have value 3.0 after the scalar loop) that the results differ.
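To make the point above concrete, here is a minimal sketch in plain Julia. It does not use LoopVectorization.jl and is not what `@turbo` actually generates; it only emulates a SIMD-style "gather, add, scatter" over chunks of a hypothetical width of 4. The arrays `idx` and `x` and the function `simd_like_scatter_add!` are made up for illustration and are not taken from the original question.

```julia
# Hypothetical data: idx contains repeated entries, so several
# iterations update the same element of y.
idx = [1, 2, 2, 3, 2, 4, 1, 5]
x = ones(length(idx))

# Scalar loop: each iteration sees the result of the previous ones.
y_scalar = zeros(5)
for i in eachindex(idx)
    y_scalar[idx[i]] += x[i]
end

# Emulation of a vectorized loop (an assumption about the general
# gather/add/scatter pattern, not LoopVectorization's real codegen).
function simd_like_scatter_add!(y, idx, x; width = 4)
    for start in 1:width:length(idx)
        lanes = start:min(start + width - 1, length(idx))
        gathered = y[idx[lanes]]        # all lanes load before any lane stores
        updated = gathered .+ x[lanes]  # all lanes add "simultaneously"
        for (k, i) in enumerate(lanes)
            y[idx[i]] = updated[k]      # duplicate indices: the last lane wins
        end
    end
    return y
end

y_simd = simd_like_scatter_add!(zeros(5), idx, x)

println(y_scalar)  # [2.0, 3.0, 1.0, 1.0, 1.0]
println(y_simd)    # [2.0, 2.0, 1.0, 1.0, 1.0]
```

In this sketch the two results differ only at element 2, which is written twice within the same chunk of lanes. Element 1 is also written twice, but in different chunks, so it happens to come out right.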
I would guess that invariance to shuffling is not a sufficient condition either. The different iterations must also be able to execute simultaneously, so check for possible data races.
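For a loop of the form `y[idx[i]] += x[i]`, one simple way to check for this kind of race is to test whether any index is repeated. This is only a sketch under that assumption about the loop's shape; `has_write_conflicts` and both index vectors are hypothetical names for illustration.

```julia
# If two iterations write to the same y element, running them
# simultaneously is a data race. For a pure scatter-update loop this
# reduces to asking whether idx contains duplicates.
has_write_conflicts(idx) = !allunique(idx)

idx_with_repeats = [1, 2, 2, 3]   # iterations 2 and 3 both touch y[2]
idx_permutation  = [4, 1, 3, 2]   # every iteration touches a distinct element

println(has_write_conflicts(idx_with_repeats))  # true
println(has_write_conflicts(idx_permutation))   # false
```

If the check reports conflicts, a plain scalar loop is the safe choice for that update.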