I actually already get segfaults for M=10.
If I increase ThreadingUtilities.THREADBUFFERSIZE to 1024, it works fine.
Basically, @tturbo is trying to store more variables in the buffer than actually fit, which causes the crashes.
For each variable, it requires a number of bytes equal to the SIMD width. That is 64 for AVX512, or 32 for AVX2.
It also needs some of the storage for other things, so I’d expect room for only 7 variables with AVX512 and 15 with AVX2. I’d have to double check, but if it’s already crashing for M=15, I guess it’s using a bit more than that.
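To make that arithmetic concrete, here is a rough sketch. The helper max_buffered_vars is hypothetical (it is not part of ThreadingUtilities.jl or LoopVectorization.jl), the 512-byte default is only what the 7 and 15 figures imply, and the single reserved slot is a guess at the "other things" overhead.

```julia
# Hypothetical helper for illustration only; not a library function.
# buffer_bytes: size of the per-thread buffer in bytes
# simd_bytes:   SIMD register width in bytes (64 for AVX512, 32 for AVX2)
# reserved:     slots assumed to be used for other bookkeeping (a guess)
max_buffered_vars(buffer_bytes, simd_bytes; reserved = 1) =
    buffer_bytes ÷ simd_bytes - reserved

# Assumed 512-byte buffer (inferred from the estimates, not checked in the source)
avx512_default = max_buffered_vars(512, 64)
avx2_default   = max_buffered_vars(512, 32)

# After raising the buffer to 1024 bytes
avx512_large = max_buffered_vars(1024, 64)
avx2_large   = max_buffered_vars(1024, 32)
```

Under these assumptions the 1024-byte buffer roughly doubles the headroom, which would be consistent with the larger buffer no longer crashing.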
You could increase the buffer.
Or maybe, because it’s possible to check at compile time whether everything fits, @tturbo could do that check itself.
The fundamental problem with LoopVectorization.jl here is that it currently only handles “loop independent” dependencies and not “loop carried” dependencies aside from reductions.
Loop independent dependencies are those that exist only within a loop iteration, and thus don’t stop you from reordering the iterations arbitrarily.
Loop carried dependencies are those between iterations.
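A small plain-Julia sketch of the difference (no LoopVectorization involved; the function names and the explicit order argument are made up for illustration). The first loop only has a dependency inside each iteration, so running the iterations backwards gives the same result. The second reads what the previous iteration wrote, so reordering changes the answer.

```julia
# Loop independent: `t` is produced and consumed within the same iteration.
function scale_shift!(y, x, order)
    for i in order
        t = 2 * x[i]
        y[i] = t + 1
    end
    return y
end

# Loop carried: iteration i reads y[i - 1], which iteration i - 1 wrote.
function running_sum!(y, x, order)
    for i in order
        prev = i == firstindex(y) ? zero(eltype(y)) : y[i - 1]
        y[i] = prev + x[i]
    end
    return y
end

x = [1, 2, 3, 4]
fwd = 1:4
rev = 4:-1:1

a = scale_shift!(zeros(Int, 4), x, fwd)
b = scale_shift!(zeros(Int, 4), x, rev)   # same as `a`
c = running_sum!(zeros(Int, 4), x, fwd)
d = running_sum!(zeros(Int, 4), x, rev)   # differs from `c`
```

A vectorizer that is free to reorder or batch iterations would be fine on the first loop, but would need to understand the dependency to get the second one right.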
If it handled loop carried dependencies, you could write the loading/storing to rez as part of the loop and still get correct answers. It should then also be able to optimize the situation better when M is large (by not unrolling it entirely).
I’m working on support for this, but I’d expect it to take a long time, as it is part of a ground-up rewrite of the library.