Strange performance of a loop

What you show is only the address of the 1st element.
What I suggest is making sure that no vectorized loop has an "anomaly" to take care of, i.e. leftover elements that don't fill a whole SIMD register.

I meant something like Intel IPP.

If you define a 1D array, it will be padded so that its size is a multiple of 16 / 32 / 64 bytes.
If you define a 2D array, each row will be padded so that its length is also a multiple of 16 / 32 / 64 bytes.

This way every loop can be unrolled and vectorized without having to handle edge cases.
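
To make the idea concrete, here is a rough sketch in C of what such padding could look like. It is not IPP's actual API; the names `ALIGN_BYTES`, `padded_bytes`, `padded_row_len`, `alloc_padded_2d` and `row_sum` are made up for illustration, and I am assuming `float` elements and a 64-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed boundary: 64 bytes (16 or 32 would work the same way). */
#define ALIGN_BYTES 64

/* Round a byte count up to the next multiple of ALIGN_BYTES. */
static size_t padded_bytes(size_t nbytes)
{
    return (nbytes + ALIGN_BYTES - 1) / ALIGN_BYTES * ALIGN_BYTES;
}

/* Row stride in elements: the number of floats in one padded row. */
static size_t padded_row_len(size_t cols)
{
    assert(ALIGN_BYTES % sizeof(float) == 0);
    return padded_bytes(cols * sizeof(float)) / sizeof(float);
}

/* Allocate a rows x cols float matrix in which every row starts on an
 * ALIGN_BYTES boundary. The padding is zero-filled. The stride (in
 * elements) is written to *stride. Release the buffer with free(). */
static float *alloc_padded_2d(size_t rows, size_t cols, size_t *stride)
{
    *stride = padded_row_len(cols);
    size_t total = rows * *stride * sizeof(float);
    /* C11 aligned_alloc: total is a multiple of ALIGN_BYTES by construction. */
    float *p = aligned_alloc(ALIGN_BYTES, total);
    if (p != NULL)
        memset(p, 0, total);
    return p;
}

/* Sum one row over the full padded stride. The trip count is a multiple
 * of the vector width, so no remainder loop is needed, and the zero
 * padding does not change the result. */
static float row_sum(const float *row, size_t stride)
{
    float s = 0.0f;
    for (size_t i = 0; i < stride; ++i)
        s += row[i];
    return s;
}
```

Row `r` then starts at `p + r * stride`, and any loop over a row can run over `stride` elements instead of `cols`. This only works as-is for operations where the padding value is harmless (zero for a sum, for example); other reductions would need a different fill value.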