Strange performance of a loop

I think that prior to Sandy Bridge, accessing aligned data with the unaligned load instruction was inefficient.
On modern CPUs, if the data is aligned, it doesn't matter whether you use the load that assumes alignment or the one that doesn't.
But I still think accessing unaligned data is slower than accessing aligned data.

But my point is different.
We must make sure the length of the allocated data is a multiple of 16 Bytes (for SSE) / 32 Bytes (for AVX) / 64 Bytes (for AVX512).
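
To make the size requirement concrete, here is a minimal sketch in C11. The helper names `round_up` and `alloc_padded_floats` are made up for illustration, and I am assuming a platform that provides `aligned_alloc` (C11 requires the requested size to be a multiple of the alignment, which is exactly the rule above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round n_bytes up to the next multiple of alignment.
 * alignment must be a power of two (16, 32 or 64 for SSE / AVX / AVX512). */
size_t round_up(size_t n_bytes, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n_bytes + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical helper: allocate room for n floats so that both the start
 * address and the total size are multiples of alignment.
 * Returns NULL on failure; release the buffer with free(). */
float *alloc_padded_floats(size_t n, size_t alignment)
{
    size_t bytes = round_up(n * sizeof(float), alignment);
    float *p = aligned_alloc(alignment, bytes);
    if (p != NULL) {
        /* Zero the padding so a full-width load of the tail reads defined values. */
        for (size_t i = n; i < bytes / sizeof(float); ++i) {
            p[i] = 0.0f;
        }
    }
    return p;
}
```

For example, 10 floats (40 Bytes) with a 32 Byte alignment end up in a 64 Byte buffer, so the last vector load of the loop never reads past the allocation.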

The tricky part is dealing with 1D / 2D / 3D / etc. arrays.
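
One common way to handle the 2D case (a sketch of one option, not the only one) is to pad every row to a multiple of the vector width, so that each row starts on an aligned address. The `PaddedImage` type and the function names below are hypothetical, and `aligned_alloc` from C11 is assumed to be available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical 2D float array whose rows are padded to the vector width. */
typedef struct {
    float *data;
    size_t rows;
    size_t cols;   /* number of valid elements per row */
    size_t stride; /* number of floats per padded row, stride >= cols */
} PaddedImage;

/* Number of floats per row after padding cols up to a whole number of vectors.
 * alignment is in Bytes and must be a power of two (16, 32 or 64). */
size_t padded_stride(size_t cols, size_t alignment)
{
    assert(alignment >= sizeof(float) && (alignment & (alignment - 1)) == 0);
    size_t floats_per_vector = alignment / sizeof(float);
    return (cols + floats_per_vector - 1) / floats_per_vector * floats_per_vector;
}

/* Allocate a zero-filled rows x cols array. Returns 0 on success, -1 on failure.
 * The total size is a multiple of alignment because each padded row is. */
int padded_image_alloc(PaddedImage *img, size_t rows, size_t cols, size_t alignment)
{
    img->rows = rows;
    img->cols = cols;
    img->stride = padded_stride(cols, alignment);
    size_t bytes = rows * img->stride * sizeof(float);
    img->data = aligned_alloc(alignment, bytes);
    if (img->data == NULL) {
        return -1;
    }
    memset(img->data, 0, bytes); /* padding elements get a defined value */
    return 0;
}

/* Pointer to the first element of row r; aligned for every r. */
float *padded_image_row(const PaddedImage *img, size_t r)
{
    return img->data + r * img->stride;
}
```

Element (r, c) is then `padded_image_row(&img, r)[c]`. The cost is some wasted memory per row, and any code that treats the buffer as one contiguous rows x cols block has to use `stride` instead of `cols`. The same idea extends to 3D by padding only the innermost dimension.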