Have you tried passing a `setup` block that re-allocates, copies, or shuffles your data?
It could be caching or memory-locality effects, but it could also be any number of other things. Modern computer architectures (and operating systems) are wild. From thermal throttling to (surprisingly long-lived) branch prediction state to caching to multithreading to hyperthreading… you're fine-tuning performance in a hostile environment.
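The post doesn't say which benchmarking tool is in use. As an analogous sketch only, Python's standard `timeit` module also takes a `setup` argument. With `number=1`, that setup runs before every timed call, so each measurement gets a freshly copied and shuffled input while the copy and shuffle are excluded from the timing. The sorting workload, `N`, and the variable names are hypothetical.

```python
import random
import statistics
import timeit

N = 50_000  # hypothetical input size

# Hypothetical workload: in-place sort. Because sort() mutates its input,
# reusing one list would mean every call after the first sorts
# already-sorted data, which would skew the measurement.
data = list(range(N))

# timeit runs `setup` once before each repeat. With number=1, every timed
# call therefore gets a new list `x` (a fresh allocation) in a fresh random
# order. The setup cost is not included in the reported times.
setup = "x = data.copy(); random.shuffle(x)"

times = timeit.repeat(
    stmt="x.sort()",
    setup=setup,
    repeat=20,
    number=1,
    globals={"data": data, "random": random},
)

print(f"min {min(times):.6f}s  median {statistics.median(times):.6f}s")
```

Comparing the minimum and median across repeats gives a rough sense of how much run-to-run noise the environment adds. Each call now starts from new memory instead of a warm, reused buffer.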