I’m definitely not an image processing person, but it might be worth thinking about whether this is really an effective measure of image similarity. I’d be doubtful about how far you can take the concept of “lag” or displacement here, especially since there is no sense in which x_{i+1} follows x_{i} in a sequence when x_{i} is the last pixel of one row and x_{i+1} is the first pixel of the next.
Just thinking out loud here, but one thing you could do is subtract the images from each other and run some sort of test to see whether the difference is “random noise”. Of course, this is easier said than done, and it might be a good place for some sort of dimensionality reduction.
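To make the "subtract and test for noise" idea a bit more concrete, here is a rough sketch in plain Python. Everything in it is my own assumption rather than an established method: the function names are made up, images are plain lists of rows, and the "test" is only a crude heuristic that checks whether horizontally adjacent pixels of the difference are correlated (white noise should give a lag-1 autocorrelation of roughly zero, within about 2/sqrt(n)). It deliberately never pairs the last pixel of one row with the first pixel of the next.

```python
from math import sqrt

def difference(a, b):
    """Pixel-wise difference of two equally sized images (lists of rows)."""
    return [[pa - pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

def row_lag1_autocorr(d):
    """Lag-1 autocorrelation over horizontally adjacent pixels only.

    Pairs are formed within each row, so no pair crosses a row boundary.
    """
    values = [v for row in d for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values)
    if var == 0:
        return 0.0  # constant difference: no spatial structure to measure
    num = sum((row[j] - mean) * (row[j + 1] - mean)
              for row in d for j in range(len(row) - 1))
    return num / var

def looks_like_noise(d, z=2.0):
    """Heuristic only: is the lag-1 autocorrelation within z / sqrt(n_pairs)?"""
    n_pairs = sum(len(row) - 1 for row in d)
    return abs(row_lag1_autocorr(d)) < z / sqrt(n_pairs)

# Example: a horizontal gradient minus a flat image leaves a strongly
# structured difference, which this check should not accept as noise.
gradient = [[float(j) for j in range(20)] for _ in range(20)]
flat = [[0.0] * 20 for _ in range(20)]
d = difference(gradient, flat)
print(row_lag1_autocorr(d), looks_like_noise(d))
```

A real check would also need to look at vertical neighbours and longer lags, so treat this as a starting point only.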
Having said all that, I would think the conceptually simplest way to do what you are trying to do is to just vectorize the image. Then you can use crosscor and know exactly what you are looking at.
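For what it's worth, here is a minimal sketch of what "vectorize, then cross-correlate" amounts to, written in plain Python so that every step is visible. This is not crosscor itself: `flatten` and `cross_correlation` are hypothetical helpers, the row-major flattening order and the lag convention (x[t] paired with y[t + lag]) are my assumptions, and a library routine may normalize or order things differently.

```python
from math import sqrt

def flatten(image):
    """Vectorize an image (list of rows) in row-major order."""
    return [p for row in image for p in row]

def cross_correlation(x, y, lag):
    """Normalized cross-correlation between x[t] and y[t + lag].

    Uses the means and sums of squares of the full vectors. Assumes x and y
    have the same length and that neither is constant (otherwise the
    denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - mx) ** 2 for v in x))
    sy = sqrt(sum((v - my) ** 2 for v in y))
    if lag >= 0:
        pairs = zip(x[:n - lag], y[lag:])
    else:
        pairs = zip(x[-lag:], y[:n + lag])
    return sum((a - mx) * (b - my) for a, b in pairs) / (sx * sy)

# Example: two small images, compared at a few lags of the flattened vectors.
a = flatten([[1, 2, 3], [4, 5, 6]])
b = flatten([[2, 3, 4], [5, 6, 8]])
for lag in (-1, 0, 1):
    print(lag, cross_correlation(a, b, lag))
```

Note that after flattening, a lag of 1 pairs the last pixel of each row with the first pixel of the next, so only lag 0 has a clean pixel-to-pixel interpretation.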
You also might want to check out the various packages here.