Yes, but note that here you're calling `sum(xs)` and not `sum(f, xs)`.
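To illustrate the difference between the two call forms, here is a minimal sketch. The function `f` and the vector `xs` are hypothetical stand-ins, not taken from your code. Broadcasting first builds a temporary array, while the two-argument form applies `f` during the reduction.

```julia
xs = rand(1_000)        # hypothetical input data
f(x) = x^2              # hypothetical stand-in for the real function

s_broadcast = sum(f.(xs))   # materializes f.(xs) as a temporary array, then sums it
s_mapped    = sum(f, xs)    # applies f to each element while summing, no temporary array

println(s_broadcast ≈ s_mapped)
```

Both forms should agree up to floating-point rounding; the second one simply skips the intermediate allocation.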
You may need to load LoopVectorization (`using LoopVectorization`) for Tullio to generate a fully optimized kernel. More importantly, I would extract `deformation_indexed[:, 1, :]` into its own local variable, which could save a lot of compute and memory overhead.
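As a rough sketch of what I mean by extracting the slice (plain loops rather than Tullio, and with an assumed array shape and hypothetical function names, since I don't know your actual code): every evaluation of `A[:, 1, :]` allocates a fresh copy, so doing it inside a hot loop repeats that work on each iteration.

```julia
deformation_indexed = rand(4, 3, 5)   # assumed shape, for illustration only

# Slice taken inside the loop: each iteration allocates a new 4×5 copy.
function total_inline(A)
    acc = 0.0
    for j in axes(A, 3), i in axes(A, 1)
        acc += A[:, 1, :][i, j]
    end
    return acc
end

# Slice taken once, outside the loop. `@view` avoids the copy entirely;
# a plain `A[:, 1, :]` here would instead make a single contiguous copy.
function total_hoisted(A)
    slice = @view A[:, 1, :]
    acc = 0.0
    for j in axes(slice, 2), i in axes(slice, 1)
        acc += slice[i, j]
    end
    return acc
end

println(total_inline(deformation_indexed) ≈ total_hoisted(deformation_indexed))
```

Whether a view or a one-time copy is faster depends on the access pattern, so it is worth benchmarking both on the real data.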
Also, what is `register`? It seems like there is more code here that may influence performance (e.g. if `register` is a mutable struct), so an MWE would be much appreciated.