Have you tried factored kernels? See https://juliaimages.github.io/latest/imagefiltering.html#Factored-kernels-1. It could also be worth moving the last dimension to be the first: Julia arrays are column-major, so that would potentially lead to better memory locality.
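
A rough sketch of both ideas, in case it helps. The array `A`, its size, and the Gaussian width `σ` are made up for illustration, since I don't know what your data or kernel look like. It uses `Kernel.gaussian` for the dense kernel and `KernelFactors.gaussian` for the same kernel stored as a tuple of 1-D factors, then `permutedims` to bring the last dimension to the front.

```julia
using ImageFiltering

# Hypothetical data: a 3-D array; the sizes are arbitrary.
A = rand(Float64, 64, 64, 32)

# The same 2-D Gaussian, once as a dense kernel and once as 1-D factors.
σ = 2.0                                    # assumed width, pick your own
dense    = Kernel.gaussian((σ, σ))         # dense 2-D kernel
factored = KernelFactors.gaussian((σ, σ))  # tuple of two 1-D kernels

# Filter one 2-D slice both ways; the results should agree up to rounding.
img = A[:, :, 1]
out_dense    = imfilter(img, dense)
out_factored = imfilter(img, factored)

# Move the last dimension to the front. In Julia the first index is the
# contiguous one, so B[:, i, j] is a contiguous run of memory.
B = permutedims(A, (3, 1, 2))
```

For your own separable kernel, `kernelfactors` accepts a tuple of vectors in the same way. Whether either change actually helps depends on your kernel and access pattern, so it is worth timing with `@btime` from BenchmarkTools before and after.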