I found that using ForwardDiff for getting the residuals is slightly more efficient (but can deal with the additional viscosity).
Any idea on how to use forward mode to get the residual and reverse mode for the grads?
I saw this post: How to achieve good performance with Zygote.pushforward on a neural network which might solve this issue, however, it is outdated and the codes there are not working anymore.