One way is to instead do
Wapp = BB * (BB - AA * (BB \ AA))
Solving linear systems is often better than doing direct inversion, both for performance and numerical stability. I can, however, not guarantee that this will help in your particular case, since you are doing matrix-matrix products. For the vector-matrix case, you should definitely do A \ x instead of inv(A) * x.
As far as I can tell, though, your code has a lot of mathematical structure, that you can probably carry a bit further on paper, like the quoted expression, and
where Bf/Sf etc. are actually diagonal matrices? Here you can surely exploit the structure of W to better invert or factorize it.
I am also a bit skeptical whether you are choosing the correct parallelization strategy. You turn off BLAS multi-threading, but have you also done the same in Matlab? Matlab does a lot of implicit multi-threading that is not visible to the user, do you have control over that? Perhaps you can turn off multi-threading for both Matlab and Julia, and compare single threaded performance, also making sure that BLAS-threading is comparable.
Your code is a bit complex to analyze at a glance, could you identify the bottlenecks precisely to help the other posters? And also provide dummy data so others can run your code? Due to xmas/newyears I personally am a bit lazy right now, and respond slowly.
Also, one more thing, your problem is possibly smack in the middle of the Matlab sweet spot, so large speed-ups in other languages may not be realistic, at least not without some extra effort on exploiting matrix structure, in-place operations, etc.