A good idea is to check the performance of each function separately, for instance using @btime from BenchmarkTools:
julia> @btime mat_z(... inputs ...)
where you pass some representative inputs to the function. This matters most for functions that are called many times within loops, and the result is easier to interpret than the complete profile.
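As a rough sketch of what such an isolated benchmark might look like: the function `sum_of_squares` and the input `x` below are hypothetical stand-ins (not taken from your code), and the example assumes the BenchmarkTools package is installed.

```julia
using BenchmarkTools  # provides @btime and @benchmark

# Hypothetical stand-in for a function that is called many times inside a loop.
function sum_of_squares(x::AbstractVector{<:Real})
    s = zero(eltype(x))
    for xi in x
        s += xi * xi
    end
    return s
end

x = rand(1000)  # representative input; adjust size and type to match the real calls

# Interpolate globals with `$` so the benchmark measures the function itself
# rather than the overhead of accessing an untyped global variable.
@btime sum_of_squares($x)
```

The printed line reports the minimum time together with the number and size of allocations, so it also shows quickly whether a given function allocates at all.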
The most important optimizations will come from removing temporary allocations, as mentioned above. This post may help: Common allocation mistakes
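To illustrate the kind of change that usually helps, here is a minimal sketch with made-up functions (`step_alloc` and `step_inplace!` are hypothetical, not from your code). It assumes the temporary comes from a broadcast that creates a new array on every call, which is one common pattern but may not be the one in your case.

```julia
# Allocating version: `a .+ b` creates a new temporary array on every call.
function step_alloc(a, b)
    c = a .+ b
    return sum(c)
end

# In-place version: the caller provides a preallocated buffer that is reused.
function step_inplace!(c, a, b)
    c .= a .+ b      # fused broadcast writes into `c` without a temporary
    return sum(c)
end

a, b = rand(1000), rand(1000)
c = similar(a)       # allocate the buffer once, outside the hot loop

# Call each function once first so that compilation is not counted.
step_alloc(a, b); step_inplace!(c, a, b)

println("allocating: ", @allocated(step_alloc(a, b)), " bytes")
println("in-place:   ", @allocated(step_inplace!(c, a, b)), " bytes")
```

Inside a loop, the buffer `c` would be created once before the loop and passed in at every iteration, so the per-iteration allocation goes away.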