If you have a parallel mapreduce(f, op, x) (e.g., reduce(op, Map(f), x) from Transducers.jl), a neat way to minimize allocation and call mul! would be to use LazyArrays. Something like this (untested):
using LazyArrays: @~
using Transducers: Map
z = reduce(Map(x -> @~ x'x), x; init=nothing) do a, b
    # First element: materialize the lazy product; afterwards accumulate in place.
    a === nothing ? copy(b) : (a .+= b)
end
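For reference, here is a serial sketch of what I'd expect the lazy version to boil down to: one output buffer, updated in place with the 5-argument mul! from LinearAlgebra. It assumes x is a collection of equally sized real matrices; the names xs and z and the sizes are made up for illustration.

```julia
using LinearAlgebra: mul!

# Hypothetical input: ten 4×3 matrices standing in for the collection `x`.
xs = [rand(4, 3) for _ in 1:10]

# Single preallocated accumulator for sum(x'x for x in xs).
z = zeros(3, 3)
for x in xs
    # 5-argument mul!: z = (x' * x) * 1.0 + z * 1.0, with no temporary for x'x.
    mul!(z, x', x, 1.0, 1.0)
end

# Sanity check against the naive, allocating version.
@assert z ≈ sum(x' * x for x in xs)
```

A parallel version would keep one such buffer per task and add the buffers together at the end, which is the job the combining step does in the reduce call above.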
See also: Parallel reductions