@distributed (op) for with mutable types

You are right. mapreduce in Base is not parallel. Perhaps Distributed.jl should add a parallel version. (You can also use Transducers.dreduce(op, Map(f), itr).)
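For what it's worth, here is a minimal sketch of what a distributed mapreduce could look like on top of the `@distributed (op) for` form from the Distributed standard library. The name `dmapreduce` is hypothetical, not something Base or Distributed provides, and the sketch assumes that `itr` is indexable (a range or an array) and that `op` is associative.

```julia
using Distributed

# Hypothetical helper (not part of Base or Distributed).
# Each worker folds f(x) over its chunk of itr with op, and the per-chunk
# results are then combined with op on the calling process.
# Assumes op is associative and itr supports indexing (e.g. a range or Array).
function dmapreduce(f, op, itr)
    @distributed (op) for x in itr
        f(x)
    end
end

dmapreduce(x -> x^2, +, 1:10)  # sum of squares of 1:10
```

With no worker processes added, this just runs on the main process; call `addprocs` first for the work to actually be spread out.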

This is not generally true. A functional approach can have much better performance characteristics (see, e.g., [RFC/ANN] FLoops.jl: fast generic for loops (foldl for humans™)).

You need two accumulators to interact to get this. So, you can’t just independently annotate each reduction variable.
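To illustrate accumulators that interact, here is a small sketch that uses a findmax-style reduction as the example (that choice is my assumption, picked only for illustration). The helper name `combine` is hypothetical. The index accumulator is updated based on the outcome of the comparison on the value accumulator, so the pair has to be reduced as one unit.

```julia
# Combine two (value, index) pairs. Which index survives depends on the
# comparison of the values, so the two accumulators cannot be reduced
# independently. Ties keep the left (earlier) pair.
combine((v1, i1), (v2, i2)) = isless(v1, v2) ? (v2, i2) : (v1, i1)

xs = [3, 9, 2, 9, 5]

# Map each index to a (value, index) pair, then reduce the pairs together.
vmax, imax = mapreduce(i -> (xs[i], i), combine, eachindex(xs))
```

Since `combine` is associative, the same function can be passed as the reducing operator to a parallel reduction.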