Improving performance for matrix assembly

I agree with this in general. In this particular case, though, the intention was clearly to build the matrix block by block, where the advantage would be smaller, and the matrix had to be pre-allocated anyway.
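For anyone reading along, here is a rough sketch of what I mean by block-by-block assembly into a pre-allocated matrix. The thread does not say which language or library is in use, so NumPy is my assumption, and the function name `assemble_blocks` and the `(row, col, block)` layout are made up for illustration:

```python
import numpy as np

def assemble_blocks(shape, placed_blocks):
    """Write dense blocks into a matrix that is allocated once up front.

    placed_blocks is a hypothetical list of (row, col, block) tuples, where
    (row, col) is the top-left position of the block in the full matrix.
    """
    A = np.zeros(shape)  # single allocation; nothing grows inside the loop
    for row, col, block in placed_blocks:
        h, w = block.shape
        # One slice assignment copies the whole block, so the per-entry
        # overhead is already small compared with element-by-element filling.
        A[row:row + h, col:col + w] = block
    return A

# Illustrative data only: two diagonal blocks and one off-diagonal block.
placed = [
    (0, 0, np.full((2, 2), 1.0)),
    (2, 2, np.full((3, 3), 2.0)),
    (0, 2, np.full((2, 3), 0.5)),
]
A = assemble_blocks((5, 5), placed)
```

Since each block goes in with one assignment, there are only as many write operations as there are blocks, which is why I would expect less to gain here than when filling entry by entry.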