You might be correct.
No. As Oscar said, C * C' is already multithreaded because BLAS is multithreaded, so splitting the work like this shouldn’t give you a linear speedup.
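If you want to check this yourself, here is a rough sketch, assuming this is Julia with the standard LinearAlgebra library. The matrix size is made up for illustration, and the timings will depend on your machine and BLAS build. It compares the product with the default BLAS thread count against BLAS restricted to a single thread:

```julia
using LinearAlgebra

# Hypothetical test matrix; the size is an arbitrary choice for illustration.
C = rand(2000, 500)

# How many threads the BLAS backend is currently allowed to use.
nthreads_before = BLAS.get_num_threads()
println("BLAS threads: ", nthreads_before)

C * C'                       # warm-up call so compilation is not timed
t_multi = @elapsed C * C'    # product with the default BLAS thread count

BLAS.set_num_threads(1)      # restrict BLAS to one thread for comparison
t_single = @elapsed C * C'

BLAS.set_num_threads(nthreads_before)  # restore the original setting
println("multithreaded BLAS: ", t_multi, " s, single-threaded BLAS: ", t_single, " s")
```

If the first timing is already well below the second, BLAS is using your cores, and splitting the product by hand on top of that mostly competes for the same threads.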