The inner loop might see some speedup from threading, although threading rarely makes a huge difference for code that allocates a lot of memory. Try applying the insight you gained from your other question here (I think you can).
BTW, shouldn't Q*rand be calling randn instead?
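To make the rand vs. randn point concrete, here is a minimal sketch, assuming this is Julia and that the vector multiplied by Q is meant to be Gaussian noise (I can't tell that for sure from the snippet). rand draws uniformly from [0, 1), so its samples have mean 1/2 and variance 1/12, while randn draws from a standard normal with mean 0 and variance 1. The names n, u and z are just placeholders for illustration.

```julia
using Statistics  # mean, var

n = 100_000       # sample size, chosen arbitrarily for the illustration

u = rand(n)       # uniform on [0, 1): mean 1/2, variance 1/12
z = randn(n)      # standard normal: mean 0, variance 1

println("rand:  mean = ", mean(u), ", var = ", var(u))
println("randn: mean = ", mean(z), ", var = ", var(z))
```

If Q is supposed to scale or correlate zero-mean noise, the nonzero mean of rand would shift every sample, which is probably not what you want.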