Not to put too fine a point on it, but this is exactly what you were told in the first thread you opened on this:
The performance of parallel code can change drastically depending on all sorts of optimizations you can make to its simple serial execution, so you really need to make sure your code is as efficient (and allocation-free) as possible before you run elaborate experiments on supercomputers.
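
To make that concrete, here is a rough sketch of the kind of serial check I mean, done before anything goes near a cluster. It is written in Python purely for illustration (your code may well be in another language, and the function names here are made up); the point is only to compare time and peak memory of two serial variants of the same computation.

```python
import time
import tracemalloc


def sum_squares_allocating(values):
    # Hypothetical kernel: builds a temporary list before summing (O(n) extra memory).
    squares = [v * v for v in values]
    return sum(squares)


def sum_squares_streaming(values):
    # Same result, but accumulates in one variable with no temporary container.
    total = 0.0
    for v in values:
        total += v * v
    return total


def profile(func, values):
    """Return (result, elapsed_seconds, peak_bytes) for a serial call of func."""
    # Time the call without tracing, since tracemalloc adds overhead.
    start = time.perf_counter()
    result = func(values)
    elapsed = time.perf_counter() - start

    # Run it again under tracemalloc to record peak memory allocated by the call.
    tracemalloc.start()
    func(values)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


data = [float(i) for i in range(100_000)]
res_alloc, t_alloc, peak_alloc = profile(sum_squares_allocating, data)
res_stream, t_stream, peak_stream = profile(sum_squares_streaming, data)

print(f"allocating: {t_alloc:.4f} s, peak {peak_alloc} bytes")
print(f"streaming:  {t_stream:.4f} s, peak {peak_stream} bytes")
```

If the serial version already shows large temporary allocations like the first variant, those will typically hurt scaling once many threads or processes compete for memory bandwidth, so it is worth fixing them first.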