Benchmark MATLAB & Julia for Matrix Operations

Again, I think both of your systems are memory bound, since there is no way a single thread in GEMM is as fast as 4 threads.
You need to understand that.
It is no surprise that on my machine, with quad-channel memory, you see scaling with threading (though modern CPUs can get better bandwidth than my machine even in a dual-channel configuration).

Also, 7 runs are very stable. When I built this I tried many numbers, and actually even 5 works great.
You need to understand that @btime doesn't do anything magical (nor does MATLAB's timeit(), which internally just does multiple runs using tic() and toc()). It calls the same CPU timers.
I prefer to do that manually and, as you can see in @jling's answer above, since I'm not doing it in global scope, the results are correct and reasonable.
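To make the manual approach concrete, here is a minimal sketch of what I mean, not my actual benchmark code. The names manual_benchmark and gemm_scaling are hypothetical, keeping the fastest of the 7 runs is an assumption (a median would work as well), and the 1000 x 1000 default size is arbitrary. Everything runs inside functions, so nothing is timed in global scope.

```julia
using LinearAlgebra  # provides mul! and BLAS thread control

# Hypothetical helper: call `f()` `numRuns` times and return the fastest
# run in seconds. Taking the minimum is an assumption of this sketch.
function manual_benchmark(f, numRuns::Int = 7)
    f()  # warm-up call so JIT compilation is not part of the measurement
    runTimes = Vector{Float64}(undef, numRuns)
    for ii in 1:numRuns
        startTime = time_ns()  # same kind of timer @btime relies on
        f()
        runTimes[ii] = (time_ns() - startTime) / 1e9
    end
    return minimum(runTimes)
end

# Hypothetical driver: time GEMM with 1 BLAS thread, then with the thread
# count that was configured before the call.
function gemm_scaling(matrixSize::Int = 1000)
    mA = randn(matrixSize, matrixSize)
    mB = randn(matrixSize, matrixSize)
    mC = similar(mA)  # preallocated output so allocation is not timed

    prevThreads = BLAS.get_num_threads()
    BLAS.set_num_threads(1)
    singleTime = manual_benchmark(() -> mul!(mC, mA, mB))
    BLAS.set_num_threads(prevThreads)  # restore the original setting
    multiTime = manual_benchmark(() -> mul!(mC, mA, mB))

    return singleTime, multiTime
end
```

Calling gemm_scaling() returns the single-threaded and multi-threaded times in seconds. If the two are nearly equal for a large matrix, that is the missing scaling I am asking you to explain.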

Again, it is you who has to explain how GEMM, the most optimized function in history, which scales beautifully with threads, shows no scaling in your tests. Are you suggesting that the OpenBLAS people created a function 4 times faster than the Intel guys (who spent hundreds of work-years on this)? Come on…

You need to find better arguments to back up those results.