Oh wow, that’s surprising to me. Some results on my machine (0.6.2, Linux):
Julia (no threads): Trial(35.786 μs)
BLAS (1 threads): Trial(35.793 μs)
Julia (1 threads): Trial(36.234 μs)
BLAS (2 threads): Trial(35.731 μs)
Julia (2 threads): Trial(19.571 μs)
BLAS (4 threads): Trial(35.775 μs)
Julia (4 threads): Trial(10.698 μs)
BLAS (8 threads): Trial(35.772 μs)
Julia (8 threads): Trial(5.150 μs)
BLAS (16 threads): Trial(35.782 μs)
Julia (16 threads): Trial(4.514 μs)
BLAS (32 threads): Trial(35.760 μs)
Julia (32 threads): Trial(4.050 μs)
BLAS (64 threads): Trial(35.819 μs)
Julia (64 threads): Trial(4.122 μs)
Edit: added more results.