Inefficient parallelization? Need some help optimizing a simple dot product

Oh wow, that’s surprising to me. Some results on my machine (Julia 0.6.2, Linux):

Julia (no threads):     Trial(35.786 μs)

BLAS (1 threads):       Trial(35.793 μs)
Julia (1 threads):      Trial(36.234 μs)

BLAS (2 threads):       Trial(35.731 μs)
Julia (2 threads):      Trial(19.571 μs)

BLAS (4 threads):       Trial(35.775 μs)
Julia (4 threads):      Trial(10.698 μs)

BLAS (8 threads):       Trial(35.772 μs)
Julia (8 threads):      Trial(5.150 μs)

BLAS (16 threads):      Trial(35.782 μs)
Julia (16 threads):     Trial(4.514 μs)

BLAS (32 threads):      Trial(35.760 μs)
Julia (32 threads):     Trial(4.050 μs)

BLAS (64 threads):      Trial(35.819 μs)
Julia (64 threads):     Trial(4.122 μs)

Edit: added more results.
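
For anyone who wants to try something similar, here is a minimal sketch of the kind of threaded dot product the "Julia (N threads)" rows could correspond to. It is not the exact code from this thread: the name `threaded_dot` and the chunking scheme are my own assumptions, and it is written for Julia 1.x (hence `using LinearAlgebra`) rather than the 0.6.2 used for the timings above. It splits the index range into one contiguous chunk per thread, accumulates a partial sum per chunk, and adds the partial sums at the end.

```julia
using LinearAlgebra   # dot and BLAS (Julia 1.x; these lived in Base on 0.6)
using Base.Threads    # @threads, nthreads

# Hypothetical helper, not taken from the thread: a chunked, multithreaded
# dot product for real-valued vectors.
function threaded_dot(x::Vector{T}, y::Vector{T}) where {T<:Real}
    length(x) == length(y) || throw(DimensionMismatch("vectors must have equal length"))
    n = length(x)
    nchunks = max(1, min(nthreads(), n))   # never more chunks than elements
    partials = zeros(T, nchunks)           # one partial sum per chunk
    @threads for c in 1:nchunks
        lo = div((c - 1) * n, nchunks) + 1 # first index of chunk c
        hi = div(c * n, nchunks)           # last index of chunk c
        s = zero(T)
        @inbounds @simd for i in lo:hi
            s += x[i] * y[i]
        end
        partials[c] = s
    end
    return sum(partials)
end

# Usage: the vector length here is arbitrary, not the size used for the timings above.
x = rand(100_000)
y = rand(100_000)
println(threaded_dot(x, y) ≈ dot(x, y))
```

The number of Julia threads is fixed at startup via the `JULIA_NUM_THREADS` environment variable, while the BLAS thread count is set separately with `BLAS.set_num_threads(n)`. To get timings like the ones above you would wrap both calls in `@benchmark` from BenchmarkTools.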
