Why is BLAS dot product so much faster than Julia loop?

I asked a very similar question that drew a lot of nice answers:

Simple Mat-Vec multiply (understanding performance, without the bugs)

My favorite by far was to use @tullio to avoid writing loops at all; you just use Einstein tensor notation:

    return @tullio s := x[i] * y[i]
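For reference, here is a minimal self-contained sketch of that idea (the function name `mydot` and the optional LoopVectorization import are my choices, not from the linked answer). `@tullio s := x[i] * y[i]` sums over the repeated index `i`, since it does not appear on the left-hand side, and `:=` creates the new scalar `s`:

    using Tullio
    using LoopVectorization  # optional: if loaded, @tullio emits SIMD-vectorized loops

    # Einstein notation: i appears only on the right, so @tullio sums over it;
    # := creates the new scalar result s.
    mydot(x, y) = @tullio s := x[i] * y[i]

    x = rand(Float64, 1_000_000)
    y = rand(Float64, 1_000_000)

    @assert mydot(x, y) ≈ sum(x .* y)  # sanity check against base Julia

The nice part is that the same notation scales to matrix-vector products and beyond (e.g. `@tullio z[i] := A[i, j] * y[j]`), while Tullio handles the loop nesting, threading, and vectorization for you.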