Inefficient parallelization? Need some help optimizing a simple dot product

Just a guess, but maybe this is JuliaLang/julia issue #15276 ("performance of captured variables in closures") once again, since the @threads macro creates a closure. See also "Parallelizing for loop in the computation of a gradient" (post #7 by tkoolen). Check the @code_warntype output of your function: a captured variable that got boxed shows up there as Core.Box.
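Since I can't see your exact code, here is only a minimal sketch of how I would write it to stay clear of that issue. The name `dot_threaded` and the chunking scheme are my own assumptions, not taken from your post. The point is that the accumulator `acc` is created inside the loop body, so the closure does not capture a variable that is reassigned outside it, and each chunk writes to its own slot of `partials`.

```julia
using Base.Threads

# Hypothetical helper: threaded dot product with one partial sum per chunk.
function dot_threaded(x::AbstractVector{T}, y::AbstractVector{T}) where {T}
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    n = length(x)
    nchunks = max(1, min(n, nthreads()))
    partials = zeros(T, nchunks)          # one slot per chunk, never reassigned
    @threads for c in 1:nchunks
        lo = div((c - 1) * n, nchunks) + 1
        hi = div(c * n, nchunks)
        acc = zero(T)                     # local to the loop body, so it is not a captured variable
        @inbounds @simd for i in lo:hi
            acc += x[i] * y[i]
        end
        partials[c] = acc
    end
    return sum(partials)
end
```

Start Julia with several threads (for example `julia -t 4`) and run `@code_warntype dot_threaded(x, y)` on your own vectors. If your original version shows `Core.Box` and this one does not, the captured variable was probably the culprit.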