How to perform parallel vector addition?

I find this interesting, as my impression was that since NumPy calls optimised C functions, Julia is unlikely to perform significantly better for simple vector or BLAS operations. Where would a 10x improvement come from? Is it from loop fusion, which avoids allocating the intermediate arrays that NumPy presumably creates for each operation?
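To make the question concrete, here is a minimal sketch of the allocation behaviour I have in mind on the NumPy side. The function names and array size are made up for illustration. As far as I understand, `a + b + c` creates a temporary for `a + b` and then a second array for the result, whereas passing `out=` reuses a preallocated buffer. Note that the `out=` version still makes two passes over memory, so it only removes the allocations; it is not true loop fusion, which would compute `a[i] + b[i] + c[i]` in a single pass.

```python
import numpy as np


def add_with_temporaries(a, b, c):
    # a + b allocates a temporary array; adding c allocates the result array.
    return a + b + c


def add_into_buffer(a, b, c, out):
    # Reuse a preallocated buffer: no new arrays are created,
    # but this is still two separate passes over the data.
    np.add(a, b, out=out)
    np.add(out, c, out=out)
    return out


n = 1_000_000  # arbitrary size chosen for illustration
rng = np.random.default_rng(0)
a, b, c = (rng.random(n) for _ in range(3))
out = np.empty_like(a)

r1 = add_with_temporaries(a, b, c)
r2 = add_into_buffer(a, b, c, out)
```

Timing the two functions (for example with `timeit`) should show how much of the cost is allocation; I would not expect that alone to account for 10x, which is why I am asking where the rest would come from.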