Performance of norm function

Hah, thanks!

So there's no reason to make a PR: this only affects tiny vectors, and people who care about speed on tiny vectors presumably use SVector already.

That being said, the obvious fast variant still outperforms BLAS on my system (which is presumably a misconfiguration or build issue).

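julia> using LinearAlgebra, BenchmarkTools

julia> A = rand(10_000);

julia> function foo2(A)
           x = zero(eltype(A))
           # Plain sum of squares; @simd and @fastmath let the compiler reorder the additions and vectorize
           @inbounds @simd for v in A
               @fastmath x += v * v
           end
           @fastmath sqrt(x)
       end
foo2 (generic function with 1 method)
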
julia> @btime norm($A)
  4.643 μs (0 allocations: 0 bytes)
57.51021904090062

julia> @btime foo2($A)
  1.376 μs (0 allocations: 0 bytes)
57.51021904090062
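
One caveat on treating the two as interchangeable: a plain sum of squares squares each entry before summing, so it can overflow or underflow on inputs where norm still returns a sensible value. Below is a minimal sketch of that difference, assuming Float64 input. `naive_norm`, `huge` and `small` are made-up names for illustration; `naive_norm` does the same arithmetic as foo2 but leaves out @simd and @fastmath so the overflow behaviour is well defined. The vectors are kept short on purpose, which as far as I know means norm takes its generic (non-BLAS) path here.

```julia
using LinearAlgebra

# Hypothetical helper: same arithmetic as foo2, without @simd/@fastmath.
naive_norm(A) = sqrt(sum(abs2, A))

huge  = fill(1e200, 4)    # each square exceeds floatmax(Float64)
small = fill(1e-200, 4)   # each square underflows to zero

# The naive version loses the result in both cases; norm rescales internally.
println("huge:  naive = ", naive_norm(huge),  ", norm = ", norm(huge))
println("small: naive = ", naive_norm(small), ", norm = ", norm(small))
```

For data like rand(10_000) none of this matters, so it does not change the timings above; it is only a reason the library routine may be doing a bit more work than the obvious loop.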