If LoopVectorization doesn’t win a benchmark, that’s grounds for filing an issue.
E.g., here or here.
Doesn’t mean I’ll be able to address it immediately, but it’d be something for me to look into.
In this case, it should be fixed in LoopVectorization 0.12.8 (just released).
One important thing to note is that this benchmark:
julia> array = randn(1024); out = similar(array);
julia> @btime $out .= $array .* $array;
56.647 ns (0 allocations: 0 bytes)
is bottlenecked by loading from the two array operands: array and array.
Of course, array === array here, so each element only needs to be loaded once:
julia> @btime $out .= abs2.($array);
44.697 ns (0 allocations: 0 bytes)
Something that works for LoopVectorization (as of 0.12.8) is wrapping the broadcast in a function.
It now inlines the broadcasting code, so even though LoopVectorization has no way of knowing array === array, LLVM does:
julia> using LoopVectorization
julia> @inline avxmul!(out,array) = @avx @. out = array * array
avxmul! (generic function with 1 method)
julia> @btime avxmul!($out, $array);
44.482 ns (0 allocations: 0 bytes)
julia> @btime @avx @. $out = abs2($array);
44.626 ns (0 allocations: 0 bytes)
But this doesn’t seem to work for base broadcasting:
julia> @inline basemul!(out, array) = @. out = array * array
basemul! (generic function with 1 method)
julia> @btime basemul!($out, $array);
57.162 ns (0 allocations: 0 bytes)
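As an aside, assembly like the listings below can be inspected with something along these lines (the exact invocation is my assumption, not part of the original benchmarks):
julia> @code_native debuginfo=:none syntax=:intel basemul!(out, array)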
We can confirm this by looking at the asm. Base:
L1056:
vmovupd zmm0, zmmword ptr [rax + 8*rdi]
vmovupd zmm1, zmmword ptr [rax + 8*rdi + 64]
vmovupd zmm2, zmmword ptr [rax + 8*rdi + 128]
vmovupd zmm3, zmmword ptr [rax + 8*rdi + 192]
vmulpd zmm0, zmm0, zmmword ptr [rcx + 8*rdi]
vmulpd zmm1, zmm1, zmmword ptr [rcx + 8*rdi + 64]
vmulpd zmm2, zmm2, zmmword ptr [rcx + 8*rdi + 128]
vmulpd zmm3, zmm3, zmmword ptr [rcx + 8*rdi + 192]
vmovupd zmmword ptr [rdx + 8*rdi], zmm0
vmovupd zmmword ptr [rdx + 8*rdi + 64], zmm1
vmovupd zmmword ptr [rdx + 8*rdi + 128], zmm2
vmovupd zmmword ptr [rdx + 8*rdi + 192], zmm3
add rdi, 32
cmp rsi, rdi
jne L1056
Note that we load from one array with vmovupd, and then vmulpd loads from the second array while multiplying with the previously loaded values.
@avx:
L48:
vmovupd zmm0, zmmword ptr [rdi + 8*rsi]
vmovupd zmm1, zmmword ptr [rdi + 8*rsi + 64]
vmovupd zmm2, zmmword ptr [rdi + 8*rsi + 128]
vmovupd zmm3, zmmword ptr [rdi + 8*rsi + 192]
vmulpd zmm0, zmm0, zmm0
vmulpd zmm1, zmm1, zmm1
vmulpd zmm2, zmm2, zmm2
vmulpd zmm3, zmm3, zmm3
vmovupd zmmword ptr [rdx + 8*rsi], zmm0
vmovupd zmmword ptr [rdx + 8*rsi + 64], zmm1
vmovupd zmmword ptr [rdx + 8*rsi + 128], zmm2
vmovupd zmmword ptr [rdx + 8*rsi + 192], zmm3
add rsi, 32
cmp rsi, rcx
jle L48
This is the same code the abs2 version produces.
Anyway, my favorite benchmark:
@inline basemulabs2!(out, array) = @. out = abs2(array)
using Random
Ns = shuffle(axes(array,1));
foreachn(f::F, out, array, Ns) where {F} = foreach(n -> f(view(out, Base.OneTo(n)), view(array, Base.OneTo(n))), Ns)
@btime foreachn(avxmul!, $out, $array, $Ns);
@btime foreachn(basemulabs2!, $out, $array, $Ns);
Yielding:
julia> @btime foreachn(avxmul!, $out, $array, $Ns);
26.862 μs (0 allocations: 0 bytes)
julia> @btime foreachn(basemulabs2!, $out, $array, $Ns);
36.207 μs (0 allocations: 0 bytes)
But this performance advantage is going to be much smaller without AVX512.
Anyway, yes, don’t use StaticArrays at these sizes.
They’re very unfriendly to the compiler, and using NTuples as arrays, or even using LLVM’s array type as arrays, doesn’t really work.
LLVM arrays require constant indices, which means that there is no way to work with them other than to unroll everything. For example, Base.setindex:
julia> @code_typed Base.setindex((1,2,3,4), 3, 4)
CodeInfo(
1 ─ goto #8 if not $(Expr(:boundscheck))
2 ─ %2 = Base.sle_int(1, i)::Bool
└── goto #4 if not %2
3 ─ %4 = Base.sle_int(i, 4)::Bool
└── goto #5
4 ─ nothing::Nothing
5 ┄ %7 = φ (#3 => %4, #4 => false)::Bool
└── goto #7 if not %7
6 ─ goto #8
7 ─ %10 = Base.BoundsError(x, i)::Any
│ Base.throw(%10)::Union{}
└── unreachable
8 ┄ %13 = Core.getfield(x, 1)::Int64
│ %14 = Core.getfield(x, 2)::Int64
│ %15 = Core.getfield(x, 3)::Int64
│ %16 = Core.getfield(x, 4)::Int64
│ %17 = (i === 1)::Bool
│ %18 = Base.ifelse(%17, v, %13)::Int64
│ %19 = Base.sub_int(i, 1)::Int64
│ %20 = (%19 === 1)::Bool
│ %21 = Base.ifelse(%20, v, %14)::Int64
│ %22 = Base.sub_int(%19, 1)::Int64
│ %23 = (%22 === 1)::Bool
│ %24 = Base.ifelse(%23, v, %15)::Int64
│ %25 = Base.sub_int(%22, 1)::Int64
│ %26 = (%25 === 1)::Bool
│ %27 = Base.ifelse(%26, v, %16)::Int64
│ %28 = Core.tuple(%18, %21, %24, %27)::NTuple{4, Int64}
└── return %28
) => NTuple{4, Int64}
It works by creating a new tuple, using ifelse to select which element to replace.
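Roughly, that lowered code corresponds to a hand-written version like the following sketch for the length-4 case (setindex_sketch is a hypothetical name for illustration, not the actual Base definition):
# Build an entirely new tuple, picking the new value v at position i via ifelse
# and keeping the old element everywhere else.
setindex_sketch(x::NTuple{4,Int}, v, i) = (
    ifelse(i == 1, v, x[1]),
    ifelse(i == 2, v, x[2]),
    ifelse(i == 3, v, x[3]),
    ifelse(i == 4, v, x[4]),
)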
Julia’s codegen can sometimes instead gain access to pointers to NTuples, but users, and Base functions like Base.setindex, cannot, AFAIK.
The best we can do is wrap them in a mutable struct of some sort, which is what a StaticArrays.MVector is.
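As a rough sketch of the idea (MutableWrap is a hypothetical name; this mirrors the concept behind MVector, not its actual definition): wrapping the NTuple in a mutable struct gives the object a stable address, so setindex! can write through a pointer instead of rebuilding the whole tuple with ifelse.
mutable struct MutableWrap{N,T}
    data::NTuple{N,T}
end
Base.getindex(m::MutableWrap, i::Int) = m.data[i]
function Base.setindex!(m::MutableWrap{N,T}, v, i::Int) where {N,T}
    @boundscheck 1 <= i <= N || throw(BoundsError(m, i))
    # The isbits tuple is stored inline in the mutable struct, so we can
    # store directly through the object's pointer.
    GC.@preserve m unsafe_store!(convert(Ptr{T}, pointer_from_objref(m)), convert(T, v), i)
    return m
end

m = MutableWrap((1, 2, 3, 4));
m[4] = 30;  # m.data is now (1, 2, 3, 30)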