If LoopVectoriztion doesn’t win a benchmark, that’s grounds for filing an issue. 
E.g., here or here.
Doesn’t mean I’ll be able to address it immediately, but it’d be something for me to look into.
In this case, it should be fixed in LoopVectorization 0.12.8 (just released).
One important thing to note is that this benchmark:
julia> array = randn(1024); out = similar(array);
julia> @btime $out .= $array .* $array;
  56.647 ns (0 allocations: 0 bytes)
is bottlenecked by loading from the two arrays: array and array.
Of course, array === array here, so:
julia> @btime $out .= abs2.($array);
  44.697 ns (0 allocations: 0 bytes)
Something that (now as of 0.12.8) works for LoopVectorization is wrapping them in a function.
It now inlines the broadcasting code, so even though LoopVectorization has no way of knowing array === array, LLVM knows that:
julia> using LoopVectorization
julia> @inline avxmul!(out,array) = @avx @. out = array * array
avxmul! (generic function with 1 method)
julia> @btime avxmul!($out, $array);
  44.482 ns (0 allocations: 0 bytes)
julia> @btime @avx @. $out = abs2($array);
  44.626 ns (0 allocations: 0 bytes)
But this doesn’t seem to work for base broadcasting:
julia> @inline basemul!(out, array) = @. out = array * array
basemul! (generic function with 1 method)
julia> @btime basemul!($out, $array);
  57.162 ns (0 allocations: 0 bytes)
Which we can confirm by looking at the asm, base:
L1056:
        vmovupd zmm0, zmmword ptr [rax + 8*rdi]
        vmovupd zmm1, zmmword ptr [rax + 8*rdi + 64]
        vmovupd zmm2, zmmword ptr [rax + 8*rdi + 128]
        vmovupd zmm3, zmmword ptr [rax + 8*rdi + 192]
        vmulpd  zmm0, zmm0, zmmword ptr [rcx + 8*rdi]
        vmulpd  zmm1, zmm1, zmmword ptr [rcx + 8*rdi + 64]
        vmulpd  zmm2, zmm2, zmmword ptr [rcx + 8*rdi + 128]
        vmulpd  zmm3, zmm3, zmmword ptr [rcx + 8*rdi + 192]
        vmovupd zmmword ptr [rdx + 8*rdi], zmm0
        vmovupd zmmword ptr [rdx + 8*rdi + 64], zmm1
        vmovupd zmmword ptr [rdx + 8*rdi + 128], zmm2
        vmovupd zmmword ptr [rdx + 8*rdi + 192], zmm3
        add     rdi, 32
        cmp     rsi, rdi
        jne     L1056
Note we load from one vector with vmovupd, and then vmulpd loads from a second vector while multiplying with the previously loaded vector.
@avx:
L48:
        vmovupd zmm0, zmmword ptr [rdi + 8*rsi]
        vmovupd zmm1, zmmword ptr [rdi + 8*rsi + 64]
        vmovupd zmm2, zmmword ptr [rdi + 8*rsi + 128]
        vmovupd zmm3, zmmword ptr [rdi + 8*rsi + 192]
        vmulpd  zmm0, zmm0, zmm0
        vmulpd  zmm1, zmm1, zmm1
        vmulpd  zmm2, zmm2, zmm2
        vmulpd  zmm3, zmm3, zmm3
        vmovupd zmmword ptr [rdx + 8*rsi], zmm0
        vmovupd zmmword ptr [rdx + 8*rsi + 64], zmm1
        vmovupd zmmword ptr [rdx + 8*rsi + 128], zmm2
        vmovupd zmmword ptr [rdx + 8*rsi + 192], zmm3
        add     rsi, 32
        cmp     rsi, rcx
        jle     L48
Which is the same code abs2 produces.
Anyway, my favorite benchmark:
@inline basemulabs2!(out, array) = @. out = abs2(array)
using Random
Ns = shuffle(axes(array,1));
foreachn(f::F, out, array, Ns) where {F} = foreach(n -> f(view(out, Base.OneTo(n)), view(array, Base.OneTo(n))), Ns)
@btime foreachn(avxmul!, $out, $array, $Ns);
@btime foreachn(basemulabs2!, $out, $array, $Ns);
Yielding:
julia> @btime foreachn(avxmul!, $out, $array, $Ns);
  26.862 μs (0 allocations: 0 bytes)
julia> @btime foreachn(basemulabs2!, $out, $array, $Ns);
  36.207 μs (0 allocations: 0 bytes)
But this performance advantage is going to be much smaller without AVX512.
Anyway, yes, don’t use StaticArrays at these sizes.
They’re very unfriendly to the compiler, and using NTuples as arrays, or even using LLVM’s array type as arrays, doesn’t really work.
LLVM arrays require constant indices, which means that there is no way to work with them other than to unroll everything. For example, Base.setindex:
julia> @code_typed Base.setindex((1,2,3,4), 3, 4)
CodeInfo(
1 ─       goto #8 if not $(Expr(:boundscheck))
2 ─ %2  = Base.sle_int(1, i)::Bool
└──       goto #4 if not %2
3 ─ %4  = Base.sle_int(i, 4)::Bool
└──       goto #5
4 ─       nothing::Nothing
5 ┄ %7  = φ (#3 => %4, #4 => false)::Bool
└──       goto #7 if not %7
6 ─       goto #8
7 ─ %10 = Base.BoundsError(x, i)::Any
│         Base.throw(%10)::Union{}
└──       unreachable
8 ┄ %13 = Core.getfield(x, 1)::Int64
│   %14 = Core.getfield(x, 2)::Int64
│   %15 = Core.getfield(x, 3)::Int64
│   %16 = Core.getfield(x, 4)::Int64
│   %17 = (i === 1)::Bool
│   %18 = Base.ifelse(%17, v, %13)::Int64
│   %19 = Base.sub_int(i, 1)::Int64
│   %20 = (%19 === 1)::Bool
│   %21 = Base.ifelse(%20, v, %14)::Int64
│   %22 = Base.sub_int(%19, 1)::Int64
│   %23 = (%22 === 1)::Bool
│   %24 = Base.ifelse(%23, v, %15)::Int64
│   %25 = Base.sub_int(%22, 1)::Int64
│   %26 = (%25 === 1)::Bool
│   %27 = Base.ifelse(%26, v, %16)::Int64
│   %28 = Core.tuple(%18, %21, %24, %27)::NTuple{4, Int64}
└──       return %28
) => NTuple{4, Int64}
Works by creating a new tuple, using ifelse to select which one to replace.
Julia’s codegen can sometimes instead gain access to pointers to NTuples, but users – and base functions like Base.setindex cannot AFAIK.
The best we can do is wrapping them in a mutable struct of some sort, which is what a StaticArrays.MVector is.