If LoopVectorization doesn’t win a benchmark, that’s grounds for filing an issue.
E.g., here or here.
Doesn’t mean I’ll be able to address it immediately, but it’d be something for me to look into.
In this case, it should be fixed in LoopVectorization 0.12.8 (just released).
One important thing to note is that this benchmark:
julia> array = randn(1024); out = similar(array);
julia> @btime $out .= $array .* $array;
56.647 ns (0 allocations: 0 bytes)
is bottlenecked by loading from the two arrays: array and array. Of course, array === array here, so:
julia> @btime $out .= abs2.($array);
44.697 ns (0 allocations: 0 bytes)
Something that now works for LoopVectorization (as of 0.12.8) is wrapping the broadcast in a function. It now inlines the broadcasting code, so even though LoopVectorization has no way of knowing that array === array, LLVM does:
julia> using LoopVectorization
julia> @inline avxmul!(out,array) = @avx @. out = array * array
avxmul! (generic function with 1 method)
julia> @btime avxmul!($out, $array);
44.482 ns (0 allocations: 0 bytes)
julia> @btime @avx @. $out = abs2($array);
44.626 ns (0 allocations: 0 bytes)
But this doesn’t seem to work for base broadcasting:
julia> @inline basemul!(out, array) = @. out = array * array
basemul! (generic function with 1 method)
julia> @btime basemul!($out, $array);
57.162 ns (0 allocations: 0 bytes)
We can confirm this by looking at the asm. Base:
L1056:
vmovupd zmm0, zmmword ptr [rax + 8*rdi]
vmovupd zmm1, zmmword ptr [rax + 8*rdi + 64]
vmovupd zmm2, zmmword ptr [rax + 8*rdi + 128]
vmovupd zmm3, zmmword ptr [rax + 8*rdi + 192]
vmulpd zmm0, zmm0, zmmword ptr [rcx + 8*rdi]
vmulpd zmm1, zmm1, zmmword ptr [rcx + 8*rdi + 64]
vmulpd zmm2, zmm2, zmmword ptr [rcx + 8*rdi + 128]
vmulpd zmm3, zmm3, zmmword ptr [rcx + 8*rdi + 192]
vmovupd zmmword ptr [rdx + 8*rdi], zmm0
vmovupd zmmword ptr [rdx + 8*rdi + 64], zmm1
vmovupd zmmword ptr [rdx + 8*rdi + 128], zmm2
vmovupd zmmword ptr [rdx + 8*rdi + 192], zmm3
add rdi, 32
cmp rsi, rdi
jne L1056
Note that we load from one vector with vmovupd, and then vmulpd loads from a second vector while multiplying with the previously loaded vector.
@avx:
L48:
vmovupd zmm0, zmmword ptr [rdi + 8*rsi]
vmovupd zmm1, zmmword ptr [rdi + 8*rsi + 64]
vmovupd zmm2, zmmword ptr [rdi + 8*rsi + 128]
vmovupd zmm3, zmmword ptr [rdi + 8*rsi + 192]
vmulpd zmm0, zmm0, zmm0
vmulpd zmm1, zmm1, zmm1
vmulpd zmm2, zmm2, zmm2
vmulpd zmm3, zmm3, zmm3
vmovupd zmmword ptr [rdx + 8*rsi], zmm0
vmovupd zmmword ptr [rdx + 8*rsi + 64], zmm1
vmovupd zmmword ptr [rdx + 8*rsi + 128], zmm2
vmovupd zmmword ptr [rdx + 8*rsi + 192], zmm3
add rsi, 32
cmp rsi, rcx
jle L48
This is the same code the abs2 version produces.
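If you want to reproduce these listings yourself, something along these lines should work (a sketch assuming the definitions above; the exact output depends on your CPU and Julia/LLVM version):

julia> using InteractiveUtils  # provides @code_native (loaded by default in the REPL)

julia> @code_native debuginfo=:none basemul!(out, array)

julia> @code_native debuginfo=:none avxmul!(out, array)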
Anyway, my favorite benchmark, which runs each kernel over views of every length from 1 through 1024 in shuffled order:
@inline basemulabs2!(out, array) = @. out = abs2(array)
using Random
Ns = shuffle(axes(array,1));
# Apply f to views of the first n elements of out and array, for each n in Ns.
foreachn(f::F, out, array, Ns) where {F} = foreach(n -> f(view(out, Base.OneTo(n)), view(array, Base.OneTo(n))), Ns)
@btime foreachn(avxmul!, $out, $array, $Ns);
@btime foreachn(basemulabs2!, $out, $array, $Ns);
Yielding:
julia> @btime foreachn(avxmul!, $out, $array, $Ns);
26.862 μs (0 allocations: 0 bytes)
julia> @btime foreachn(basemulabs2!, $out, $array, $Ns);
36.207 μs (0 allocations: 0 bytes)
But this performance advantage is going to be much smaller without AVX512.
Anyway, yes, don’t use StaticArrays at these sizes.
They’re very unfriendly to the compiler, and using NTuples as arrays, or even using LLVM’s array type as arrays, doesn’t really work.
LLVM arrays require constant indices, which means there is no way to work with them other than to unroll everything. For example, Base.setindex:
julia> @code_typed Base.setindex((1,2,3,4), 3, 4)
CodeInfo(
1 ─ goto #8 if not $(Expr(:boundscheck))
2 ─ %2 = Base.sle_int(1, i)::Bool
└── goto #4 if not %2
3 ─ %4 = Base.sle_int(i, 4)::Bool
└── goto #5
4 ─ nothing::Nothing
5 ┄ %7 = φ (#3 => %4, #4 => false)::Bool
└── goto #7 if not %7
6 ─ goto #8
7 ─ %10 = Base.BoundsError(x, i)::Any
│ Base.throw(%10)::Union{}
└── unreachable
8 ┄ %13 = Core.getfield(x, 1)::Int64
│ %14 = Core.getfield(x, 2)::Int64
│ %15 = Core.getfield(x, 3)::Int64
│ %16 = Core.getfield(x, 4)::Int64
│ %17 = (i === 1)::Bool
│ %18 = Base.ifelse(%17, v, %13)::Int64
│ %19 = Base.sub_int(i, 1)::Int64
│ %20 = (%19 === 1)::Bool
│ %21 = Base.ifelse(%20, v, %14)::Int64
│ %22 = Base.sub_int(%19, 1)::Int64
│ %23 = (%22 === 1)::Bool
│ %24 = Base.ifelse(%23, v, %15)::Int64
│ %25 = Base.sub_int(%22, 1)::Int64
│ %26 = (%25 === 1)::Bool
│ %27 = Base.ifelse(%26, v, %16)::Int64
│ %28 = Core.tuple(%18, %21, %24, %27)::NTuple{4, Int64}
└── return %28
) => NTuple{4, Int64}
This works by creating a new tuple, using ifelse to select which element to replace.
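In other words, for a length-4 tuple the generated code is equivalent to a hand-unrolled version along these lines (my own sketch for illustration, not Base’s actual source):

# Hand-unrolled equivalent of Base.setindex for an NTuple{4}: build a new
# tuple, selecting v at position i via ifelse (no mutation anywhere).
setindex4(x::NTuple{4}, v, i::Integer) =
    (ifelse(i == 1, v, x[1]),
     ifelse(i == 2, v, x[2]),
     ifelse(i == 3, v, x[3]),
     ifelse(i == 4, v, x[4]))

julia> setindex4((1, 2, 3, 4), 3, 4)
(1, 2, 3, 3)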
Julia’s codegen can sometimes instead gain access to pointers to NTuples, but users (and base functions like Base.setindex) cannot, AFAIK.
The best we can do is wrap them in a mutable struct of some sort, which is what a StaticArrays.MVector is.
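For illustration, here is a minimal sketch of that wrapping idea (the name MutableWrap is made up for this example; StaticArrays.MVector is essentially this idea, implemented much more carefully):

# Wrap an NTuple in a mutable struct so we can define setindex!;
# each write just rebuilds the tuple field. MutableWrap is a made-up name.
mutable struct MutableWrap{N,T}
    data::NTuple{N,T}
end
Base.getindex(m::MutableWrap, i::Int) = m.data[i]
Base.setindex!(m::MutableWrap, v, i::Int) = (m.data = Base.setindex(m.data, v, i); m)

julia> m = MutableWrap((1, 2, 3, 4));

julia> m[4] = 3;

julia> m.data
(1, 2, 3, 3)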