Is the following the best way (minimizing allocations and maximizing speed) to get the minimum of the second column of a matrix?
minimum(@view x[:,2])
Seems reasonable, but I’m checking to see if I’m missing something.
Yeah. That’s good.
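For reference, the version without @view copies the column, which is easy to confirm with BenchmarkTools (a quick sketch; timings vary by machine):
using BenchmarkTools
x = rand(128, 3);
@btime minimum($x[:, 2]);        # slicing allocates a copy of the column
@btime minimum(@view $x[:, 2]);  # the view avoids the copy: 0 allocations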
using LoopVectorization, BenchmarkTools

function vminimum(x)
    acc = typemax(eltype(x))
    @turbo for i in eachindex(x)  # @turbo recognizes the min reduction and vectorizes/unrolls it
        acc = min(acc, x[i])
    end
    acc
end
x = rand(128,3);
@btime minimum(@view $x[:,2])
@btime vreduce(min, @view $x[:,2])
@btime vminimum(@view $x[:,2])
I get:
julia> @btime minimum(@view $x[:,2])
247.608 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> @btime vreduce(min, @view $x[:,2])
12.258 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> @btime vminimum(@view $x[:,2])
12.527 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> versioninfo()
Julia Version 1.8.0-DEV.310
Commit cb30aa7a08* (2021-08-06 03:11 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
Note that the plain minimum has much lower compile-time latency, though.
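That latency shows up on the first call in a fresh session; a rough way to see it (illustrative sketch, with vminimum as defined above):
# In a fresh session, the first call pays the compilation cost:
using LoopVectorization
x = rand(128, 3);
@time minimum(@view x[:, 2])   # Base method: little to compile
@time vminimum(@view x[:, 2])  # the @turbo kernel takes noticeably longer to compile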
Why is this faster than minimum? Is Julia just missing @inbounds?
LLVM doesn’t know how to SIMD min.
But @inbounds @simd does help a lot:
julia> function minimum_simd(x)
           acc = typemax(eltype(x))
           @inbounds @simd for i in eachindex(x)
               acc = Base.FastMath.min_fast(acc, x[i])
           end
           acc
       end
minimum_simd (generic function with 1 method)
julia> @btime minimum_simd(@view $x[:,2])
66.216 ns (0 allocations: 0 bytes)
0.010916262944919874
ASM:
vminimum:
L160:
        vminpd  zmm7, zmm7, zmmword ptr [rax + 8*rdx - 448]
        vminpd  zmm6, zmm6, zmmword ptr [rax + 8*rdx - 384]
        vminpd  zmm4, zmm4, zmmword ptr [rax + 8*rdx - 320]
        vminpd  zmm2, zmm2, zmmword ptr [rax + 8*rdx - 256]
        vminpd  zmm5, zmm5, zmmword ptr [rax + 8*rdx - 192]
        vminpd  zmm3, zmm3, zmmword ptr [rax + 8*rdx - 128]
        vminpd  zmm1, zmm1, zmmword ptr [rax + 8*rdx - 64]
        vminpd  zmm0, zmm0, zmmword ptr [rax + 8*rdx]
        add     rdx, 64
        cmp     rdi, rdx
        jne     L160
minimum_simd, which is not actually SIMD or unrolled:
L48:
        vminsd  xmm0, xmm0, qword ptr [rcx + 8*rdx]
        inc     rdx
        cmp     rax, rdx
        jne     L48
Thus the relative performance gap widens for larger arrays that are still small enough not to be memory-bound:
julia> x = rand(512,3);
julia> @btime minimum_simd(@view $x[:,2])
395.990 ns (0 allocations: 0 bytes)
0.002170067911323348
julia> @btime vminimum(@view $x[:,2])
23.335 ns (0 allocations: 0 bytes)
0.002170067911323348
We really should fix this. Does it require LLVM changes to fix?
julia> minimum_simd(x)
remark: simdloop.jl:75:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
remark: simdloop.jl:75:0: loop not vectorized
0.004311995252092693
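For anyone who wants to dig into this themselves, the generated IR and assembly can be inspected directly from the REPL (a sketch; output depends on Julia version and CPU):
using InteractiveUtils  # provides @code_llvm / @code_native
x = rand(512, 3);
@code_llvm debuginfo=:none minimum_simd(view(x, :, 2))    # LLVM IR for the reduction loop
@code_native debuginfo=:none minimum_simd(view(x, :, 2))  # native assembly, as quoted above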
I don’t know LLVM’s internals, but maybe there’s just a list of reductions somewhere, and min and max are missing an entry?
Godbolt, showing GCC can do it.
Could there be a VectorizedArray that implements @turbo versions of Base functions?
In this specific example, it seems like we could do it for regular old Arrays once we fix an LLVM bug. That said, in general, this should be pretty easy to write.
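As a rough illustration, a minimal sketch of such a wrapper; the VectorizedArray type here is hypothetical, not an existing package:
using LoopVectorization

# Hypothetical wrapper that routes Base reductions through LoopVectorization:
struct VectorizedArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end

Base.size(v::VectorizedArray) = size(v.data)
Base.@propagate_inbounds Base.getindex(v::VectorizedArray, i::Int...) = v.data[i...]

# Reductions dispatch to LoopVectorization's vreduce:
Base.minimum(v::VectorizedArray) = vreduce(min, v.data)
Base.maximum(v::VectorizedArray) = vreduce(max, v.data)
Base.sum(v::VectorizedArray)     = vreduce(+, v.data)
One catch: minimum(@view v[:, 2]) dispatches on the SubArray type rather than the wrapper, so views need to be covered too, which is part of what makes doing this properly more involved.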
StrideArrays.jl is still a bit of a WIP, but it’s trying to do that:
julia> using StrideArrays
julia> x = StrideArray{Float64}(undef, (128, 3)); x .= rand.();  # fully dynamic size
julia> xstatic = @StrideArray rand(128,3);  # statically sized
julia> xpartiallystatic = StrideArray{Float64}(undef, (StaticInt(128), 3)); xpartiallystatic .= rand.();  # static number of rows only
julia> y = rand(128,3);  # plain Base Array for comparison
julia> @btime minimum(@view $y[:,2])
355.231 ns (0 allocations: 0 bytes)
0.017097150232982306
julia> @btime minimum(@view $x[:,2])
16.872 ns (0 allocations: 0 bytes)
0.4937004891138578
julia> @btime minimum(@view $xstatic[:,2])
11.496 ns (0 allocations: 0 bytes)
0.00474131798286026
julia> @btime minimum(@view $xpartiallystatic[:,2])
18.281 ns (0 allocations: 0 bytes)
0.7472262476315941
Still looking at LLVM src, but so far I found this for AArch64:
bool AArch64TTIImpl::isLegalToVectorizeReduction(
    const RecurrenceDescriptor &RdxDesc, ElementCount VF) const {
  if (!VF.isScalable())
    return true;

  Type *Ty = RdxDesc.getRecurrenceType();
  if (Ty->isBFloatTy() || !isElementTypeLegalForScalableVector(Ty))
    return false;

  switch (RdxDesc.getRecurrenceKind()) {
  case RecurKind::Add:
  case RecurKind::FAdd:
  case RecurKind::And:
  case RecurKind::Or:
  case RecurKind::Xor:
  case RecurKind::SMin:
  case RecurKind::SMax:
  case RecurKind::UMin:
  case RecurKind::UMax:
  case RecurKind::FMin:
  case RecurKind::FMax:
    return true;
  default:
    return false;
  }
}
So FMin and FMax are listed among the legal reductions.
It doesn’t vectorize on my M1.
L36:
        ldr     d1, [x9], #8
        fcmp    d0, d1
        fcsel   d0, d0, d1, mi
        subs    x8, x8, #1  ; =1
        b.ne    L36
Actually tracking this down efficiently would probably require something like gdb to step through what LLVM is doing.