Is the following the best way (minimizing allocations and maximizing speed) to get the minimum of the second column of a matrix?
minimum(@view x[:,2])
Seems reasonable, but I'm checking to see whether I'm missing something.
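For context, the @view matters because plain slicing copies the column. A quick sanity check (a hypothetical session; timings and allocation counts will vary by machine):

julia> using BenchmarkTools

julia> x = rand(128, 3);

julia> @btime minimum($x[:, 2]);       # slicing allocates a copy of the column

julia> @btime minimum(@view $x[:, 2]); # the view allocates nothing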
Yeah. That’s good.
using LoopVectorization, BenchmarkTools
function vminimum(x)
    acc = typemax(eltype(x))
    @turbo for i in eachindex(x)
        acc = min(acc, x[i])
    end
    acc
end
x = rand(128,3);
@btime minimum(@view $x[:,2])
@btime vreduce(min, @view $x[:,2])
@btime vminimum(@view $x[:,2])
I get:
julia> @btime minimum(@view $x[:,2])
247.608 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> @btime vreduce(min, @view $x[:,2])
12.258 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> @btime vminimum(@view $x[:,2])
12.527 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> versioninfo()
Julia Version 1.8.0-DEV.310
Commit cb30aa7a08* (2021-08-06 03:11 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
minimum has much lower compile-time latency, though.
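A rough way to see that trade-off is to time the first call to each, which includes compilation (a sketch, not measured output; actual latencies depend on machine and package versions):

julia> @time minimum(@view x[:, 2]);   # Base method: compiles almost instantly

julia> @time vminimum(@view x[:, 2]);  # @turbo method: pays LoopVectorization's compile cost on first use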
Why is this faster than minimum? Is Julia just missing @inbounds?
LLVM doesn't know how to SIMD min. But @inbounds does help a lot:
julia> function minimum_simd(x)
           acc = typemax(eltype(x))
           @inbounds @simd for i in eachindex(x)
               acc = Base.FastMath.min_fast(acc, x[i])
           end
           acc
       end
minimum_simd (generic function with 1 method)
julia> @btime minimum_simd(@view $x[:,2])
66.216 ns (0 allocations: 0 bytes)
0.010916262944919874
The ASM for vminimum:
L160:
vminpd zmm7, zmm7, zmmword ptr [rax + 8*rdx - 448]
vminpd zmm6, zmm6, zmmword ptr [rax + 8*rdx - 384]
vminpd zmm4, zmm4, zmmword ptr [rax + 8*rdx - 320]
vminpd zmm2, zmm2, zmmword ptr [rax + 8*rdx - 256]
vminpd zmm5, zmm5, zmmword ptr [rax + 8*rdx - 192]
vminpd zmm3, zmm3, zmmword ptr [rax + 8*rdx - 128]
vminpd zmm1, zmm1, zmmword ptr [rax + 8*rdx - 64]
vminpd zmm0, zmm0, zmmword ptr [rax + 8*rdx]
add rdx, 64
cmp rdi, rdx
jne L160
minimum_simd, by contrast, is not actually SIMD or unrolled:
L48:
vminsd xmm0, xmm0, qword ptr [rcx + 8*rdx]
inc rdx
cmp rax, rdx
jne L48
Thus the relative performance gap widens for larger arrays that are not yet memory-bound:
julia> x = rand(512,3);
julia> @btime minimum_simd(@view $x[:,2])
395.990 ns (0 allocations: 0 bytes)
0.002170067911323348
julia> @btime vminimum(@view $x[:,2])
23.335 ns (0 allocations: 0 bytes)
0.002170067911323348
We really should fix this. Does it require LLVM changes?
julia> minimum_simd(x)
remark: simdloop.jl:75:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
remark: simdloop.jl:75:0: loop not vectorized
0.004311995252092693
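Another way to check, without LLVM remarks, is to inspect the optimized IR directly: a vectorized reduction shows <N x double> vector types and vector min operations, while the scalar fallback does not. A quick diagnostic using the minimum_simd defined above:

julia> @code_llvm debuginfo=:none minimum_simd(rand(128))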
I don't know LLVM's internals, but maybe there's just a list of reductions somewhere, and min and max are missing an entry?
Godbolt, showing GCC can do it.
Could there be a VectorizedArray that implements @turbo versions of Base functions?
In this specific example, it seems like we could do it for regular old Arrays once we fix an LLVM bug. That said, in general, this should be pretty easy to write.
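A minimal sketch of what such a wrapper could look like. VectorizedArray is a hypothetical name, not an existing type, and this assumes vreduce also accepts max and + (only min is demonstrated above):

using LoopVectorization

# Hypothetical wrapper: stores a parent array and reroutes a few
# Base reductions through LoopVectorization's vreduce.
struct VectorizedArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end

Base.size(v::VectorizedArray) = size(v.data)
Base.@propagate_inbounds Base.getindex(v::VectorizedArray{T,N}, i::Vararg{Int,N}) where {T,N} =
    v.data[i...]

# The interesting part: reductions dispatch to SIMD loops.
Base.minimum(v::VectorizedArray) = vreduce(min, v.data)
Base.maximum(v::VectorizedArray) = vreduce(max, v.data)
Base.sum(v::VectorizedArray) = vreduce(+, v.data)

One caveat: @view of a VectorizedArray produces a SubArray, which falls back to the Base methods, so a real implementation has to cover much more surface area.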
StrideArrays.jl is still a bit of a WIP, but it’s trying to do that:
julia> using StrideArrays
julia> x = StrideArray{Float64}(undef, (128, 3)); x .= rand.();
julia> xstatic = @StrideArray rand(128,3);
julia> xpartiallystatic = StrideArray{Float64}(undef, (StaticInt(128), 3)); xpartiallystatic .= rand.();
julia> y = rand(128,3);
julia> @btime minimum(@view $y[:,2])
355.231 ns (0 allocations: 0 bytes)
0.017097150232982306
julia> @btime minimum(@view $x[:,2])
16.872 ns (0 allocations: 0 bytes)
0.4937004891138578
julia> @btime minimum(@view $xstatic[:,2])
11.496 ns (0 allocations: 0 bytes)
0.00474131798286026
julia> @btime minimum(@view $xpartiallystatic[:,2])
18.281 ns (0 allocations: 0 bytes)
0.7472262476315941
Still looking at the LLVM source, but so far I found this for AArch64:
bool AArch64TTIImpl::isLegalToVectorizeReduction(
    const RecurrenceDescriptor &RdxDesc, ElementCount VF) const {
  if (!VF.isScalable())
    return true;

  Type *Ty = RdxDesc.getRecurrenceType();
  if (Ty->isBFloatTy() || !isElementTypeLegalForScalableVector(Ty))
    return false;

  switch (RdxDesc.getRecurrenceKind()) {
  case RecurKind::Add:
  case RecurKind::FAdd:
  case RecurKind::And:
  case RecurKind::Or:
  case RecurKind::Xor:
  case RecurKind::SMin:
  case RecurKind::SMax:
  case RecurKind::UMin:
  case RecurKind::UMax:
  case RecurKind::FMin:
  case RecurKind::FMax:
    return true;
  default:
    return false;
  }
}
FMin and FMax are listed among legal reductions.
It doesn’t vectorize on my M1.
L36:
ldr d1, [x9], #8
fcmp d0, d1
fcsel d0, d0, d1, mi
subs x8, x8, #1 ; =1
b.ne L36
To actually track this down efficiently would probably require using something like gdb to step through what LLVM is doing?