Is the following the best (minimize allocations and speed) way to get the minimum of the second column of a matrix?
minimum(@view x[:,2])
Seems reasonable but checking to see if I am missing something.
Yeah. That’s good.
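As a quick sanity check that the view avoids the copy a plain slice would make, a sketch (run each line twice so compilation doesn't inflate the numbers):

x = rand(128, 3)
@allocated minimum(x[:, 2])        # slicing copies the column, so this allocates
@allocated minimum(@view x[:, 2])  # the view avoids the copy (expect 0 bytes)

If you want it even faster, though: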
using LoopVectorization, BenchmarkTools
function vminimum(x)
    # Start from the largest representable value and fold min over
    # every element; @turbo vectorizes and unrolls the loop.
    acc = typemax(eltype(x))
    @turbo for i in eachindex(x)
        acc = min(acc, x[i])
    end
    acc
end
x = rand(128,3);
@btime minimum(@view $x[:,2])
@btime vreduce(min, @view $x[:,2])
@btime vminimum(@view $x[:,2])
I get:
julia> @btime minimum(@view $x[:,2])
  247.608 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> @btime vreduce(min, @view $x[:,2])
  12.258 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> @btime vminimum(@view $x[:,2])
  12.527 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> versioninfo()
Julia Version 1.8.0-DEV.310
Commit cb30aa7a08* (2021-08-06 03:11 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
minimum has much lower compile-time latency, though.
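A rough way to see that, as a sketch: in a fresh session, the first call of each is dominated by compilation.

using LoopVectorization
x = rand(128, 3)
@time minimum(@view x[:, 2])       # Base: compiles almost instantly
@time vreduce(min, @view x[:, 2])  # LoopVectorization: much longer first call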
Why is this faster than minimum? Is Julia just missing @inbounds?
LLVM doesn’t know how to SIMD min.
But @inbounds does help a lot:
julia> function minimum_simd(x)
           acc = typemax(eltype(x))
           @inbounds @simd for i in eachindex(x)
               acc = Base.FastMath.min_fast(acc, x[i])
           end
           acc
       end
minimum_simd (generic function with 1 method)
julia> @btime minimum_simd(@view $x[:,2])
  66.216 ns (0 allocations: 0 bytes)
0.010916262944919874
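For reference, assembly like the dumps below can be pulled up with @code_native; a sketch (syntax=:intel matches the output shown):

@code_native syntax=:intel debuginfo=:none vminimum(@view x[:, 2])
@code_native syntax=:intel debuginfo=:none minimum_simd(@view x[:, 2])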
ASM:
vminimum:
L160:
        vminpd  zmm7, zmm7, zmmword ptr [rax + 8*rdx - 448]
        vminpd  zmm6, zmm6, zmmword ptr [rax + 8*rdx - 384]
        vminpd  zmm4, zmm4, zmmword ptr [rax + 8*rdx - 320]
        vminpd  zmm2, zmm2, zmmword ptr [rax + 8*rdx - 256]
        vminpd  zmm5, zmm5, zmmword ptr [rax + 8*rdx - 192]
        vminpd  zmm3, zmm3, zmmword ptr [rax + 8*rdx - 128]
        vminpd  zmm1, zmm1, zmmword ptr [rax + 8*rdx - 64]
        vminpd  zmm0, zmm0, zmmword ptr [rax + 8*rdx]
        add     rdx, 64
        cmp     rdi, rdx
        jne     L160
minimum_simd, which is not actually SIMD or unrolled:
L48:
        vminsd  xmm0, xmm0, qword ptr [rcx + 8*rdx]
        inc     rdx
        cmp     rax, rdx
        jne     L48
Thus minimum_simd’s relative performance gets worse for larger arrays that are still small enough not to be memory-bound:
julia> x = rand(512,3);
julia> @btime minimum_simd(@view $x[:,2])
  395.990 ns (0 allocations: 0 bytes)
0.002170067911323348
julia> @btime vminimum(@view $x[:,2])
  23.335 ns (0 allocations: 0 bytes)
0.002170067911323348
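A quick sketch to see how the gap scales across a few sizes (sizes chosen arbitrarily):

using BenchmarkTools
for n in (128, 512, 2048)
    v = @view rand(n, 3)[:, 2]
    t_simd  = @belapsed minimum_simd($v)
    t_turbo = @belapsed vminimum($v)
    println("n = $n: minimum_simd $(round(Int, t_simd * 1e9)) ns, vminimum $(round(Int, t_turbo * 1e9)) ns")
end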
We really should fix this. Does it require LLVM changes?
julia> minimum_simd(x)
remark: simdloop.jl:75:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
remark: simdloop.jl:75:0: loop not vectorized
0.004311995252092693
I don’t know LLVM’s internals, but maybe there’s just a list of reductions somewhere, and min and max are missing an entry?
Godbolt, showing GCC can do it.
Could there be a VectorizedArray that implements @turbo versions of Base functions?
In this specific example, it seems like we could do it for regular old Arrays once we fix an LLVM bug. That said, in general, this should be pretty easy to write.
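A minimal sketch of that idea; the name VectorizedArray is hypothetical (no such package is assumed), and only minimum is wired up:

using LoopVectorization

struct VectorizedArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end
Base.size(v::VectorizedArray) = size(v.data)
Base.getindex(v::VectorizedArray, i::Int...) = v.data[i...]
# Route the Base reduction through LoopVectorization:
Base.minimum(v::VectorizedArray) = vreduce(min, v.data)

v = VectorizedArray(rand(128, 3))
minimum(v)  # hits the vreduce method

One catch: @view v[:, 2] produces a plain SubArray, so views would also need wrapping (or their own overloads) to keep the fast path.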
StrideArrays.jl is still a bit of a WIP, but it’s trying to do that:
julia> using StrideArrays
julia> x = StrideArray{Float64}(undef, (128, 3)); x .= rand.();
julia> xstatic = @StrideArray rand(128,3);
julia> xpartiallystatic = StrideArray{Float64}(undef, (StaticInt(128), 3)); xpartiallystatic .= rand.();
julia> y = rand(128,3);
julia> @btime minimum(@view $y[:,2])
  355.231 ns (0 allocations: 0 bytes)
0.017097150232982306
julia> @btime minimum(@view $x[:,2])
  16.872 ns (0 allocations: 0 bytes)
0.4937004891138578
julia> @btime minimum(@view $xstatic[:,2])
  11.496 ns (0 allocations: 0 bytes)
0.00474131798286026
julia> @btime minimum(@view $xpartiallystatic[:,2])
  18.281 ns (0 allocations: 0 bytes)
0.7472262476315941
Still looking at the LLVM source, but so far I found this for AArch64:
bool AArch64TTIImpl::isLegalToVectorizeReduction(
    const RecurrenceDescriptor &RdxDesc, ElementCount VF) const {
  if (!VF.isScalable())
    return true;

  Type *Ty = RdxDesc.getRecurrenceType();
  if (Ty->isBFloatTy() || !isElementTypeLegalForScalableVector(Ty))
    return false;

  switch (RdxDesc.getRecurrenceKind()) {
  case RecurKind::Add:
  case RecurKind::FAdd:
  case RecurKind::And:
  case RecurKind::Or:
  case RecurKind::Xor:
  case RecurKind::SMin:
  case RecurKind::SMax:
  case RecurKind::UMin:
  case RecurKind::UMax:
  case RecurKind::FMin:
  case RecurKind::FMax:
    return true;
  default:
    return false;
  }
}
FMin and FMax are listed among the legal reductions, yet it still doesn’t vectorize on my M1:
L36:
        ldr     d1, [x9], #8
        fcmp    d0, d1
        fcsel   d0, d0, d1, mi
        subs    x8, x8, #1                      ; =1
        b.ne    L36
Tracking this down efficiently would probably require stepping through LLVM with something like gdb to see what it’s doing.