Minimum of a column of a matrix

Is the following the best way (minimizing allocations and maximizing speed) to get the minimum of the second column of a matrix?

minimum(@view x[:,2])

Seems reasonable, but I'm checking to see whether I'm missing something.

Yeah. That’s good.
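For reference, a plain slice minimum(x[:, 2]) copies the column first, so the view saves an allocation. A quick (untested) sketch to confirm:

using BenchmarkTools
x = rand(128, 3)
@btime minimum($x[:, 2])        # slicing allocates a copy of the column
@btime minimum(@view $x[:, 2])  # the view reads in place: 0 allocations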

using LoopVectorization, BenchmarkTools
function vminimum(x)
    acc = typemax(eltype(x))  # typemax is the identity element for min
    @turbo for i in eachindex(x)
        acc = min(acc, x[i])
    end
    acc
end
x = rand(128,3);
@btime minimum(@view $x[:,2])
@btime vreduce(min, @view $x[:,2])
@btime vminimum(@view $x[:,2])

I get:

julia> @btime minimum(@view $x[:,2])
  247.608 ns (0 allocations: 0 bytes)
0.010916262944919874

julia> @btime vreduce(min, @view $x[:,2])
  12.258 ns (0 allocations: 0 bytes)
0.010916262944919874

julia> @btime vminimum(@view $x[:,2])
  12.527 ns (0 allocations: 0 bytes)
0.010916262944919874

julia> versioninfo()
Julia Version 1.8.0-DEV.310
Commit cb30aa7a08* (2021-08-06 03:11 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)

minimum has much lower compile-time latency.
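One way to see this is first-call timings in fresh sessions, where compilation dominates. Something like:

@time minimum(@view x[:, 2])   # compiles almost instantly
@time vminimum(@view x[:, 2])  # first call pays LoopVectorization's much larger compile cost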


Why is this faster than minimum? Is Julia just missing @inbounds?

LLVM doesn’t know how to SIMD min.
But @inbounds @simd does help a lot:

julia> function minimum_simd(x)
           acc = typemax(eltype(x))
           @inbounds @simd for i in eachindex(x)
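               # min_fast relaxes Base.min's NaN/-0.0 semantics so LLVM may reorder the reduction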
               acc = Base.FastMath.min_fast(acc, x[i])
           end
           acc
       end
minimum_simd (generic function with 1 method)

julia> @btime minimum_simd(@view $x[:,2])
  66.216 ns (0 allocations: 0 bytes)
0.010916262944919874

ASM:
vminimum:

L160:
        vminpd  zmm7, zmm7, zmmword ptr [rax + 8*rdx - 448]
        vminpd  zmm6, zmm6, zmmword ptr [rax + 8*rdx - 384]
        vminpd  zmm4, zmm4, zmmword ptr [rax + 8*rdx - 320]
        vminpd  zmm2, zmm2, zmmword ptr [rax + 8*rdx - 256]
        vminpd  zmm5, zmm5, zmmword ptr [rax + 8*rdx - 192]
        vminpd  zmm3, zmm3, zmmword ptr [rax + 8*rdx - 128]
        vminpd  zmm1, zmm1, zmmword ptr [rax + 8*rdx - 64]
        vminpd  zmm0, zmm0, zmmword ptr [rax + 8*rdx]
        add     rdx, 64
        cmp     rdi, rdx
        jne     L160
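(Eight separate zmm accumulators, i.e. 64 Float64s per iteration with AVX-512.)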

minimum_simd, which is not actually SIMD or unrolled:

L48:
        vminsd  xmm0, xmm0, qword ptr [rcx + 8*rdx]
        inc     rdx
        cmp     rax, rdx
        jne     L48

Thus the relative performance gap widens for larger arrays that are still small enough not to be memory-bound:

julia> x = rand(512,3);

julia> @btime minimum_simd(@view $x[:,2])
  395.990 ns (0 allocations: 0 bytes)
0.002170067911323348

julia> @btime vminimum(@view $x[:,2])
  23.335 ns (0 allocations: 0 bytes)
0.002170067911323348
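That is roughly a 17× gap at 512 rows, up from about 5× at 128.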

We really should fix this. Does it require LLVM changes?

julia> minimum_simd(x)
remark: simdloop.jl:75:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
remark: simdloop.jl:75:0: loop not vectorized
0.004311995252092693
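(If I recall correctly, such missed-optimization remarks can be surfaced by setting the JULIA_LLVM_ARGS environment variable to something like -pass-remarks-missed=loop-vectorize before starting Julia.)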

I don’t know LLVM’s internals, but maybe there’s just a list of reductions somewhere, and min and max are missing an entry?
Godbolt, showing GCC can do it.


Could there be a VectorizedArray that implements @turbo versions of Base functions?


In this specific example, it seems like we could do it for regular old Arrays once we fix an LLVM bug. That said, in general, this should be pretty easy to write.
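
For instance, a minimal untested sketch (the VectorizedArray name here is hypothetical, borrowed from the question above):

using LoopVectorization

struct VectorizedArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end
Base.size(v::VectorizedArray) = size(v.data)
Base.@propagate_inbounds Base.getindex(v::VectorizedArray, i::Int...) = v.data[i...]

# Route the Base reduction through a @turbo loop over the parent array.
function Base.minimum(v::VectorizedArray)
    acc = typemax(eltype(v))
    @turbo for i in eachindex(v.data)
        acc = min(acc, v.data[i])
    end
    acc
end

minimum(VectorizedArray(@view x[:, 2]))  # dispatches to the @turbo method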

StrideArrays.jl is still a bit of a WIP, but it’s trying to do that:

julia> using StrideArrays

julia> x = StrideArray{Float64}(undef, (128, 3)); x .= rand.();

julia> xstatic = @StrideArray rand(128,3);

julia> xpartiallystatic = StrideArray{Float64}(undef, (StaticInt(128), 3)); xpartiallystatic .= rand.();

julia> y = rand(128,3);

julia> @btime minimum(@view $y[:,2])
  355.231 ns (0 allocations: 0 bytes)
0.017097150232982306

julia> @btime minimum(@view $x[:,2])
  16.872 ns (0 allocations: 0 bytes)
0.4937004891138578

julia> @btime minimum(@view $xstatic[:,2])
  11.496 ns (0 allocations: 0 bytes)
0.00474131798286026

julia> @btime minimum(@view $xpartiallystatic[:,2])
  18.281 ns (0 allocations: 0 bytes)
0.7472262476315941
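
(All three StrideArray variants avoid the ~355 ns scalar path that the plain Array hits.)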

Still looking at the LLVM source, but so far I found this for AArch64:

bool AArch64TTIImpl::isLegalToVectorizeReduction(
    const RecurrenceDescriptor &RdxDesc, ElementCount VF) const {
  if (!VF.isScalable())
    return true;

  Type *Ty = RdxDesc.getRecurrenceType();
  if (Ty->isBFloatTy() || !isElementTypeLegalForScalableVector(Ty))
    return false;

  switch (RdxDesc.getRecurrenceKind()) {
  case RecurKind::Add:
  case RecurKind::FAdd:
  case RecurKind::And:
  case RecurKind::Or:
  case RecurKind::Xor:
  case RecurKind::SMin:
  case RecurKind::SMax:
  case RecurKind::UMin:
  case RecurKind::UMax:
  case RecurKind::FMin:
  case RecurKind::FMax:
    return true;
  default:
    return false;
  }
}

FMin and FMax are listed among the legal reductions, yet it still doesn’t vectorize on my M1:

L36:
        ldr     d1, [x9], #8
        fcmp    d0, d1
        fcsel   d0, d0, d1, mi
        subs    x8, x8, #1                      ; =1
        b.ne    L36
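(A scalar load plus fcmp/fcsel per element, so no NEON min reduction either.)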

Actually tracking this down efficiently would probably require stepping through LLVM with something like gdb?
