Is the following the best (minimize allocations and speed) way to get the minimum of the second column of a matrix?
minimum(@view x[:,2])
Seems reasonable but checking to see if I am missing something.
Yeah. That’s good.
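As a quick sanity check that the view avoids the copy a plain slice would make, a sketch (run each line twice so compilation doesn't inflate the numbers):

x = rand(128, 3)
@allocated minimum(x[:, 2])        # slicing copies the column, so this allocates
@allocated minimum(@view x[:, 2])  # the view avoids the copy (expect 0 bytes)

If you want it even faster, though: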
using LoopVectorization, BenchmarkTools
function vminimum(x)
    # Start from the largest representable value and fold min over
    # every element; @turbo vectorizes and unrolls the loop.
    acc = typemax(eltype(x))
    @turbo for i in eachindex(x)
        acc = min(acc, x[i])
    end
    acc
end
x = rand(128,3);
@btime minimum(@view $x[:,2])
@btime vreduce(min, @view $x[:,2])
@btime vminimum(@view $x[:,2])
I get:
julia> @btime minimum(@view $x[:,2])
  247.608 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> @btime vreduce(min, @view $x[:,2])
  12.258 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> @btime vminimum(@view $x[:,2])
  12.527 ns (0 allocations: 0 bytes)
0.010916262944919874
julia> versioninfo()
Julia Version 1.8.0-DEV.310
Commit cb30aa7a08* (2021-08-06 03:11 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
minimum has much lower compile-time latency, though.
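A rough way to see that, as a sketch: in a fresh session, the first call of each is dominated by compilation.

using LoopVectorization
x = rand(128, 3)
@time minimum(@view x[:, 2])       # Base: compiles almost instantly
@time vreduce(min, @view x[:, 2])  # LoopVectorization: much longer first call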
Why is this faster than minimum? Is Julia just missing @inbounds?
LLVM doesn’t know how to SIMD min.
But @inbounds does help a lot:
julia> function minimum_simd(x)
           acc = typemax(eltype(x))
           @inbounds @simd for i in eachindex(x)
               acc = Base.FastMath.min_fast(acc, x[i])
           end
           acc
       end
minimum_simd (generic function with 1 method)
julia> @btime minimum_simd(@view $x[:,2])
  66.216 ns (0 allocations: 0 bytes)
0.010916262944919874
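For reference, assembly like the dumps below can be pulled up with @code_native; a sketch (syntax=:intel matches the output shown):

@code_native syntax=:intel debuginfo=:none vminimum(@view x[:, 2])
@code_native syntax=:intel debuginfo=:none minimum_simd(@view x[:, 2])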
ASM:
vminimum:
L160:
        vminpd  zmm7, zmm7, zmmword ptr [rax + 8*rdx - 448]
        vminpd  zmm6, zmm6, zmmword ptr [rax + 8*rdx - 384]
        vminpd  zmm4, zmm4, zmmword ptr [rax + 8*rdx - 320]
        vminpd  zmm2, zmm2, zmmword ptr [rax + 8*rdx - 256]
        vminpd  zmm5, zmm5, zmmword ptr [rax + 8*rdx - 192]
        vminpd  zmm3, zmm3, zmmword ptr [rax + 8*rdx - 128]
        vminpd  zmm1, zmm1, zmmword ptr [rax + 8*rdx - 64]
        vminpd  zmm0, zmm0, zmmword ptr [rax + 8*rdx]
        add     rdx, 64
        cmp     rdi, rdx
        jne     L160
minimum_simd, which is not actually SIMD or unrolled:
L48:
        vminsd  xmm0, xmm0, qword ptr [rcx + 8*rdx]
        inc     rdx
        cmp     rax, rdx
        jne     L48
Thus minimum_simd’s relative performance gets worse for larger arrays that are still small enough not to be memory-bound:
julia> x = rand(512,3);
julia> @btime minimum_simd(@view $x[:,2])
  395.990 ns (0 allocations: 0 bytes)
0.002170067911323348
julia> @btime vminimum(@view $x[:,2])
  23.335 ns (0 allocations: 0 bytes)
0.002170067911323348
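A quick sketch to see how the gap scales across a few sizes (sizes chosen arbitrarily):

using BenchmarkTools
for n in (128, 512, 2048)
    v = @view rand(n, 3)[:, 2]
    t_simd  = @belapsed minimum_simd($v)
    t_turbo = @belapsed vminimum($v)
    println("n = $n: minimum_simd $(round(Int, t_simd * 1e9)) ns, vminimum $(round(Int, t_turbo * 1e9)) ns")
end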
We really should fix this. Does it require LLVM changes?
julia> minimum_simd(x)
remark: simdloop.jl:75:0: loop not vectorized: value that could not be identified as reduction is used outside the loop
remark: simdloop.jl:75:0: loop not vectorized
0.004311995252092693
I don’t know LLVM’s internals, but maybe there’s just a list of reductions somewhere, and min and max are missing an entry?
Godbolt, showing GCC can do it.
Could there be a VectorizedArray that implements @turbo versions of Base functions?
In this specific example, it seems like we could do it for regular old Arrays once we fix an LLVM bug. That said, in general, this should be pretty easy to write.
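A minimal sketch of that idea; the name VectorizedArray is hypothetical (no such package is assumed), and only minimum is wired up:

using LoopVectorization

struct VectorizedArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end
Base.size(v::VectorizedArray) = size(v.data)
Base.getindex(v::VectorizedArray, i::Int...) = v.data[i...]
# Route the Base reduction through LoopVectorization:
Base.minimum(v::VectorizedArray) = vreduce(min, v.data)

v = VectorizedArray(rand(128, 3))
minimum(v)  # hits the vreduce method

One catch: @view v[:, 2] produces a plain SubArray, so views would also need wrapping (or their own overloads) to keep the fast path.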
StrideArrays.jl is still a bit of a WIP, but it’s trying to do that:
julia> using StrideArrays
julia> x = StrideArray{Float64}(undef, (128, 3)); x .= rand.();
julia> xstatic = @StrideArray rand(128,3);
julia> xpartiallystatic = StrideArray{Float64}(undef, (StaticInt(128), 3)); xpartiallystatic .= rand.();
julia> y = rand(128,3);
julia> @btime minimum(@view $y[:,2])
  355.231 ns (0 allocations: 0 bytes)
0.017097150232982306
julia> @btime minimum(@view $x[:,2])
  16.872 ns (0 allocations: 0 bytes)
0.4937004891138578
julia> @btime minimum(@view $xstatic[:,2])
  11.496 ns (0 allocations: 0 bytes)
0.00474131798286026
julia> @btime minimum(@view $xpartiallystatic[:,2])
  18.281 ns (0 allocations: 0 bytes)
0.7472262476315941
Still looking at the LLVM source, but so far I found this for AArch64:
bool AArch64TTIImpl::isLegalToVectorizeReduction(
    const RecurrenceDescriptor &RdxDesc, ElementCount VF) const {
  if (!VF.isScalable())
    return true;

  Type *Ty = RdxDesc.getRecurrenceType();
  if (Ty->isBFloatTy() || !isElementTypeLegalForScalableVector(Ty))
    return false;

  switch (RdxDesc.getRecurrenceKind()) {
  case RecurKind::Add:
  case RecurKind::FAdd:
  case RecurKind::And:
  case RecurKind::Or:
  case RecurKind::Xor:
  case RecurKind::SMin:
  case RecurKind::SMax:
  case RecurKind::UMin:
  case RecurKind::UMax:
  case RecurKind::FMin:
  case RecurKind::FMax:
    return true;
  default:
    return false;
  }
}
FMin and FMax are listed among the legal reductions, yet it still doesn’t vectorize on my M1:
L36:
        ldr     d1, [x9], #8
        fcmp    d0, d1
        fcsel   d0, d0, d1, mi
        subs    x8, x8, #1                      ; =1
        b.ne    L36
Tracking this down efficiently would probably require stepping through LLVM with something like gdb to see what it’s doing.