Speeding up creation of maximum mask array

Many thanks, we are down to 628.492 μs now.

When I try unroll=(1,6) with a tuple, I get this error:

ERROR: MethodError: no method matching bitstore!(::VectorizationBase.PackedStridedBitPointer{1,2}, ::Mask{8,UInt8}, ::SVec{8,Int32})
Stacktrace:
  [1] vnoaliasstore!(ptr::VectorizationBase.PackedStridedBitPointer{1,2}, v::Mask{8,UInt8}, i::Tuple{VectorizationBase.Static{0},VectorizationBase._MM{8,VectorizationBase.Static{0}}})
    @ VectorizationBase ~/.julia/packages/VectorizationBase/kIoqa/src/masks.jl:424

In a loop like this, is there any way to insist that seen really remains an integer, not an SVec?

    @avx unroll=4 for c in axes(x,2)
        seen = 0
        for r in axes(x,1)
            flag = onlyone(x[r,c] == y[c], seen)
            mask[r,c] = flag
            seen += any(flag)
        end
    end
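For reference, this is roughly the behaviour I am after, written as a plain loop without @avx, where seen is trivially an Int. It is only a sketch: the scalar onlyone method here is my assumption about what the helper does (true only for the first match in a column), the function name firstmatch_mask! is made up for illustration, and the small x and y are toy data.

```julia
# Assumed scalar stand-in for `onlyone`; the real definition is not shown in this post.
onlyone(cond::Bool, seen::Integer) = cond && iszero(seen)

# Plain-loop reference (no @avx): marks the first row in each column where x == y[c].
function firstmatch_mask!(mask::AbstractMatrix{Bool}, x::AbstractMatrix, y::AbstractVector)
    @inbounds for c in axes(x, 2)
        seen = 0                      # ordinary Int counter, reset for every column
        for r in axes(x, 1)
            flag = onlyone(x[r, c] == y[c], seen)
            mask[r, c] = flag
            seen += flag              # Bool promotes to Int, so seen stays an Int
        end
    end
    return mask
end

# Toy data: y holds the column maxima of x.
x = [1 5; 3 5; 3 2]
y = vec(maximum(x, dims=1))
mask = firstmatch_mask!(falses(size(x)), x, y)
```

Something like this could serve as a correctness check for whatever the vectorised version produces, since each column should end up with at most one true entry.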

I tried for instance moving @avx to the inner loop, but then I get

ERROR: MethodError: no method matching subsetview(::VectorizationBase.PackedStridedBitPointer{1,2}, ::Val{2}, ::Int64)

This was with Threads.nthreads() == 4, possibly unwisely, on a 2-core laptop.
