# Speeding up creation of maximum mask array

I have a program that needs a mask array, created by putting a `1` at each column-wise maximum position and `0` everywhere else. I want the mask to contain exactly one `1` per column.

I can do this by

``````
x = rand(3, 3);
max_ = maximum(x, dims=1)
mask = x .== max_   # 1 where the column maximum occurs
``````

The problem with the above method is that, if `x` has non-unique maxima in some columns, the mask will have more than one `1` in those columns.
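For instance, a minimal illustration of the duplicate-maxima problem (a small hypothetical example, not from the code below):

```julia
x = [1.0 2.0;
     1.0 0.5]                        # column 1 has a tied maximum
mask = x .== maximum(x, dims=1)     # BitMatrix with a 1 at every maximum
sum(mask, dims=1)                   # [2 1] — column 1 contains two 1s
```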

The below function checks and fills `0` in place of the extra `1`s.

``````
function remove_extra_1s(mat)
    @inbounds for i ∈ 1:size(mat, 2)
        count = 0
        for j ∈ 1:size(mat, 1)
            if mat[j, i] == 1
                count += 1
                if count > 1
                    mat[j, i] = 0
                end
            end
        end
    end
    return mat
end
``````

I’ve also used the `findmax` function to get the mask

``````
function findmax_mask(x)
    max_, indices = findmax(x, dims=1)
end
``````

Performance timings:

``````
x = rand(1_000, 1_000);

findmax_mask(x)[2] == remove_extra_1s((x .== maximum(x, dims=1))) # true

@btime findmax_mask($x);
  4.430 ms (8 allocations: 7.65 MiB)

@btime remove_extra_1s(($x .== maximum($x, dims=1)));
  2.176 ms (7 allocations: 134.44 KiB)
``````

I’ve also used LoopVectorization to write a faster `maximum` function. I suspect the `remove_extra_1s` function is the bottleneck.

Is there any way I can speed up the `remove_extra_1s` function? Is there any other faster different way to get the `mask` and `max_`?

You are storing your mask as a matrix of floats here, for the given inputs. That doesn’t make much difference in speed, but it does for memory, and it also doesn’t really make sense semantically.

Preferably use an array of `Bool` or a `BitArray`.
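For the sizes in this thread the difference is substantial; a rough comparison (sizes approximate):

```julia
x = rand(1_000, 1_000)
mask_float = Float64.(x .== maximum(x, dims=1))  # one Float64 per entry
mask_bits  = x .== maximum(x, dims=1)            # BitMatrix, one bit per entry
sizeof(mask_float)            # 8_000_000 bytes (~7.63 MiB)
Base.summarysize(mask_bits)   # ~125_000 bytes (~122 KiB)
```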


Warning: very brittle and ugly code, could have been done better.

``````
using Random
using BenchmarkTools

function findmax_mask(x)
    max_, indices = findmax(x, dims=1)
end

function remove_extra_1s(mat)
    @inbounds for i ∈ 1:size(mat, 2)
        count = 0
        for j ∈ 1:size(mat, 1)
            if mat[j, i] == 1
                count += 1
                if count > 1
                    mat[j, i] = 0
                end
            end
        end
    end
    return mat
end

function myfindmax(m, col)
    v = m[1, col]
    ix = 1
    @inbounds for row in 2:size(m, 1)
        if m[row, col] > v
            ix = row
            v = m[row, col]
        end
    end
    return ix
end

function my_mask_creator(a)
    mask = zeros(Bool, size(a))
    @inbounds for col in 1:size(a, 2)
        mask[myfindmax(a, col), col] = true
    end
    return mask
end

function benchmark()
    Random.seed!(0)
    x = rand(1_000, 1_000);

    @assert findmax_mask(x)[2] == remove_extra_1s((x .== maximum(x, dims=1))) # true

    @btime findmax_mask($x);
    @btime remove_extra_1s(($x .== maximum($x, dims=1)));
    @btime my_mask_creator($x);
    nothing
end

benchmark()
``````

gives

``````  4.782 ms (9 allocations: 7.65 MiB)
2.648 ms (7 allocations: 134.44 KiB)
1.214 ms (2 allocations: 976.70 KiB)
``````

Why do you say it’s brittle? Is it because of the hard-coded loops?

I say it is brittle because:

1. `myfindmax` and `my_mask_creator` do not check on the type of `m` or `a`;
2. `myfindmax` expects `m` to not be empty (otherwise throwing a generic bounds error);
3. `myfindmax` and `my_mask_creator` expect indexes to start at one (which is not always true).
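Point 3 can be sketched like this (hypothetical variant name `myfindmax_axes`; iterating `axes(m, 1)` instead of `1:size(m, 1)` makes the function correct for non-1-based arrays such as OffsetArrays, too):

```julia
# `axes`-based variant of `myfindmax`: identical for ordinary matrices,
# but also correct when the row indices do not start at 1.
function myfindmax_axes(m, col)
    ix = first(axes(m, 1))
    v = m[ix, col]
    for row in axes(m, 1)
        if m[row, col] > v
            ix, v = row, m[row, col]
        end
    end
    return ix
end

myfindmax_axes([1 4; 9 2], 1)  # 2 — the 9 is in row 2
```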

This is quite compact and about 15% faster than Henrique’s function.

``````
function findmax_mask2(x)
    c = 0
    mask = falses(size(x))
    for col in eachcol(x)
        mask[argmax(col), c += 1] = true
    end
    return mask
end

function benchmark()
    Random.seed!(0)
    x = rand(1_000, 1_000)
    @btime my_mask_creator($x)
    @btime findmax_mask2($x)
    nothing
end

benchmark()
``````

gives

``````
  1.170 ms (2 allocations: 976.70 KiB)
  1.040 ms (3 allocations: 122.25 KiB)
``````

Very compact indeed, but note the hidden `c += 1`! You could `enumerate` `eachcol` to get the same effect (saving another line, to boot). The most “robust” way is probably to use the `axes` function, to account for the non-1-based-indexing edge case @Henrique_Becker mentioned, though it’s usually pretty safe to ignore that edge case.
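A sketch of the `enumerate(eachcol(...))` variant (hypothetical name `findmax_mask_enum`):

```julia
function findmax_mask_enum(x)
    mask = falses(size(x))
    for (c, col) in enumerate(eachcol(x))
        mask[argmax(col), c] = true   # first maximum only, no hidden counter
    end
    return mask
end

findmax_mask_enum([1 2; 3 1])  # Bool[0 1; 1 0]
```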


Thank you for all the implementations, I’ve learnt something from each of them.

I’ve found that using multithreading is much slower than just using `@inbounds`. Why is that? Is it spending more time on synchronization?
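For reference, a minimal threaded sketch (not the exact code from this thread) might look like this; with only ~1000 comparisons per column, scheduling overhead can easily dominate:

```julia
function findmax_mask_threads(x)
    # Array{Bool} rather than BitArray: each thread writes its own columns,
    # and concurrent writes to a BitArray would race on shared 64-bit chunks.
    mask = fill(false, size(x))
    Threads.@threads for c in 1:size(x, 2)
        ix = 1
        @inbounds for r in 2:size(x, 1)
            if x[r, c] > x[ix, c]
                ix = r
            end
        end
        mask[ix, c] = true
    end
    return mask
end

findmax_mask_threads([1 2; 3 1])  # Bool[0 1; 1 0]
```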

Times of some attempts, note that my `x` here has some repeated elements:

``````julia> Random.seed!(42); x = rand(randn(1_000), 1_000, 1_000); # some repeats

julia> sum(x .== maximum(x, dims=1), dims=1)
1×1000 Array{Int64,2}:
1  1  2  2  1  2  1  4  2  1  1  1  …  2  1  2  1  3  3  2  4  2  1  3  3

julia> @btime vreduce(max, $x, dims=1);
193.752 μs (1 allocation: 7.94 KiB)

5.286 ms (4 allocations: 7.65 MiB)

julia> @btime remove_extra_1s(($x .== maximum($x, dims=1))); # from vkv
3.063 ms (8 allocations: 134.48 KiB)

julia> @btime my_mask_creator($x); # from Henrique_Becker
1.263 ms (2 allocations: 976.70 KiB)

julia> @btime findmax_mask2($x); # from Seif_Shebl --> 1.667 ms no repeats
2.308 ms (3 allocations: 122.25 KiB)

julia> @btime findmax_mask3($x); # --> 1.149 ms no repeats
998.595 μs (4 allocations: 130.19 KiB)

488.894 μs (95 allocations: 136.55 KiB)

# Attempts at a vectorised version:
julia> @btime findmax_mask4($x); # --> 1.329 ms no repeats
1.328 ms (3 allocations: 984.64 KiB)

797.529 μs (4 allocations: 130.19 KiB)

# Alternative x, some times change, marked -->
julia> Random.seed!(0); x = rand(1_000, 1_000); # no repeats

julia> sum(x .== maximum(x, dims=1), dims=1)
1×1000 Array{Int64,2}:
1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1

julia> VERSION
v"1.5.0"
``````

All of these are quite far from `vreduce`, which made me think it ought to be possible to do better. `findmax_mask6` is an attempt to do this, with LoopVectorization, but it’s not actually faster. (Edit – some improvements.) But perhaps @Elrod knows a better way?

``````
using LoopVectorization, Tullio, Random, BenchmarkTools

function findmax_mask3(x)
    y = vreduce(max, x, dims=1)
    mask = falses(size(x))
    # mask = fill(false, size(x)) # Array{Bool,2} much the same speed
    for c in axes(x,2)
        @inbounds for r in axes(x,1)
            flag = x[r,c] == y[c]
            mask[r,c] = flag
            flag && break
        end
    end
    mask
end

function findmax_mask4(x)
    @tullio (max) y[c] := x[r,c];
    mask = fill(false, size(x))
    @inbounds for c in axes(x,2)
        for r in axes(x,1)
            flag = x[r,c] == y[c]
            mask[r,c] = flag
            flag && break # this prevents @simd
        end
    end
    mask
end

function findmax_mask5(x)
    y = vreduce(max, x, dims=1)
    # mask = falses(size(x)) # BitArray{2}
    mask = fill(false, size(x)) # Array{Bool,2} is faster
    @inbounds for c in axes(x,2)
        seen = false
        @simd for r in axes(x,1)
            flag = !seen & (x[r,c] == y[c])
            mask[r,c] = flag
            seen |= flag
        end
    end
    mask
end

x0 = rand(1:5, 10, 30) # small, many repeats
x0 = rand(rand(100), 100, 1000); # some repeats

using LoopVectorization: vifelse
SVec{4,Int}(1,2,3,4) # SVec{4,Int64}<1, 2, 3, 4>

@inline onlyone(cond::Bool, seen::Int) = cond && iszero(seen)
@inline function onlyone(cond::Mask{W}, seen::Union{Int,SVec}) where {W}
    allzero(seen) || return zero(cond)
    Mask{W}(hibit(cond.u)) # keep only one set lane of the mask
end
@inline function hibit(n::UInt16) # https://stackoverflow.com/questions/53161/find-the-highest-order-bit-in-c
    n |= (n >>  1);
    n |= (n >>  2);
    n |= (n >>  4);
    n |= (n >>  8);
    return n - (n >> 1)
end
@inline function hibit(n::UInt8)
    n |= (n >>  1);
    n |= (n >>  2);
    n |= (n >>  4);
    return n - (n >> 1)
end

@inline allzero(seen::Int) = iszero(seen)
# @inline allzero(seen::SVec{N,Int}) where {N} = all(ntuple(i -> iszero(seen.data[i].value), N))
@inline allzero(seen::SVec{N,Int}) where {N} = iszero((!iszero(seen)).u)

@inline anyone(cond::Bool) = cond
@inline anyone(cond::Mask) = cond != zero(cond)

function findmax_mask6(x)
    y = vreduce(max, x, dims=1)
    mask = falses(size(x)) # BitArray{2} # faster
    # mask = fill(false, size(x)) # Array{Bool,2}
    @avx for c in axes(x,2)
        seen = 0
        for r in axes(x,1)
            flag = onlyone(x[r,c] == y[c], seen)
            mask[r,c] = flag
            seen += anyone(flag)
        end
    end
    mask
end

findmax_mask6(x0) == findmax_mask2(x0) # false, because it chooses the last not first
``````

How many threads did you use? I find multithreading performance to be inconsistent with matrix sizes.

If you want a more general definition of `hibit`:

``````hibitv2(x) = (one(x) << (8sizeof(x)-1)) >>> leading_zeros(x)
``````
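A quick sanity check of the branch-free version on a few values (mentally traced; zero maps to zero because the shift count equals the full bit width):

```julia
hibitv2(x) = (one(x) << (8sizeof(x) - 1)) >>> leading_zeros(x)

hibitv2(0x0c07)  # 0x0800 — isolates the highest set bit
hibitv2(0x0001)  # 0x0001
hibitv2(0x0000)  # 0x0000 — zero input gives zero
```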

or (using VectorizationBase 0.12.32; the function existed in older versions, but I only just now optimized it)

``````using VectorizationBase: prevpow2
``````

Now, check out the differences in assembly. Original:

``````# julia> @code_native debuginfo=:none syntax=:intel hibit(0x0c07)
.text
movzx   eax, di
shr     eax
or      eax, edi
movzx   eax, ax
mov     ecx, eax
shr     ecx, 2
or      ecx, eax
mov     edx, ecx
shr     edx, 4
or      edx, ecx
mov     eax, edx
shr     eax, 8
or      eax, edx
mov     ecx, eax
shr     ecx
sub     eax, ecx
ret
nop     word ptr cs:[rax + rax]
``````

Version 2:

``````# julia> @code_native debuginfo=:none syntax=:intel hibitv2(0x0c07)
.text
xor     ecx, ecx
lzcnt   ax, di
mov     edx, 32768
shrx    eax, edx, eax
cmovb   eax, ecx
ret
nop     word ptr cs:[rax + rax]
nop
``````

From VectorizationBase:

``````# julia> @code_native debuginfo=:none syntax=:intel prevpow2(0x0c07)
.text
lzcnt   ax, di
mov     ecx, 32768
shrx    eax, ecx, eax
ret
``````

Note that on recent architectures like Skylake `lzcnt` has 3 cycles of latency, instead of the 1 cycle of and/or/shift-type operations (or 2 for `mov[zx]`).
Still, this is a clear win.

Your `anyone` is optimal in terms of assembly, but `any(::Mask)` is already defined.

Another point is that `@avx` won’t unroll much when it doesn’t recognize the functions. Probably simplest to manually try a few settings, e.g. `@avx unroll=(1,6)`.

Finally, I’ll add a new `BitArray`-like type to PaddedMatrices soon.
If any array needs padding, it’s `BitArray`s. SIMD-`CartesianIndex`ing into `BitArray`s is very costly because you have to do a lot of bit operations to load/store the correct vector. If we can guarantee that each column starts with a fresh byte, and not part way in between bytes, we can eliminate most of those operations.
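To illustrate the layout issue with plain Base `BitArray` internals (not the proposed PaddedMatrices type):

```julia
B = falses(10, 3)     # 30 bits total, packed into a single 64-bit chunk
length(B.chunks)      # 1
# Column 2 starts at linear bit 11, i.e. mid-chunk: a SIMD load of column 2
# needs shifts and masks across the shared chunk. If each column were padded
# to start on a fresh chunk, a plain aligned load would suffice.
```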


Many thanks, we are down to `628.492 μs` now.

When I try `unroll=(1,6)` with a tuple, I get this error:

``````
ERROR: MethodError: no method matching bitstore!(::VectorizationBase.PackedStridedBitPointer{1,2}, ::Mask{8,UInt8}, ::SVec{8,Int32})
Stacktrace:
``````

In a loop like this, is there any way to insist that `seen` really remains an integer, not an SVec?

``````
@avx unroll=4 for c in axes(x,2)
    seen = 0
    for r in axes(x,1)
        flag = onlyone(x[r,c] == y[c], seen)
        mask[r,c] = flag
        seen += anyone(flag)
    end
end
``````

I tried, for instance, moving `@avx` to the inner loop, but then I get

``````
ERROR: MethodError: no method matching subsetview(::VectorizationBase.PackedStridedBitPointer{1,2}, ::Val{2}, ::Int64)
``````

This was `Threads.nthreads() == 4`, possibly unwisely, on a 2-core laptop.