# Speeding up creation of maximum mask array

I have a program that needs a mask array, created by putting a `1` at each column-wise maximum position and `0` everywhere else. I want the mask to contain exactly one `1` per column.

I can do this by

``````
x = rand(3, 3);
max_ = maximum(x, dims=1)
mask = x .== max_   # 1 where the column maximum occurs
``````

The problem with the above method is that, if `x` has non-unique maxima in some columns, the mask will have more than one `1` in those columns.
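For instance, a minimal illustration of the duplicate-maxima problem (a small hypothetical example, not from the code below):

```julia
x = [1.0 2.0;
     1.0 0.5]                        # column 1 has a tied maximum
mask = x .== maximum(x, dims=1)     # BitMatrix with a 1 at every maximum
sum(mask, dims=1)                   # [2 1] — column 1 contains two 1s
```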

The below function checks and fills `0` in place of the extra `1`s.

``````
function remove_extra_1s(mat)
    @inbounds for i ∈ 1:size(mat, 2)
        count = 0
        for j ∈ 1:size(mat, 1)
            if mat[j, i] == 1
                count += 1
                if count > 1
                    mat[j, i] = 0
                end
            end
        end
    end
    return mat
end
``````

I’ve also used the `findmax` function to get the mask

``````
function findmax_mask(x)
    max_, indices = findmax(x, dims=1)
end
``````

Performance timings:

``````
x = rand(1_000, 1_000);

findmax_mask(x)[2] == remove_extra_1s((x .== maximum(x, dims=1))) # true

@btime findmax_mask($x);
  4.430 ms (8 allocations: 7.65 MiB)

@btime remove_extra_1s(($x .== maximum($x, dims=1)));
  2.176 ms (7 allocations: 134.44 KiB)
``````

I’ve also used LoopVectorization to write a faster `maximum` function. I suspect the `remove_extra_1s` function is the bottleneck.

Is there any way I can speed up the `remove_extra_1s` function? Is there any other faster different way to get the `mask` and `max_`?

You are storing your mask as a matrix of floats here, for the given inputs. That doesn’t make much difference in speed, but it does for memory, and it also doesn’t really make sense semantically.

Preferably use an array of `Bool` or a `BitArray`.
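For the sizes in this thread the difference is substantial; a rough comparison (sizes approximate):

```julia
x = rand(1_000, 1_000)
mask_float = Float64.(x .== maximum(x, dims=1))  # one Float64 per entry
mask_bits  = x .== maximum(x, dims=1)            # BitMatrix, one bit per entry
sizeof(mask_float)            # 8_000_000 bytes (~7.63 MiB)
Base.summarysize(mask_bits)   # ~125_000 bytes (~122 KiB)
```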


Warning: very brittle and ugly code, could have been done better.

``````
using Random
using BenchmarkTools

function findmax_mask(x)
    max_, indices = findmax(x, dims=1)
end

function remove_extra_1s(mat)
    @inbounds for i ∈ 1:size(mat, 2)
        count = 0
        for j ∈ 1:size(mat, 1)
            if mat[j, i] == 1
                count += 1
                if count > 1
                    mat[j, i] = 0
                end
            end
        end
    end
    return mat
end

function myfindmax(m, col)
    v = m[1, col]
    ix = 1
    @inbounds for row in 2:size(m, 1)
        if m[row, col] > v
            ix = row
            v = m[row, col]
        end
    end
    return ix
end

function my_mask_creator(a)
    mask = zeros(Bool, size(a))
    @inbounds for col in 1:size(a, 2)
        mask[myfindmax(a, col), col] = true
    end
    return mask
end

function benchmark()
    Random.seed!(0)
    x = rand(1_000, 1_000);

    @assert findmax_mask(x)[2] == remove_extra_1s((x .== maximum(x, dims=1))) # true

    @btime findmax_mask($x);
    @btime remove_extra_1s(($x .== maximum($x, dims=1)));
    @btime my_mask_creator($x);
    nothing
end

benchmark()
``````

gives

``````  4.782 ms (9 allocations: 7.65 MiB)
2.648 ms (7 allocations: 134.44 KiB)
1.214 ms (2 allocations: 976.70 KiB)
``````

Why do you say it’s brittle? Is it because of the hard-coded loops?

I say it is brittle because:

1. `myfindmax` and `my_mask_creator` do not check on the type of `m` or `a`;
2. `myfindmax` expects `m` to not be empty (otherwise throwing a generic bounds error);
3. `myfindmax` and `my_mask_creator` expect indexes to start at one (which is not always true).
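Point 3 can be sketched like this (hypothetical variant name `myfindmax_axes`; iterating `axes(m, 1)` instead of `1:size(m, 1)` makes the function correct for non-1-based arrays such as OffsetArrays, too):

```julia
# `axes`-based variant of `myfindmax`: identical for ordinary matrices,
# but also correct when the row indices do not start at 1.
function myfindmax_axes(m, col)
    ix = first(axes(m, 1))
    v = m[ix, col]
    for row in axes(m, 1)
        if m[row, col] > v
            ix, v = row, m[row, col]
        end
    end
    return ix
end

myfindmax_axes([1 4; 9 2], 1)  # 2 — the 9 is in row 2
```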

This is quite compact and about 15% faster than Henrique’s function.

``````
function findmax_mask2(x)
    c = 0
    mask = falses(size(x))
    for col in eachcol(x)
        mask[argmax(col), c += 1] = true
    end
    return mask
end

function benchmark()
    Random.seed!(0)
    x = rand(1_000, 1_000)
    @btime my_mask_creator($x)
    @btime findmax_mask2($x)
    nothing
end

benchmark()
``````

gives

``````
  1.170 ms (2 allocations: 976.70 KiB)
  1.040 ms (3 allocations: 122.25 KiB)
``````

Very compact indeed, but note the hidden `c += 1`! You could `enumerate` `eachcol` to get the same effect (saving another line, to boot). The most “robust” way is probably to use the `axes` function, to account for the non-1-based-indexing edge case @Henrique_Becker mentioned, though it’s usually pretty safe to ignore that edge case.
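A sketch of the `enumerate(eachcol(...))` variant (hypothetical name `findmax_mask_enum`):

```julia
function findmax_mask_enum(x)
    mask = falses(size(x))
    for (c, col) in enumerate(eachcol(x))
        mask[argmax(col), c] = true   # first maximum only, no hidden counter
    end
    return mask
end

findmax_mask_enum([1 2; 3 1])  # Bool[0 1; 1 0]
```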


Thank you for all the implementations, I’ve learnt something from each of them.

I’ve found that using multithreading is much slower than just using `@inbounds`. Why is that? Is it spending more time on synchronization?
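For reference, a minimal threaded sketch (not the exact code from this thread) might look like this; with only ~1000 comparisons per column, scheduling overhead can easily dominate:

```julia
function findmax_mask_threads(x)
    # Array{Bool} rather than BitArray: each thread writes its own columns,
    # and concurrent writes to a BitArray would race on shared 64-bit chunks.
    mask = fill(false, size(x))
    Threads.@threads for c in 1:size(x, 2)
        ix = 1
        @inbounds for r in 2:size(x, 1)
            if x[r, c] > x[ix, c]
                ix = r
            end
        end
        mask[ix, c] = true
    end
    return mask
end

findmax_mask_threads([1 2; 3 1])  # Bool[0 1; 1 0]
```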

Times of some attempts, note that my `x` here has some repeated elements:

``````julia> Random.seed!(42); x = rand(randn(1_000), 1_000, 1_000); # some repeats

julia> sum(x .== maximum(x, dims=1), dims=1)
1×1000 Array{Int64,2}:
1  1  2  2  1  2  1  4  2  1  1  1  …  2  1  2  1  3  3  2  4  2  1  3  3

julia> @btime vreduce(max, $x, dims=1);
193.752 μs (1 allocation: 7.94 KiB)

5.286 ms (4 allocations: 7.65 MiB)

julia> @btime remove_extra_1s(($x .== maximum($x, dims=1))); # from vkv
3.063 ms (8 allocations: 134.48 KiB)

julia> @btime my_mask_creator($x); # from Henrique_Becker
1.263 ms (2 allocations: 976.70 KiB)

julia> @btime findmax_mask2($x); # from Seif_Shebl --> 1.667 ms no repeats
2.308 ms (3 allocations: 122.25 KiB)

julia> @btime findmax_mask3($x); # --> 1.149 ms no repeats
998.595 μs (4 allocations: 130.19 KiB)

488.894 μs (95 allocations: 136.55 KiB)

# Attempts at a vectorised version:
julia> @btime findmax_mask4($x); # --> 1.329 ms no repeats
1.328 ms (3 allocations: 984.64 KiB)

797.529 μs (4 allocations: 130.19 KiB)

# Alternative x, some times change, marked -->
julia> Random.seed!(0); x = rand(1_000, 1_000); # no repeats

julia> sum(x .== maximum(x, dims=1), dims=1)
1×1000 Array{Int64,2}:
1  1  1  1  1  1  1  1  1  1  1  1  …  1  1  1  1  1  1  1  1  1  1  1  1

julia> VERSION
v"1.5.0"
``````

All of these are quite far from `vreduce`, which made me think it ought to be possible to do better. `findmax_mask6` is an attempt to do this, with LoopVectorization, but it’s not actually faster. (Edit – some improvements.) But perhaps @Elrod knows a better way?

``````
using LoopVectorization, Tullio, Random, BenchmarkTools

function findmax_mask3(x)
    y = vreduce(max, x, dims=1)
    mask = falses(size(x))
    # mask = fill(false, size(x)) # Array{Bool,2} much the same speed
    for c in axes(x,2)
        @inbounds for r in axes(x,1)
            flag = x[r,c] == y[c]
            mask[r,c] = flag
            flag && break
        end
    end
    mask
end

function findmax_mask4(x)
    @tullio (max) y[c] := x[r,c];
    mask = fill(false, size(x))
    @inbounds for c in axes(x,2)
        for r in axes(x,1)
            flag = x[r,c] == y[c]
            mask[r,c] = flag
            flag && break # this prevents @simd
        end
    end
    mask
end

function findmax_mask5(x)
    y = vreduce(max, x, dims=1)
    # mask = falses(size(x)) # BitArray{2}
    mask = fill(false, size(x)) # Array{Bool,2} is faster
    @inbounds for c in axes(x,2)
        seen = false
        @simd for r in axes(x,1)
            flag = !seen & (x[r,c] == y[c])
            mask[r,c] = flag
            seen |= flag
        end
    end
    mask
end

x0 = rand(1:5, 10, 30) # small, many repeats
x0 = rand(rand(100), 100, 1000); # some repeats

using LoopVectorization: vifelse
SVec{4,Int}(1,2,3,4) # SVec{4,Int64}<1, 2, 3, 4>

@inline onlyone(cond::Bool, seen::Int) = cond && iszero(seen)
@inline function onlyone(cond::Mask{W}, seen::Union{Int,SVec}) where {W}
    allzero(seen) || return zero(cond)
    Mask{W}(hibit(cond.u)) # keep only one set lane of the mask
end
@inline function hibit(n::UInt16) # https://stackoverflow.com/questions/53161/find-the-highest-order-bit-in-c
    n |= (n >>  1);
    n |= (n >>  2);
    n |= (n >>  4);
    n |= (n >>  8);
    return n - (n >> 1)
end
@inline function hibit(n::UInt8)
    n |= (n >>  1);
    n |= (n >>  2);
    n |= (n >>  4);
    return n - (n >> 1)
end

@inline allzero(seen::Int) = iszero(seen)
# @inline allzero(seen::SVec{N,Int}) where {N} = all(ntuple(i -> iszero(seen.data[i].value), N))
@inline allzero(seen::SVec{N,Int}) where {N} = iszero((!iszero(seen)).u)

@inline anyone(cond::Bool) = cond
@inline anyone(cond::Mask) = cond != zero(cond)

function findmax_mask6(x)
    y = vreduce(max, x, dims=1)
    mask = falses(size(x)) # BitArray{2} # faster
    # mask = fill(false, size(x)) # Array{Bool,2}
    @avx for c in axes(x,2)
        seen = 0
        for r in axes(x,1)
            flag = onlyone(x[r,c] == y[c], seen)
            mask[r,c] = flag
            seen += anyone(flag)
        end
    end
    mask
end

findmax_mask6(x0) == findmax_mask2(x0) # false, because it chooses the last not first
``````

How many threads did you use? I find multithreading performance to be inconsistent with matrix sizes.

If you want a more general definition of `hibit`:

``````hibitv2(x) = (one(x) << (8sizeof(x)-1)) >>> leading_zeros(x)
``````
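A quick sanity check of the branch-free version on a few values (mentally traced; zero maps to zero because the shift count equals the full bit width):

```julia
hibitv2(x) = (one(x) << (8sizeof(x) - 1)) >>> leading_zeros(x)

hibitv2(0x0c07)  # 0x0800 — isolates the highest set bit
hibitv2(0x0001)  # 0x0001
hibitv2(0x0000)  # 0x0000 — zero input gives zero
```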

or (using VectorizationBase 0.12.32; the function existed in older versions, but I only just now optimized it)

``````using VectorizationBase: prevpow2
``````

Now, check out the differences in assembly. Original:

``````# julia> @code_native debuginfo=:none syntax=:intel hibit(0x0c07)
.text
movzx   eax, di
shr     eax
or      eax, edi
movzx   eax, ax
mov     ecx, eax
shr     ecx, 2
or      ecx, eax
mov     edx, ecx
shr     edx, 4
or      edx, ecx
mov     eax, edx
shr     eax, 8
or      eax, edx
mov     ecx, eax
shr     ecx
sub     eax, ecx
ret
nop     word ptr cs:[rax + rax]
``````

Version 2:

``````# julia> @code_native debuginfo=:none syntax=:intel hibitv2(0x0c07)
.text
xor     ecx, ecx
lzcnt   ax, di
mov     edx, 32768
shrx    eax, edx, eax
cmovb   eax, ecx
ret
nop     word ptr cs:[rax + rax]
nop
``````

From VectorizationBase:

``````# julia> @code_native debuginfo=:none syntax=:intel prevpow2(0x0c07)
.text
lzcnt   ax, di
mov     ecx, 32768
shrx    eax, ecx, eax
ret
``````

Note that on recent architectures like Skylake `lzcnt` has 3 cycles of latency, instead of the 1 cycle of and/or/shift-type operations (or 2 for `mov[zx]`).
Still, this is a clear win.

Your `anyone` is optimal in terms of assembly, but `any(::Mask)` is already defined.

Another point is that `@avx` won’t unroll much when it doesn’t recognize the functions. Probably simplest to manually try a few settings, e.g. `@avx unroll=(1,6)`.

Finally, I’ll add a new `BitArray`-like type to PaddedMatrices soon.
If any array needs padding, it’s `BitArray`s. SIMD-`CartesianIndex`ing into `BitArray`s is very costly because you have to do a lot of bit operations to load/store the correct vector. If we can guarantee that each column starts with a fresh byte, and not part way in between bytes, we can eliminate most of those operations.
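To illustrate the layout issue with plain Base `BitArray` internals (not the proposed PaddedMatrices type):

```julia
B = falses(10, 3)     # 30 bits total, packed into a single 64-bit chunk
length(B.chunks)      # 1
# Column 2 starts at linear bit 11, i.e. mid-chunk: a SIMD load of column 2
# needs shifts and masks across the shared chunk. If each column were padded
# to start on a fresh chunk, a plain aligned load would suffice.
```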


Many thanks, we are down to `628.492 μs` now.

When I try `unroll=(1,6)` with a tuple, I get this error:

``````
ERROR: MethodError: no method matching bitstore!(::VectorizationBase.PackedStridedBitPointer{1,2}, ::Mask{8,UInt8}, ::SVec{8,Int32})
Stacktrace:
``````

In a loop like this, is there any way to insist that `seen` really remains an integer, not an SVec?

``````
@avx unroll=4 for c in axes(x,2)
    seen = 0
    for r in axes(x,1)
        flag = onlyone(x[r,c] == y[c], seen)
        mask[r,c] = flag
        seen += anyone(flag)
    end
end
``````

I tried, for instance, moving `@avx` to the inner loop, but then I get

``````
ERROR: MethodError: no method matching subsetview(::VectorizationBase.PackedStridedBitPointer{1,2}, ::Val{2}, ::Int64)
``````

This was `Threads.nthreads() == 4`, possibly unwisely, on a 2-core laptop.