Calculating Statistics of an Array with a Mask (Masked Array)

I saw a great thread on performance on Slack.
I asked for permission to reproduce it here, as I think it will be beneficial to many.

The participants were Alexander Plavin, @danielwe, @jar, and @mcabbott.

It started with Alexander’s question:

How to performantly compute a sum / norm / mean / … of a part of an array defined by a mask?
view allocates a new array:

julia> a = rand(100, 100);
julia> b = rand(Bool, 100, 100);

julia> @btime sum(view($a, $b))
  10.750 μs (4 allocations: 39.88 KiB)
2546.49522119921
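The allocation comes from the logical mask being materialized: constructing a view with a `Bool` mask collects the mask into an index vector. A quick check (not part of the original thread):

```julia
a = rand(100, 100);
b = rand(Bool, 100, 100);

v   = view(a, b);            # the Bool mask is collected into indices here
idx = parentindices(v)[1];   # the index vector backing the view
@assert length(idx) == count(b)
@assert sum(v) == sum(a[idx])
```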

There were some ideas by Daniel and Michael which I summarized in:

using Pkg;
Pkg.activate(; temp = true);
Pkg.add("BenchmarkTools");
Pkg.add("StatsBase");
Pkg.add("Tullio");
using BenchmarkTools;
using StatsBase;
using Tullio;

vA = rand(1_000_000);
vB = rand(Bool, 1_000_000);

# Summation
@btime sum(view($vA, $vB)); #<! Reference

MaskedSum1( vA :: Vector, vB :: Vector{Bool} ) = sum(vA[ii] for ii in eachindex(vA, vB) if vB[ii]);
MaskedSum2( vA :: Vector, vB :: Vector{Bool} ) = sum(ai * bi for (ai, bi) in zip(vA, vB));
MaskedSum3( vA :: Vector, vB :: Vector{Bool} ) = sum(ifelse(bi, ai, zero(eltype(vA))) for (ai, bi) in zip(vA, vB));

@btime MaskedSum1($vA, $vB);
@btime mapreduce(*, +, $vA, $vB); 
@btime @tullio _ := $vA[i] * $vB[i]; 
@btime @tullio _ := ifelse($vB[i], $vA[i], zero($vA[i]));
@btime MaskedSum2($vA, $vB);
@btime MaskedSum3($vA, $vB); 
# StatsBase.jl (Using the weights)
@btime sum($vA, weights($vB));

# Mean Value
# Can use any of the above as `maskedSum / sum(vB)`;
@btime mean($vA, weights($vB)); 

# Standard Deviation
# StatsBase.jl (Using the weights)
# StatsBase.jl has `mean_and_std()`
@btime std($vA, AnalyticWeights($vB));
@btime std(a for (a, b) in zip($vA, $vB) if b); 
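For reference, all of the variants above compute the same quantity; a quick sanity check against the allocating baseline (a self-contained sketch mirroring the definitions in the script):

```julia
using StatsBase; # for `weights`

vC = rand(1_000);
vM = rand(Bool, 1_000);

sRef = sum(view(vC, vM)); #<! Allocating reference
@assert sum(vC[ii] for ii in eachindex(vC, vM) if vM[ii]) ≈ sRef
@assert sum(ci * mi for (ci, mi) in zip(vC, vM)) ≈ sRef
@assert sum(ifelse(mi, ci, zero(eltype(vC))) for (ci, mi) in zip(vC, vM)) ≈ sRef
@assert sum(vC, weights(vM)) ≈ sRef
```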

Then there was a discussion about insights / optimizations / weirdness / challenges:

Michael:

Weird that ifelse is quicker, I mean why doesn’t *(::Float64, ::Bool) do that? NaN should be the same since NaN * false === 0.0.
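Michael's observation is easy to verify: `false` acts as a strong zero in floating-point multiplication, so the `*` and `ifelse` formulations agree even when masked-out entries are `NaN`. A quick check:

```julia
@assert NaN * false === 0.0 # `false` is a strong zero
@assert Inf * false === 0.0

a = [1.0, NaN, 3.0];
b = [true, false, true];
@assert sum(ai * bi for (ai, bi) in zip(a, b)) == 4.0
@assert sum(ifelse(bi, ai, 0.0) for (ai, bi) in zip(a, b)) == 4.0
```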

Daniel:

Btw. worth keeping in mind that zip just iterates linearly and stops when the first iterator ends, so it doesn’t error if a and b don’t have matching shapes and sizes. That’s why I used eachindex in my original suggestion. However, bounds checks kill the performance here, so @inbounds is essential:

julia> maskedsum(a, b) = sum(ifelse(b[i], a[i], zero(eltype(a))) for i in eachindex(a, b))
maskedsum (generic function with 1 method)

julia> @btime maskedsum($a, $b)
  90.933 μs (0 allocations: 0 bytes)
2446.0095456237823

julia> maskedsum(a, b) = sum(ifelse(@inbounds(b[i]), @inbounds(a[i]), zero(eltype(a))) for i in eachindex(a, b))
maskedsum (generic function with 1 method)

julia> @btime maskedsum($a, $b)
  25.409 μs (0 allocations: 0 bytes)
2446.0095456237823

Surprised that the compiler fails to elide bounds checks in this case :confused:

Alexander:

Generally, working with parts of arrays performantly is nontrivial in Julia…

julia> @btime $a[$b] .*= 1
  28.750 μs (6 allocations: 79.67 KiB)

Surely it shouldn’t need any allocations, right?
So weird! views give more memory allocation, but faster code

julia> @btime $a[$b] .*= 1;
  32.583 μs (6 allocations: 79.67 KiB)

julia> @btime @views $a[$b] .*= 1;
  18.916 μs (12 allocations: 159.34 KiB)

Daniel:

Wonder if it would be feasible to implement non-allocating non-strided views that generate the appropriate loops/filtered iterators in generic code, and only allocate when passed to libraries like LAPACK that require strided arrays?

Michael:

I guess a[b].=c just makes view(a,b) first, which allocates the array of indices, and then the broadcast gets something it understands. Which is fine for a[end,:] .= c . With the bit mask… I suppose it’s not so trivial to make this faster in general, as you need size(a[b]) to know what to do? Something like a[b] .= f.(c[b]) ./ 2 the broadcast could short-cut, but it’s not set up to remember what to do that many steps deep.
The footgun in a[b] .*= 1 is that it is view(a,b) .= a[b] .* 1 , only the LHS is automatically a view.

julia> @btime $a[$b] .*= 1;
  14.917 μs (7 allocations: 76.91 KiB)

julia> @btime $a[$b];
  6.760 μs (3 allocations: 38.44 KiB)

julia> @btime view($a, $b) .*= 1;
  6.312 μs (4 allocations: 38.47 KiB)

What on earth the one with @views is doing wrong I don’t know:

julia> @btime @views $a[$b] .*= 1;
  17.875 μs (14 allocations: 153.81 KiB)
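Michael's point about the lowering can be checked by comparing allocations: in `a[b] .*= 1`, the right-hand side `a[b]` always materializes a copy, whereas spelling out the view avoids it (a sketch):

```julia
a = rand(100, 100);
b = rand(Bool, 100, 100);

f1!(a, b) = (a[b] .*= 1; nothing);       # lowers to view(a, b) .= a[b] .* 1
f2!(a, b) = (view(a, b) .*= 1; nothing); # only the view's index vector allocates

f1!(a, b); f2!(a, b); # warm up to exclude compilation
@assert (@allocated f2!(a, b)) < (@allocated f1!(a, b))
```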

Alexander:

Ok probably this is fastest:

julia> @btime @. a = ifelse(b, 1*a, a);
  4.363 μs (2 allocations: 80 bytes)

but definitely nontrivial to discover / understand!
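The same fused-`ifelse` pattern works for any in-place masked update; for example, doubling only the masked entries (a sketch, with `2a` standing in for the actual computation):

```julia
a = [1.0, 2.0, 3.0, 4.0];
b = [true, false, true, false];

@. a = ifelse(b, 2a, a); # Fused broadcast: no temporaries, unmasked entries untouched
@assert a == [2.0, 2.0, 6.0, 4.0]
```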

A discussion on generalization is given in Calculate Statistics of an Array based on Labels.


This is a pretty common stumbling block, unfortunately: