I saw a great thread on performance on Slack.

I asked for permission to reproduce it here, as I think it will be beneficial for many.

The participants were Alexander Plavin, @danielwe, @jar, and @mcabbott.

It started with Alexander’s question:

How to performantly compute a sum / norm / mean / … of a part of an array defined by a mask?

`view` with a Boolean mask allocates a new array:

```
julia> a = rand(100, 100);
julia> b = rand(Bool, 100, 100);
julia> @btime sum(view($a, $b))
10.750 μs (4 allocations: 39.88 KiB)
2546.49522119921
```

Daniel and Michael offered several ideas, which I summarized in:

```
using Pkg;
Pkg.activate(; temp = true);
Pkg.add("BenchmarkTools");
Pkg.add("StatsBase");
Pkg.add("Tullio");
using BenchmarkTools;
using StatsBase;
using Tullio;

vA = rand(1_000_000);
vB = rand(Bool, 1_000_000);

# Summation
@btime sum(view($vA, $vB)); #<! Reference
MaskedSum1(vA::Vector, vB::Vector{Bool}) = sum(vA[ii] for ii in eachindex(vA, vB) if vB[ii]);
MaskedSum2(vA::Vector, vB::Vector{Bool}) = sum(ai * bi for (ai, bi) in zip(vA, vB));
MaskedSum3(vA::Vector, vB::Vector{Bool}) = sum(ifelse(bi, ai, zero(eltype(vA))) for (ai, bi) in zip(vA, vB));
@btime MaskedSum1($vA, $vB);
@btime mapreduce(*, +, $vA, $vB);
@btime @tullio _ := $vA[i] * $vB[i];
@btime @tullio _ := ifelse($vB[i], $vA[i], zero($vA[i]));
@btime MaskedSum2($vA, $vB);
@btime MaskedSum3($vA, $vB);
# StatsBase.jl (using the weights)
@btime sum($vA, weights($vB));

# Mean Value
# Can use any of the above as maskedSum / sum(vB);
@btime mean($vA, weights($vB));

# Standard Deviation
# StatsBase.jl (using the weights)
# StatsBase.jl has `mean_and_std()`
@btime std($vA, AnalyticWeights($vB));
@btime std(a for (a, b) in zip($vA, $vB) if b);
```
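On the `mean_and_std()` remark in the script: a minimal sketch of using it with the mask as analytic weights. This assumes StatsBase's `mean_and_std(x, w::AbstractWeights)` method; with 0/1 weights, the corrected weighted statistics coincide with the plain statistics of the masked subset.

```
using StatsBase, Statistics

vA = rand(1_000);
vB = isodd.(1:1_000);  # deterministic Bool mask for the demo

# One call returns both statistics; with 0/1 analytic weights the
# corrected results match the sample mean/std of the masked elements.
m, s = mean_and_std(vA, AnalyticWeights(vB); corrected = true);
```

The correction factor for analytic weights reduces to `1 / (k - 1)` when the weights are 0/1 with `k` ones, which is why this agrees with `std(vA[vB])`.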

Then there was a discussion about insights / optimizations / weirdness / challenges:

Michael:

Weird that `ifelse` is quicker; I mean, why doesn't `*(::Float64, ::Bool)` do that? `NaN` should be the same, since `NaN * false === 0.0`.

Daniel:

Btw. worth keeping in mind that `zip` just iterates linearly and stops when the first iterator ends, so it doesn't error if `a` and `b` don't have matching shapes and sizes. That's why I used `eachindex` in my original suggestion. However, bounds checks kill the performance here, so `@inbounds` is essential:

```
julia> maskedsum(a, b) = sum(ifelse(b[i], a[i], zero(eltype(a))) for i in eachindex(a, b))
maskedsum (generic function with 1 method)
julia> @btime maskedsum($a, $b)
90.933 μs (0 allocations: 0 bytes)
2446.0095456237823
julia> maskedsum(a, b) = sum(ifelse(@inbounds(b[i]), @inbounds(a[i]), zero(eltype(a))) for i in eachindex(a, b))
maskedsum (generic function with 1 method)
julia> @btime maskedsum($a, $b)
25.409 μs (0 allocations: 0 bytes)
2446.0095456237823
```

Surprised that the compiler fails to elide the bounds checks in this case.
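For completeness, the other common route is an explicit loop. This is a sketch, not something from the thread: `eachindex(a, b)` throws unless `a` and `b` share axes, which is exactly what makes the `@inbounds` annotation safe, and the branch-free `ifelse` lets `@simd` vectorize the reduction.

```
# Branch-free masked sum as an explicit loop (sketch).
function masked_sum_loop(a::AbstractVector, b::AbstractVector{Bool})
    s = zero(eltype(a))
    # eachindex(a, b) errors on mismatched axes, so @inbounds is safe.
    @inbounds @simd for i in eachindex(a, b)
        s += ifelse(b[i], a[i], zero(eltype(a)))
    end
    return s
end

a = rand(1_000);
b = rand(Bool, 1_000);
masked_sum_loop(a, b)
```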

Alexander:

Generally, working with parts of arrays performantly is nontrivial in Julia…

```
julia> @btime $a[$b] .*= 1
28.750 μs (6 allocations: 79.67 KiB)
```

Surely it shouldn’t need any allocations, right?

So weird! Views give more memory allocation, but faster code:

```
julia> @btime $a[$b] .*= 1;
32.583 μs (6 allocations: 79.67 KiB)
julia> @btime @views $a[$b] .*= 1;
18.916 μs (12 allocations: 159.34 KiB)
```

Daniel:

Wonder if it would be feasible to implement non-allocating non-strided views that generate the appropriate loops/filtered iterators in generic code, and only allocate when passed to libraries like LAPACK that require strided arrays?

Michael:

I guess `a[b] .= c` just makes `view(a, b)` first, which allocates the array of indices, and then the broadcast gets something it understands. Which is fine for `a[end, :] .= c`. With the bit mask… I suppose it's not so trivial to make this faster in general, as you need `size(a[b])` to know what to do? With something like `a[b] .= f.(c[b]) ./ 2` the broadcast could short-cut, but it's not set up to remember what to do that many steps deep.

The footgun in `a[b] .*= 1` is that it is `view(a, b) .= a[b] .* 1`; only the LHS is automatically a view.

```
julia> @btime $a[$b] .*= 1;
14.917 μs (7 allocations: 76.91 KiB)
julia> @btime $a[$b];
6.760 μs (3 allocations: 38.44 KiB)
julia> @btime view($a, $b) .*= 1;
6.312 μs (4 allocations: 38.47 KiB)
```
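Michael's point about the lowering can be checked directly: `a[b] .*= 2` and `view(a, b) .= a[b] .* 2` mutate the array identically, with the right-hand `a[b]` being a plain allocating `getindex` in both spellings. A small sanity check, not from the thread:

```
a = rand(100, 100);
b = rand(Bool, 100, 100);

a1 = copy(a);
a2 = copy(a);

a1[b] .*= 2;                # sugar for: view(a1, b) .= a1[b] .* 2
view(a2, b) .= a2[b] .* 2;  # RHS a2[b] is still a plain, allocating getindex

a1 == a2
```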

What on earth the one with `@views` is doing wrong, I don't know:

```
julia> @btime @views $a[$b] .*= 1;
17.875 μs (14 allocations: 153.81 KiB)
```

Alexander:

OK, this is probably the fastest:

```
julia> @btime @. a = ifelse(b, 1*a, a);
4.363 μs (2 allocations: 80 bytes)
```

but definitely nontrivial to discover / understand!
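Alexander's `ifelse` trick generalizes to any element-wise in-place masked update. The helper below (`masked_map!` is a hypothetical name, not from the thread or any library) is a sketch of that pattern; note the caveat that `f` runs on every element, with only the masked results kept.

```
# In-place masked update without allocating an index set
# (hypothetical helper; f is evaluated on every element).
function masked_map!(f, a::AbstractArray, m::AbstractArray{Bool})
    @. a = ifelse(m, f(a), a)
    return a
end

a = [1.0, 2.0, 3.0, 4.0];
m = [true, false, true, false];
masked_map!(x -> 2x, a, m)  # a becomes [2.0, 2.0, 6.0, 4.0]
```

Because `f` is applied unconditionally, this only works when `f` is safe to call on the unmasked elements too.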

A discussion on generalization is given in “Calculate Statistics of an Array based on Labels”.