I saw a great thread on performance at Slack.
I asked to reproduce it here as I think it will be beneficial for many.
The participants were: Alexander Plavin, @danielwe, @jar, @mcabbott .
It started with Alexander’s question:
How to performantly compute a sum / norm / mean / … of a part of an array defined by a mask?
view allocates a new array:
julia> a = rand(100, 100);
julia> b = rand(Bool, 100, 100);
julia> @btime sum(view($a, $b))
  10.750 μs (4 allocations: 39.88 KiB)
2546.49522119921
There were some ideas by Daniel and Michael which I summarized in:
using Pkg;
Pkg.activate(; temp = true);
Pkg.add("BenchmarkTools");
Pkg.add("StatsBase");
Pkg.add("Tullio");
using BenchmarkTools;
using StatsBase;
using Tullio;
vA = rand(1_000_000);
vB = rand(Bool, 1_000_000);
# Summation
@btime sum(view($vA, $vB)); #<! Reference
MaskedSum1( vA :: Vector, vB :: Vector{<:Bool} ) = sum(vA[ii] for ii in eachindex(vA, vB) if vB[ii]);
MaskedSum2( vA :: Vector, vB :: Vector{<:Bool} ) = sum(ai * bi for (ai, bi) in zip(vA, vB));
MaskedSum3( vA :: Vector, vB :: Vector{<:Bool} ) = sum(ifelse(bi, ai, zero(eltype(vA))) for (ai, bi) in zip(vA, vB));
@btime MaskedSum1($vA, $vB);
@btime mapreduce(*, +, $vA, $vB); 
@btime @tullio _ := $vA[i] * $vB[i]; 
@btime @tullio _ := ifelse($vB[i], $vA[i], zero($vA[i]));
@btime MaskedSum2($vA, $vB);
@btime MaskedSum3($vA, $vB); 
# StatsBase.jl (Using the weights)
@btime sum(vA, weights($vB));
# Mean Value
# Can use any of the above as `masekdSum / sum(vB);`
@btime mean($vA, weights($vB)); 
# Standard Deviation
# StatsBase.jl (Using the weights)
# StatBase.jl has `mean_and_std()`
@btime std($vA, AnalyticWeights(vB)); 
@btime std(a for (a, b) in zip($vA, $vB) if b); 
Then there was a discussion about insights / optimizations / weirdness / challenges:
Michael:
Weird that ifelse is quicker, I mean why doesn’t
*(::Float64, ::Bool)do that?NaNshould be the same sinceNaN * false === 0.0.
Daniel:
Btw. worth keeping in mind that
zipjust iterates linearly and stops when the first iterator ends, so it doesn’t error ifaandbdon’t have matching shapes and sizes. That’s why I usedeachindexin my original suggestion. However, bounds checks kill the performance here, so@inboundsis essential:
julia> maskedsum(a, b) = sum(ifelse(b[i], a[i], zero(eltype(a))) for i in eachindex(a, b))
maskedsum (generic function with 1 method)
julia> @btime maskedsum($a, $b)
  90.933 μs (0 allocations: 0 bytes)
2446.0095456237823
julia> maskedsum(a, b) = sum(ifelse(@inbounds(b[i]), @inbounds(a[i]), zero(eltype(a))) for i in eachindex(a, b))
maskedsum (generic function with 1 method)
julia> @btime maskedsum($a, $b)
  25.409 μs (0 allocations: 0 bytes)
2446.0095456237823
Surprised that the compiler fails to elide bounds checks in this case
Alexander:
Generally, working with parts of arrays performantly is nontrivial in Julia…
julia> @btime $a[$b] .*= 1
  28.750 μs (6 allocations: 79.67 KiB)
Surely it shouldn’t need any allocations, right?
So weird! views give more memory allocation, but faster code
julia> @btime $a[$b] .*= 1;
  32.583 μs (6 allocations: 79.67 KiB)
julia> @btime @views $a[$b] .*= 1;
  18.916 μs (12 allocations: 159.34 KiB)
Daniel:
Wonder if it would be feasible to implement non-allocating non-strided views that generate the appropriate loops/filtered iterators in generic code, and only allocate when passed to libraries like LAPACK that require strided arrays?
Michael:
I guess
a[b].=cjust makesview(a,b)first, which allocates the array of indices, and then the broadcast gets something it understands. Which is fine fora[end,:] .= c. With the bit mask… I suppose it’s not so trivial to make this faster in general, as you needsize(a[b])to know what to do? Something likea[b] .= f.(c[b]) ./ 2the broadcast could short-cut, but it’s not set up to remember what to do that many steps deep.
The footgun ina[b] .*= 1is that it isview(a,b) .= a[b] .* 1, only the LHS is automatically a view.
julia> @btime $a[$b] .*= 1;
  14.917 μs (7 allocations: 76.91 KiB)
julia> @btime $a[$b];
  6.760 μs (3 allocations: 38.44 KiB)
julia> @btime view($a, $b) .*= 1;
  6.312 μs (4 allocations: 38.47 KiB)
What on earth the one with
@viewsis doing wrong I don’t know:
julia> @btime @views $a[$b] .*= 1;
  17.875 μs (14 allocations: 153.81 KiB)
Alexander:
Ok probably this is fastest:
julia> @btime @. a = ifelse(b, 1*a, a);
  4.363 μs (2 allocations: 80 bytes)
but definitely nontrivial to discover / understand!
A discussion on generalization is given at Calculate Statistics of an Array based on Labels.
