I’m trying to understand the interaction between array views and Boolean indexing.
In general, my understanding (and experience thus far) is that views generally increasing performance when you’re not modifying the array slices. For example:
x = collect(1:1000)
@btime z = x[50:200];
107.278 ns (1 allocation: 1.33 KiB)
@btime z = @view x[50:200];
19.084 ns (1 allocation: 48 bytes)
However, now do the equivalent operations using Boolean indexing, and the performance is reversed:
w = (x .>= 50) .& (x .<= 200);
@btime z = $x[$w];
167.926 ns (3 allocations: 1.38 KiB)
@btime z = @view $x[$w];
210.821 ns (6 allocations: 1.45 KiB)
Interesting. So I go to the Julia manual to read more about logical indexing. The manual states:
Indexing by a boolean vector B is effectively the same as indexing by the vector of integers that is returned by findall(B)…It is generally more efficient to use boolean arrays as indices directly instead of first calling findall.
Let’s see how explicitly calling findall affects performance:
The extra function call decreases performance in the first case (as expected), but actually increases performance when using @view!
So that I know when to use @view in my own code, I’m trying to wrap my head around what’s going here. Is this just a corner case where @view is not performing well because it hasn’t been optimized for logical indexing? Is there anyway to get the combined performance of logical indexing and views simultaneously? Is there a deeper insight here I should be absorbing about how views work?
Good questions and astute observations. Efficiently making use of logical indexing in views is indeed a challenge. Let’s make things a little more random for the sake of argument:
x = rand(1000)
w = x .> 0.5
Now, when you do r = x[w] (without @view), we can completely avoid allocating the array of true indices. We can just walk through x and w simultaneously and only put the true elements in the result r. Easy peasy and super efficient.
On the other hand, when you do v = @view x[w], we need to give you back an array that supports random access. Folks expect v to by and large act just like r from above… and could easily write something like:
s = 0.0
for i in eachindex(v)
s += v[i]
end
Now, can you see the problem if we avoid constructing the array of indices a la findall? Every time we index into v, we need to walk through the mask and find the ith true value. Suddenly what had would have been an O(n) algorithm with regular indexing becomes O(n^2) with views! Definitely not what you’d want. So we effectively do findall for views.
Why doesn’t it perform the same then? Good question. They use different implementations due to subtle differences in how they compute their lengths and if I remember right findall recently got some big perf optimizations that haven’t made it to logical indices. But it’s been a while since I’ve looked closely at this.
In addition to @mbauman’s excellent explanation of this particular case, I don’t think you will find a general rule for this. Depending on memory access patterns, array sizes, index mappings (think: sparse arrays), either view or instantiation of the subarray may be more efficient. Copying data is not always bad.
I would suggest that you
don’t worry about this too much until you have benchmarked and profiled your code,
keep your code modular and flexible so that you can change this easily if necessary,
in the final stage of working on a problem, benchmark and optimize if necessary.
It was the other way around
We first got the good logical indexing code and it was not trivial to turn that into fast findall code, because findall needs to build cartesian indices.
Ultimately the issue is that view(vec, logical_index) needs to store some variant of findall(logical_index) (hopefully linear indices; if we are unlucky, then cartesian indices), which tends to be of similar size or larger than vec[logical_index]. So there is no performance or memory gain from the view (unless you have arrays of very large structs), and taking a copy is often faster. But we cannot take a copy instead, because we need to support mutation semantics (writes into a view must mutate the underlying array).
So the TLDR is: Mixing views and logical indices is currently no good.
There are hopefully upcoming optimizations for broadcasting with logical indices, at the next window for breaking changes (i.e. 2.0). During broadcasts with logical indices, it is theoretically possible to avoid allocating the findall(logical_index). That is because broadcasts only need an iterator-view, not a view that supports random access. Unfortunately the current broadcasting spec does not permit optimizing assignments dst[logical_index] .= rhs to avoid storage of the findall (because we need to return the left-hand-side as a view that allows random access), so we need to wait until this can be implemented.
The findall and iteration over logical indices is pretty fast by now. Nevertheless, if you need to use the same logical index for multiple views into arrays that support linear indexing, you should consider writing something like
(note that the explicit indices gets reused for many views! Also note that you can iterate over Base.LogicalIndex{Int}(cond) pretty quickly if you don’t need views; then you can save the large alloc of indices).
don’t worry about this too much until you have benchmarked and profiled your code,
keep your code modular and flexible so that you can change this easily if necessary,
in the final stage of working on a problem, benchmark and optimize if necessary.
Benchmarking and profiling real code where I was repeatedly doing logical indexing in a for loop was in fact what brought this to my attention. The simplified example in my question was merely for purposes of illustration/discussion.
I must admit the unpredictability of how views will perform is slightly off-putting.
These are very helpful explanations of what the bottlenecks are – thank you!
Unfortunately I can’t reuse the logical index in my particular use case – I need to repeatedly recalculate the index in a for loop. I’ll stick to instantiation of subarrays for now, and will consider revisiting views for this particular problem when broadcasting with logical indices has been optimized.
for prob in outer_loop
cond = compute_bitmatrix(args...)
for idx in Base.LogicalIndex{Int}(cond)
#do something
end
end
That is, you can maybe avoid broadcasting and views alltogether and inline your kernel. That would avoid the issue and allow you to use the fast logical indexing code. I.e. it is much faster than the naive (but equivalent)
for prob in outer_loop
cond = compute_bitarray(args...)
for idx in 1:length(cond)
cond[idx] || continue
#do something
end
end
Regardless of whether you use subarrays or views, at some point you need to grab the indices of all set bits in a bitarray. First, you want to use the base code for that (either by looking at it or by calling into base), because it makes proper use of the BMI instructions that are available on x86. Second you want to do this only once, if possible. If you take multiple subarrays with the same logical index, then you are doing this multiple times. Optimized or not, the fastest code is code that does not run.
Another solution that I’ve used in the past is to broadcast ifelse over the logical mask — which allows it to fuse with the logical mask creation itself as well as other parts of the expression. The possibility of using this trick depends upon your ability to pick a value for the false case that doesn’t contribute to the final answer or can be removed later.
A bunch of us had wanted to switch to using views by default back in the 0.3/0.4 timeframe… but even when ignoring logical indexing and the construction of the temporary object that sometimes is used to represent views, there are so many subtleties to modern computer memory access latencies that it’s not a clear win. Had it been a clear win, we would have bit the bullet and made views the default for indexing.