function foo(x)
    s = zero(eltype(x))
    @inbounds for i in eachindex(x)
        s = x[i ⊻ s]
    end
    return s
end;
What is the point of the author’s frequent use of the XOR operator when indexing arrays?
The result of i ⊻ s is the number whose bits are 1 where the corresponding bits of i and s differ, and 0 otherwise. I just don’t see the connection to using this as an index…
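A quick sanity check of that bitwise description (my own example, not from the post):

```julia
# 6 = 0b110 and 3 = 0b011 differ in the low and high bits, so:
6 ⊻ 3              # == 5 == 0b101
UInt8(2) ⊻ 0x03    # == 0x01; works the same for any integer type
```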
On another note, why does he need to use eltype(x) when he just wants to use a 0 for indexing an array?
x is intended to be an array, eltype(x) gets the type of x’s elements, and zero(eltype(x)) makes a zero of that type. A call like foo([UInt8(1)]) would then start with s = UInt8(0).
A literal 0 can only have the type Int. That would make s take on more than one concrete type over the course of the method, i.e. a type instability, which forces the compiler to emit extra instructions to handle each possible type. Two types don’t hurt performance that much, but techniques like zero(eltype(x)) are common in order to achieve type stability as much as possible.
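A minimal way to see this (my own illustration, the function names are made up): compare an accumulator started with a literal 0 against one started with zero(eltype(x)).

```julia
function unstable_sum(x)
    s = 0                   # s starts as Int ...
    for v in x
        s += v              # ... and becomes eltype(x) if that isn't Int
    end
    return s                # inferred as a Union of both types
end

function stable_sum(x)
    s = zero(eltype(x))     # s has the element type from the start
    for v in x
        s += v
    end
    return s                # single concrete type throughout
end

# @code_warntype unstable_sum(Float32[1, 2, 3])  # highlights s::Union{Float32, Int64}
```

Both return the same value; the difference only shows up in the inferred types and the generated instructions.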
I think it’s only there to generate instructions for demonstration. Assuming x is a UInt8 vector that starts with UInt8(1), we always start with s = UInt8(0) and i = 1, thus i ⊻ s == 1 and s = x[1]. The next iteration starts with s = UInt8(1) and i = 2, thus i ⊻ s == 3 and s = x[3]. There you can see the problem: if we only had 2 elements to begin with, we index out of bounds. The @inbounds macro removes bounds-check error handling and is intended to improve performance when the coder knows the indices can’t go out of bounds. This method can easily access memory unsafely, so I can’t see a practical use for it.
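To watch the index sequence without risking unsafe reads, here’s a hypothetical variant (my own, with bounds checks left on) that records the indices foo would visit:

```julia
# Trace helper: throws a BoundsError instead of reading bad memory.
function trace_indices(x)
    s = zero(eltype(x))
    visited = Int[]
    for i in eachindex(x)
        push!(visited, i ⊻ s)
        s = x[i ⊻ s]
    end
    return visited
end

trace_indices(UInt8[1, 2, 2])   # visits indices [1, 3, 1]
trace_indices(UInt8[1, 2])      # BoundsError: tries to read x[3]
```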
I don’t know about the posted code (which might be a toy example?), but there are other practical algorithms that do complicated-looking xor-based calculations to repeatedly update array indices in a loop, e.g. CRC32 checksums.
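For instance, the inner loop of table-driven CRC-32 xors the running checksum with each input byte to form a table index. A sketch of that standard algorithm (polynomial 0xedb88320, as in gzip/zlib; written from the well-known technique, not taken from the post):

```julia
# Precompute the 256-entry lookup table for the reflected CRC-32 polynomial.
const CRC32_TABLE = let
    tbl = Vector{UInt32}(undef, 256)
    for n in 0:255
        c = UInt32(n)
        for _ in 1:8
            c = isodd(c) ? (c >> 1) ⊻ 0xedb88320 : c >> 1
        end
        tbl[n + 1] = c
    end
    tbl
end

function crc32(data::AbstractVector{UInt8})
    crc = ~UInt32(0)
    for b in data
        # xor picks the table index; the shift folds in the rest of the state
        crc = CRC32_TABLE[((crc ⊻ b) & 0xff) + 1] ⊻ (crc >> 8)
    end
    return ~crc
end

crc32(Vector{UInt8}("123456789"))  # 0xcbf43926, the standard check value
```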
Author here. I use xor for a few different reasons. In the random_access function, I use it in order to randomize the index at every iteration to defeat the prefetcher. I could use rand to randomize, or similar, but I wanted to make sure that the time spent randomizing the indices would be dwarfed by the cache misses (which honestly would probably still be the case), and xor + a shift is fast.
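A rough sketch of that idea (my reconstruction for this thread, not the blog’s exact code): scramble a running value with a shift and an xor, then mask it into bounds.

```julia
# Sketch only; assumes length(x) is a power of two so `& mask` keeps
# the index in 1:length(x). The function name is mine, not the blog's.
function random_access_sketch(x)
    s = UInt(1)
    mask = UInt(length(x) - 1)
    out = zero(eltype(x))
    @inbounds for _ in eachindex(x)
        s ⊻= s << 13             # cheap xorshift-style scramble
        out += x[(s & mask) + 1] # index jumps around unpredictably
    end
    return out
end
```

The scramble costs one shift and one xor per iteration, so virtually all of the runtime on a large array comes from the cache misses the unpredictable indices cause.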
In the linear_access function, I don’t want to sabotage the prefetcher, but I want the function to have assembly as similar as possible to random_access, so that when the two are contrasted, the difference in performance comes purely from whether the prefetcher is able to do its work or not.
I also want to defeat other optimizations, such as autovectorisation, which is not introduced until later in the blog. I believe alignment_test and foo would both vectorise if I didn’t insert an xor.
More generally, the examples in the blog post had to be constructed a little carefully, because in each section I introduce a single concept and want the performance of that function to illustrate that concept only. In reality, the performance of a function is impacted by many different aspects. So if you’re not careful, you end up with examples impacted by multiple different concepts, some of which have not yet been explained, and you would have to write:
In this example, foo is faster than bar because it exploits concept X. Okay actually the output shows bar is faster, but this is due to a different memory dependency in the code that messes up instruction level parallelism in foo. Sure you don’t know what that means until the next chapter, but trust me bro, concept X matters a lot and foo ought to be faster even though it isn’t.