In a tutorial about computer hardware (https://viralinstruction.com/posts/hardware/), I came across the following Julia code:
```julia
s = zero(eltype(x))
@inbounds for i in eachindex(x)
    s = x[i ⊻ s]
end
```
What is the point of the author’s frequent use of the XOR operator when indexing arrays?
The result of `i ⊻ s` is the number whose bits are 1 where the corresponding bits of `i` and `s` differ, and 0 otherwise. I just don’t see the connection to using this as an index…
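For reference, the bitwise behaviour can be checked directly at the REPL (my own illustration, not from the tutorial):

```julia
# ⊻ (xor) sets a bit exactly where the operands' bits differ
a = 0b1100
b = 0b1010
@assert a ⊻ b == 0b0110   # 12 ⊻ 10 == 6

# xor of a value with itself cancels out to 0
@assert 5 ⊻ 5 == 0

# the kind of "jumpy" index the loop produces
@assert 2 ⊻ 1 == 3
```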
On another note, why does he need to use `eltype(x)` when he just wants to use a 0 for indexing an array?
`x` is intended to be an array, `eltype(x)` gets the type of its elements, and `zero(eltype(x))` makes a zero of that type. The call `foo([UInt8(1)])` would then start with `s = UInt8(0)`.
A literal `0` could only have the type `Int`. That would make `s` take on more than one concrete type throughout the method, aka a type instability that the compiler has to emit more instructions to handle. Two types don’t hurt performance that much, but techniques like `zero(eltype(x))` are common in order to achieve type stability as much as possible.
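A minimal sketch of the difference (my own example; `sum_literal` and `sum_typed` are made-up names):

```julia
# starting from a literal 0 makes s an Int on entry and a Float64 after
# the first addition when x is a Float64 vector → type-unstable
function sum_literal(x)
    s = 0
    for i in eachindex(x)
        s += x[i]
    end
    return s
end

# zero(eltype(x)) makes s a Float64 from the very start → type-stable
function sum_typed(x)
    s = zero(eltype(x))
    for i in eachindex(x)
        s += x[i]
    end
    return s
end

x = [1.5, 2.5]
@assert sum_literal(x) == sum_typed(x) == 4.0
# @code_warntype sum_literal(x) highlights s as Union{Int64, Float64}
```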
I think it’s only to generate instructions to demonstrate. Assuming `x` is a `UInt` vector that starts with `UInt(1)`, we always start with `s = UInt(0)` and `i = 1`, thus `i ⊻ s == 1` and `s = x[1]`. The next iteration starts with `s = UInt(1)` and `i = 2`, thus `i ⊻ s == 3` and `s = x[3]`. There you can see the problem: if we only had 2 elements to begin with, we would index out of bounds. The `@inbounds` macro removes bounds-check error handling and is intended to improve performance when the coder knows they can’t go out of bounds. This method can easily access memory unsafely; I can’t see a practical use for it.
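That trace can be reproduced directly (my own sketch; a 4-element vector, so the third index happens to stay in bounds):

```julia
x = UInt[1, 3, 2, 4]
s = zero(eltype(x))   # s = UInt(0)

s = x[1 ⊻ s]          # 1 ⊻ 0 == 1 → s = x[1] == UInt(1)
s = x[2 ⊻ s]          # 2 ⊻ 1 == 3 → s = x[3] == UInt(2)

# with only two elements, x[3] would throw a BoundsError; under
# @inbounds it would silently read out-of-bounds memory instead
```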
I don’t know about the posted code (which might be a toy example?), but there are other practical algorithms that do complicated-looking xor-based calculations to repeatedly update array indices in a loop, e.g. CRC32 checksums.
It definitely helps that the table has 256 elements and the index is always a `UInt8` value added to 1.
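A sketch of that xor-driven table lookup in a byte-at-a-time CRC32 (the standard reflected polynomial; my own minimal implementation, not code from this thread):

```julia
# 256-entry table: the CRC contribution of each possible byte value
const CRC_TABLE = map(0:255) do i
    c = UInt32(i)
    for _ in 1:8
        c = isodd(c) ? (c >>> 1) ⊻ 0xEDB88320 : c >>> 1
    end
    c
end

function crc32(data::AbstractVector{UInt8})
    crc = 0xFFFFFFFF
    for b in data
        # xor of the low CRC byte with the data byte selects the table slot;
        # + 1 because Julia arrays are 1-based
        crc = (crc >>> 8) ⊻ CRC_TABLE[((crc ⊻ b) % UInt8) + 1]
    end
    return crc ⊻ 0xFFFFFFFF
end

crc32(collect(codeunits("123456789")))  # standard check value 0xcbf43926
```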
Author here. I use xor for a few different reasons. In the `random_access` function, I use it in order to randomize the index at every iteration to defeat the prefetcher. I could use `rand` to randomize, or similar, but I wanted to make sure that the time spent randomizing the indices would be dwarfed by the cache misses (which honestly would probably still be the case), and xor + a shift is fast.
In the `linear_access` function, I don’t want to sabotage the prefetcher, but I want the function’s assembly to be as similar as possible to that of `random_access`, so that when they are contrasted, the difference in performance is purely from whether the prefetcher is able to do its work or not.
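A sketch of what such a pair might look like (my reconstruction of the idea, not necessarily the blog’s exact code; it assumes `data` has a power-of-two length so that `& mask` keeps the index in bounds):

```julia
function random_access(data::Vector{UInt}, iters::Integer)
    n = rand(UInt)
    mask = UInt(length(data) - 1)
    @inbounds for _ in 1:iters
        # shift + xor scrambles n, so the next index is unpredictable
        n = (n >>> 7) ⊻ data[(n & mask) + 1]
    end
    return n
end

function linear_access(data::Vector{UInt}, iters::Integer)
    n = rand(UInt)
    mask = UInt(length(data) - 1)
    @inbounds for i in 1:iters
        # same shift + xor body, but the index walks forward predictably
        n = (n >>> 7) ⊻ data[(UInt(i) & mask) + 1]
    end
    return n
end
```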
I also want to defeat other optimizations, such as autovectorisation, which is not introduced until later in the blog. I believe both functions would vectorise if I didn’t insert an xor.
More generally, the examples in the blog post had to be constructed a little carefully, because in each section, I introduce a single concept and want the performance of that function to explore that concept only. In reality, the performance of a function is impacted by many different aspects. So if you’re not careful, you end up with examples impacted by multiple different concepts, some of which have not yet been explained, and you would have to write: “In this example, `foo` is faster than `bar` because it exploits concept X. Okay, actually the output shows `bar` is faster, but this is due to a different memory dependency in the code that messes up instruction-level parallelism in `foo`. Sure, you don’t know what that means until the next chapter, but trust me bro, concept X matters a lot and `foo` ought to be faster even though it isn’t.”
Thank you for the explanation. Did not expect an answer from the author himself!
Thanks for the explanation!