What is the fastest way to update a particular row of a matrix using dot syntax?

To reproduced that on your own computer, use M = rand(498, 498); or M = rand(502, 502); instead.
But the phenomenon won’t be as extreme.

Basically, the loop without @inbounds isn’t slow because checking bounds is slow, but because currently checking bounds prevents SIMD (doing multiple things per CPU instruction).
When you load or store memory with a SIMD instruction, it takes about twice as long when that load/store crosses a 64-byte boundary (or, a CPU is capable of half as many such operations per clock cycle).
64-bytes is 8 Float64 (8*sizeof(Float64) == 64), and the start of big Arrays in Julia are aligned to such a boundary.

Let’s start with column 3, which is 1000 elements from the start. We have such a boundary every 8 elements. So because 1000 is a multiple of 8, that means we have:

  • boundary between 1000 and 1001
  • load and then store numbers from 1001 to 1008
  • boundary between 1008 and 1009
  • load and then store numbers from 1009 to 10016
  • boundary between 1016 and 1017

etc. That is, our loads and stores never cross one of these boundaries.

Low lets look at column 2. It starts 500 elements from the beginning of the array. 500 is not a multiple of 8, but 496 and 504 are. So we have

  • load and then store numbers from 501 to 508, crossing 504-505 boundary
  • load and then store numbers from 509 to 516, crossing 512-513 boundary

etc. Every one of the loads/stores crosses such a boundary, making them all slow.
Note that most laptop and desktop computers will load and store 4 Float64 at a time instead of 8. This means for them, things would instead look like this

  • load and then store numbers from 501 to 504
  • boundary between 504 and 505
  • load and then store numbers from 505 to 508
  • load and then store numbers from 509 to 512
  • boundary between 512 and 513
  • load and then store numbers from 513 to 516

Meaning it will avoid crossing any such boundaries, so performance should be the same.
Change the size of the array so that the number of columns isn’t a multiple of 4, such as making it 498x498 or 499x499, and then you’ll see differences in speed between columns, where those columns where iszero(size(A,1)*(col_num-1) % 4) will be faster than those where that isn’t true.

Note that the speed difference wont be as extreme as my example. If you’re loading/storing 8 numbers at a time, every load will miss when not aligned to 8-number boundaries. But if you’re misaligned and only loading/storing 4 at a time, half of the loads and stores will still fall in between the boundaries!

5 Likes

That’s fascinating. Thanks for the clear explanation!

(Btw, I just wanted to say I love your work on LoopVectorization.jl. I’m looking forward to seeing more SIMD magic from you!).