What is the fastest way to update a particular row of a matrix using dot syntax?

Elrod · June 23, 2020, 8:40pm

To reproduced that on your own computer, use M = rand(498, 498); or M = rand(502, 502); instead.
But the phenomenon won’t be as extreme.

Basically, the loop without @inbounds isn’t slow because checking bounds is slow, but because currently checking bounds prevents SIMD (doing multiple things per CPU instruction).
When you load or store memory with a SIMD instruction, it takes about twice as long when that load/store crosses a 64-byte boundary (or, a CPU is capable of half as many such operations per clock cycle).
64-bytes is 8 Float64 (8*sizeof(Float64) == 64), and the start of big Arrays in Julia are aligned to such a boundary.

Let’s start with column 3, which is 1000 elements from the start. We have such a boundary every 8 elements. So because 1000 is a multiple of 8, that means we have:

boundary between 1000 and 1001
load and then store numbers from 1001 to 1008
boundary between 1008 and 1009
load and then store numbers from 1009 to 10016
boundary between 1016 and 1017

etc. That is, our loads and stores never cross one of these boundaries.

Low lets look at column 2. It starts 500 elements from the beginning of the array. 500 is not a multiple of 8, but 496 and 504 are. So we have

load and then store numbers from 501 to 508, crossing 504-505 boundary
load and then store numbers from 509 to 516, crossing 512-513 boundary

etc. Every one of the loads/stores crosses such a boundary, making them all slow.
Note that most laptop and desktop computers will load and store 4 Float64 at a time instead of 8. This means for them, things would instead look like this

load and then store numbers from 501 to 504
boundary between 504 and 505
load and then store numbers from 505 to 508
load and then store numbers from 509 to 512
boundary between 512 and 513
load and then store numbers from 513 to 516

Meaning it will avoid crossing any such boundaries, so performance should be the same.
Change the size of the array so that the number of columns isn’t a multiple of 4, such as making it 498x498 or 499x499, and then you’ll see differences in speed between columns, where those columns where iszero(size(A,1)*(col_num-1) % 4) will be faster than those where that isn’t true.

Note that the speed difference wont be as extreme as my example. If you’re loading/storing 8 numbers at a time, every load will miss when not aligned to 8-number boundaries. But if you’re misaligned and only loading/storing 4 at a time, half of the loads and stores will still fall in between the boundaries!

tomohiro_soejima · June 25, 2020, 7:58pm

That’s fascinating. Thanks for the clear explanation!

(Btw, I just wanted to say I love your work on LoopVectorization.jl. I’m looking forward to seeing more SIMD magic from you!).

Topic		Replies	Views
Fast diag(A'*B) Numerics question	6	674	October 1, 2019
Diagonal elements of matrix product General Usage question , performance , linearalgebra	12	2580	November 2, 2022
Fast `diag(A' * B * A)` General Usage performance , linearalgebra	22	862	May 4, 2023
Why is dot plus faster? Performance question	6	515	May 28, 2022
Speed up sparse matrix multiplication Performance	8	1735	August 31, 2020

What is the fastest way to update a particular row of a matrix using dot syntax?

Related topics