Performance of a transpose compared to C and cache friendliness

Your loops are in the wrong order.
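
For reference, Julia arrays are stored column-major, so the first index should vary fastest, i.e. it belongs in the innermost loop. Here is a minimal sketch of the two orderings, assuming a simple fill loop. The function names `fill_rowwise!` and `fill_colwise!` are made up for illustration and are not taken from your code:

```julia
# Hypothetical example, not the original poster's code.

# Slow order: the inner loop varies the second index, so consecutive
# iterations touch elements that are size(A, 1) entries apart in memory.
function fill_rowwise!(A::Matrix)
    for i in axes(A, 1)
        for j in axes(A, 2)
            A[i, j] = i + j
        end
    end
    return A
end

# Fast order: the inner loop varies the first index, so consecutive
# iterations touch adjacent memory locations (column-major storage).
function fill_colwise!(A::Matrix)
    for j in axes(A, 2)
        for i in axes(A, 1)
            A[i, j] = i + j
        end
    end
    return A
end
```

Both functions produce the same matrix; only the memory access pattern differs, which is what typically shows up in the timings for large arrays.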

Also, there is little point in blocking loops like this one, which make only a single in-order pass over the array (e.g. to fill it). Blocking is only worthwhile when it increases temporal locality (e.g. in matrix multiplication; see e.g. this Julia notebook) and/or spatial locality (e.g. for matrix transposition). Update: sorry, I missed that you are computing A+B^T; see below.
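
To illustrate where blocking can help with spatial locality, here is a rough sketch of a tiled A+B^T. It is only an illustration under stated assumptions: the function name `add_transpose_blocked!` and the keyword `bs` (tile size) are hypothetical, and the default of 32 is a placeholder to be tuned, not a recommended value:

```julia
# Hypothetical sketch: compute C = A + B^T one tile at a time.
# `bs` is an assumed tuning parameter; a good value depends on the cache size.
function add_transpose_blocked!(C::Matrix, A::Matrix, B::Matrix; bs::Int = 32)
    m, n = size(A)
    size(B) == (n, m) || throw(DimensionMismatch("B must be n-by-m when A is m-by-n"))
    size(C) == (m, n) || throw(DimensionMismatch("C must have the same size as A"))
    for jj in 1:bs:n
        jmax = min(jj + bs - 1, n)
        for ii in 1:bs:m
            imax = min(ii + bs - 1, m)
            # Inside a tile, A and C are accessed down columns (contiguous),
            # while B is accessed along rows (strided). Restricting the work
            # to a small tile keeps the strided accesses to B within a region
            # that can stay in cache while it is reused.
            for j in jj:jmax
                for i in ii:imax
                    @inbounds C[i, j] = A[i, j] + B[j, i]
                end
            end
        end
    end
    return C
end

A = rand(50, 37)
B = rand(37, 50)
C = add_transpose_blocked!(similar(A), A, B; bs = 8)
```

Whether this beats the unblocked loop depends on the matrix size and the machine, so it is worth benchmarking several values of `bs` before drawing conclusions.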
