# Performance of a transpose compared to C and cache friendliness

In a quest to learn good HPC principles, I am trying to understand how to implement cache-friendly loops in Julia.

I found this loop_blocking code, which compares the performance of loop-blocking techniques between C and Rust.
I tried implementing the same thing in Julia, but I'm confused by the performance difference I'm getting, so I'm writing here to seek help understanding the performance of this code.

```julia
module LoopBlocking

const MAX::Int64 = 8192

function Fill(A::Array{Int64}, B::Array{Int64})
    count::Int64 = 0
    @inbounds for i = 1:MAX
        for j = 1:MAX
            A[i, j] = count
            B[i, j] = -count
            count += 1
        end
    end
end

# simple nested loops
function Add_1(A::Array{Int64}, B::Array{Int64})
    for i = 1:MAX, j = 1:MAX
        @inbounds A[i, j] += B[j, i]
    end
    return nothing
end

# broadcast over a lazy transpose
function Add_2(A::Array{Int64}, B::Array{Int64})
    @inbounds A .+= transpose(B)
    return nothing
end

const BLOCK_SIZE::Int64 = 32

# explicit loop blocking
function Add_3(A::Array{Int64}, B::Array{Int64})
    for i = 1:BLOCK_SIZE:MAX, j = 1:BLOCK_SIZE:MAX
        for ii = i:i+BLOCK_SIZE-1, jj = j:j+BLOCK_SIZE-1
            @inbounds A[ii, jj] += B[jj, ii]
        end
    end
    return nothing
end

# loop blocking with broadcasted views
function Add_4(A::Array{Int64}, B::Array{Int64})
    for i = 1:BLOCK_SIZE:MAX, j = 1:BLOCK_SIZE:MAX
        @inbounds @views A[i:i+BLOCK_SIZE-1, j:j+BLOCK_SIZE-1] .+= transpose(B[j:j+BLOCK_SIZE-1, i:i+BLOCK_SIZE-1])
    end
    return nothing
end

function main()
    A::Array{Int64} = Array{Int64}(undef, MAX, MAX)
    B::Array{Int64} = Array{Int64}(undef, MAX, MAX)

    Fill(A, B)

    println("Simple loop")
    @time Fill(A, B)
    @time Add_1(A, B)

    Correct = copy(A)

    Fill(A, B)

    println("Loop blocking")
    @time Fill(A, B)
    @time Add_3(A, B)

    @assert A == Correct

    Fill(A, B)

    @time Fill(A, B)
    @time Add_2(A, B)

    @assert A == Correct

    Fill(A, B)

    println("Loop blocking with views")
    @time Fill(A, B)
    @time Add_4(A, B)

    @assert A == Correct
end

end
```

These are the results I get:

```text
Simple loop
1.253246 seconds
0.841699 seconds
Loop blocking
1.238872 seconds
0.458186 seconds
1.165040 seconds
0.461195 seconds
Loop blocking with views
1.266697 seconds
0.286722 seconds
```

And here is typical C performance:

```text
MAX:        8192
BLOCK_SIZE: 32
transpose_0:    528ms
transpose_1:    216ms
```

It seems that broadcasting leads to different performance than simple loops, and it is the closest to the C performance.
Can someone help me understand why simple `for` loops cannot achieve the same performance as broadcasting?

P.S.: I’m not entirely sure the last function is correct, but the assert seems to work.

My first guess, without looking closely, is that you're seeing the difference between row-major and column-major arrays.

Not specific to your question, but whenever you benchmark Julia code, be sure to use BenchmarkTools.jl and follow its instructions; this will yield more accurate results for comparison.
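For example, a minimal sketch of that workflow (`Add_1` is the simple-loop kernel from the post above, reproduced with a smaller `MAX` so this snippet stands on its own):

```julia
using BenchmarkTools  # install with: ] add BenchmarkTools

const MAX = 2048  # smaller than the post's 8192 to keep the demo quick

# Same kernel as the post's Add_1 (simple nested loops).
function Add_1(A, B)
    for i = 1:MAX, j = 1:MAX
        @inbounds A[i, j] += B[j, i]
    end
    return nothing
end

A = zeros(Int64, MAX, MAX)
B = zeros(Int64, MAX, MAX)

# `$` interpolates the globals into the benchmark expression, so @btime
# measures the kernel itself rather than untyped-global access overhead;
# it runs multiple samples and reports the minimum time.
@btime Add_1($A, $B)
```

The `$` interpolation matters: without it, the benchmarked expression accesses non-constant globals, which adds dispatch overhead that `@time` in your `main` avoids by passing arguments to a function.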

Your loops are in the wrong order.

Also, there is not much point in blocking loops like this that make only a single pass over the array in order, e.g. to fill it. You only want to block in order to increase temporal locality (e.g. in matrix multiplication; see e.g. this Julia notebook) and/or spatial locality (e.g. for matrix transposition). Update: sorry, I missed that you are computing A+B^T, see below.
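To make the ordering point concrete, here is a sketch of the A += B^T kernel with the loops swapped (hypothetical name `add_colmajor!`, sized from the array rather than a global `MAX` so it runs on its own). Julia arrays are column-major, so the row index `i` should be the innermost loop:

```julia
# Column-major-friendly A += Bᵀ: with the row index i innermost, the writes
# to A walk memory contiguously; the reads B[j, i] are then strided, but
# favoring the array that is both read and written is the better trade.
function add_colmajor!(A::Matrix{Int64}, B::Matrix{Int64})
    n = size(A, 1)
    @inbounds for j = 1:n, i = 1:n
        A[i, j] += B[j, i]
    end
    return nothing
end
```

With this ordering the simple loop should be on par with the broadcast version, since `A .+= transpose(B)` also iterates in `A`'s memory order.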


Thanks for the reply. Changing the order gives the same performance as broadcasting (although both are slightly slower than C).

I thought that since one matrix is transposed, it shouldn't matter which way I order the loops. I guess it's better to follow the matrix I assign to rather than the one I read from?


Oh, I was only looking at your `Fill` function, and missed that you swapped the indices in subsequent functions. If you do A + B^T, then indeed there is a spatial-locality (cache-line) benefit to blocking (or, alternatively, a cache-oblivious recursive strategy), but I would still order the loops for `A` (which you both read and write) rather than `B` (which you only read).
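A cache-oblivious version of the same update might look like the following sketch (hypothetical name `add_transpose_co!`). Instead of a tuned `BLOCK_SIZE`, it recursively halves the larger dimension of the current tile, so at some recursion depth the tiles fit in every cache level automatically:

```julia
# Cache-oblivious A += Bᵀ: recursively split the larger extent of the
# current tile; sufficiently small tiles fall through to a plain
# column-major double loop (the 32 here is just a recursion cutoff,
# not a cache-tuned block size).
function add_transpose_co!(A, B, is = 1:size(A, 1), js = 1:size(A, 2))
    if length(is) <= 32 && length(js) <= 32
        @inbounds for j in js, i in is
            A[i, j] += B[j, i]
        end
    elseif length(is) >= length(js)
        mid = first(is) + length(is) ÷ 2 - 1
        add_transpose_co!(A, B, first(is):mid, js)
        add_transpose_co!(A, B, (mid + 1):last(is), js)
    else
        mid = first(js) + length(js) ÷ 2 - 1
        add_transpose_co!(A, B, is, first(js):mid)
        add_transpose_co!(A, B, is, (mid + 1):last(js))
    end
    return nothing
end
```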


> I would still order the loops for A (which you both read and write) rather than B (which you only read).

That is, in fact, what makes the difference between `Add_3` and `Add_1` on my machine. I used to have some C code in which I used a lot of this sort of blocking, but I don't think I worried about distinguishing reads from writes. It's not really all that small a difference; I probably left some easy performance on the table.
