# Improving performance for matrix assembly

I have a working code for a large project in scientific computing. Compared to similar existing codes in C, the Julia code is around 10 times slower, which is most likely due to me being new to Julia. I’ve tried profiling the code, but frankly the output is quite overwhelming and I can’t seem to gain much traction there.

From what I could distill from the various other posts on this and SO, here is my question with a MWE below: How would you make this code for matrix assembly faster? Please note that this is (obviously) a dummy scaffold, with the real `ker` function being much more complicated.

```julia
# used to set a matrix block
function ker(n)
    return rand(n, n)
end

# main body that assembles a matrix block by block;
# each block is of size n x n
function mbody()
    m = 100
    n = 5
    A = zeros(n*m, n*m)
    b = zeros(n*m)
    for i = 1:m
        i_idx = 1+(i-1)*n : i*n
        for j = 1:m
            j_idx = 1+(j-1)*n : j*n
            A[i_idx, j_idx] = ker(n)
        end
        b[i_idx] = rand(n)
    end
end
```

The slow thing here is that you’re allocating a ton of small temporary arrays with `ker(n)`, copying each one into `A`, and throwing the temporary away. Instead, pass a view into `ker` and fill that in place.

1 Like

Concretely,

```julia
julia> using BenchmarkTools

julia> function ker!(A)
           for i in eachindex(A)
               A[i] = rand()
           end
       end
ker! (generic function with 1 method)

julia> function mbody2()
           m = 100
           n = 5
           A = zeros(n*m, n*m)
           b = zeros(n*m)
           for i = 1:m
               i_idx = 1+(i-1)*n : i*n
               for j = 1:m
                   j_idx = 1+(j-1)*n : j*n
                   ker!(@views A[i_idx, j_idx])
               end
               b[i_idx] = rand(n)
           end
       end
mbody2 (generic function with 1 method)

julia> @btime mbody()
  2.373 ms (10103 allocations: 4.36 MiB)

julia> @btime mbody2()
  1.089 ms (103 allocations: 1.92 MiB)
```
1 Like

Thanks a lot @stillyslalom! The improvement in memory and time is great. When I tried to implement this idea, I ran into some trouble because in my case, the function `ker` accepts multiple arguments. Concretely:

(1) If I replace `ker!(@views A[i_idx,j_idx])` by `ker!(@views A[i_idx,j_idx],4)` where `4` is an extra argument (as an example) and modify the function definition to be `ker!(A,z)` I get an error of the sort: `ERROR: MethodError: no method matching setindex!(::Tuple{SubArray{Float64, 2, Matrix{Float64}, Tuple{UnitRange{Int64}, UnitRange{Int64}}, false}, Int64}, ::Float64, ::Int64)`.

(2) On the other hand, if I replace `ker!(@views A[i_idx,j_idx])` by `ker!(A,i_idx,j_idx,4)` and within the body of `ker!(A,idx1,idx2,z)` do `A[idx1,idx2] .= rand()+z`, I get no errors and the same improvement in performance as you pointed out in your post.

Can you help me understand why (1) failed, and if (2) is just as good as your original solution? Thanks!

`@views` without parentheses grabs everything after it, so `@views A[i_idx,j_idx], 4` is parsed as the macro applied to the tuple `(A[i_idx,j_idx], 4)`, and that whole tuple is then passed as the single argument to `ker!` (hence the `MethodError` mentioning a `Tuple`).
You could try `@view` (which only converts the next indexing expression into a view) or assign the view to a variable beforehand, which usually makes things more readable anyway:

```julia
Aview = @view A[...]
ker!(Aview, 4)
```
2 Likes

Thanks @Salmon. This does work, but doesn’t seem optimal. If I follow the approach I suggested in (2) above, i.e. send in `A` along with the indexing ranges, then I observe the following (using `@benchmark`):
a) Memory and allocs estimate is identical
b) The approach using `Aview` is slower by about 15-20%
Why would this be happening?

Update: The solution was as simple as wrapping `@view ..` in parentheses. It has comparable resource utilization to the approach using `Aview`. Thanks to this link.

1 Like

That is weird, afaik the performance should be nearly identical, assuming you are iterating over the correct memory layout in the `ker!` function.

Do you mean that if you wrap the `@view` in brackets the performance loss is gone?
I’m pretty sure

```julia
ker!((@view A[...]), 4)
```

should be lowered to

```julia
Aview = @view A[...]
ker!(Aview, 4)
```

so I’d expect these things to do exactly the same.
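One quick way to convince yourself (a sketch) is that `@view A[...]` lowers to a plain `view(A, ...)` call, so binding it to a variable first changes nothing about what `ker!` receives:

```julia
A = rand(10, 10)

v1 = @view A[1:5, 1:5]   # macro form
v2 = view(A, 1:5, 1:5)   # what the macro lowers to

# both are SubArrays sharing memory with the same parent
@assert parent(v1) === A
@assert v1 == v2

v1[1, 1] = -1.0          # writing through one view...
@assert A[1, 1] == -1.0  # ...mutates the parent array
```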

Building on @stillyslalom’s implementation, I have changed the indexing order (also put everything in the function), but I don’t think it makes much of a difference:

```julia
julia> function mbody3()
           m = 100
           n = 5
           A = zeros(n*m, n*m)
           b = rand(n*m)
           @inbounds for j = 1:m
               j_idx = 1+(j-1)*n : j*n
               for i = 1:m
                   i_idx = 1+(i-1)*n : i*n
                   for jj in j_idx
                       for ii in i_idx
                           A[ii, jj] = rand()
                       end
                   end
               end
           end
           nothing
       end
```

As Julia is column-major, it’s best to get in the habit of iterating over the left-most index in the innermost loop (i.e. swap the i and j from the example). This can make a difference in performance sometimes, but I haven’t run this code on a desktop machine to check.
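A minimal sketch of the loop-order point (the function names here are made up for illustration):

```julia
# Julia stores matrices column by column, so the fast loop order
# keeps the row index (the left-most one) in the innermost loop.
function fill_fast!(A)
    for j in axes(A, 2), i in axes(A, 1)   # i varies fastest: contiguous access
        A[i, j] = rand()
    end
    return A
end

function fill_slow!(A)
    for i in axes(A, 1), j in axes(A, 2)   # j varies fastest: strided access
        A[i, j] = rand()
    end
    return A
end
```

Both produce the same result; for large matrices the first version is typically measurably faster because it walks memory contiguously.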

Also, if you need to set parts of b in the loop, you can get an in-place `rand!` from the `Random` standard library.
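For example, a sketch using the in-place `rand!` on a view of `b` (with `i_idx` as in the earlier code):

```julia
using Random  # provides the in-place rand!

n, m = 5, 100
b = zeros(n * m)
for i = 1:m
    i_idx = 1+(i-1)*n : i*n
    rand!(@view b[i_idx])   # fills the slice in place, no temporary array
end
```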

1 Like

Sorry, I should have been clearer. What I meant was that `ker!(@views A[i_idx,j_idx],4)` gave an error, but doing `ker!((@views A[i_idx,j_idx]),4)` resolved the error. The performance gap is still there.

1 Like

Ah, I think I understand:
The following two functions do not do the same:

```julia
# renamed here so both definitions can coexist
function ker_loop!(A, z)
    for i in eachindex(A)
        A[i] = rand() + z
    end
end

function ker_fill!(A, z)
    A .= rand() + z
end
```

The first function generates a random number for each index of `A` and writes it to it.
The second function generates one random number and fills `A` with it.
Since the second function only has to generate one random number, it is likely faster.

1 Like

You should use `@view`, not `@views`. You can also use parens:

```julia
ker!(@view(A[i_idx, j_idx]), 4)
```

Note that you can use parens the same way as with function calls.

1 Like

Sorry for the confusion–I accidentally copy-pasted that broken definition from the REPL. The version I actually ran for timings includes parentheses around `@views`.

Or just put `@views` in front of `function` to make every slice in the function use views. Or, if you want to annotate use of views at a finer-grained level, I think it is clearer to write:

```julia
@views ker!(A[i_idx, j_idx], 4)
```

(I think a lot of people still don’t grok how the `@views` macro allows you to opt-in to views for arbitrary block of code.)
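For instance, a sketch of the whole-function form (based on `mbody2` above, with the stand-in kernel from earlier in the thread):

```julia
# stand-in kernel: fills a block with rand() + z
function ker!(A, z)
    for i in eachindex(A)
        A[i] = rand() + z
    end
end

# With @views in front of the function definition, every slice like
# A[i_idx, j_idx] inside the body becomes a view instead of a copy.
@views function mbody4()
    m, n = 100, 5
    A = zeros(n*m, n*m)
    for i = 1:m, j = 1:m
        i_idx = 1+(i-1)*n : i*n
        j_idx = 1+(j-1)*n : j*n
        ker!(A[i_idx, j_idx], 4)   # A[i_idx, j_idx] is a view here
    end
    return A
end
```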

Tip: If you change this to

```julia
@. A = rand() + z
```

which is shorthand for `A .= rand.() .+ z`, then it will call `rand()` separately for each element of `A`, just as for your explicit loop.

3 Likes

Yes, but it is better to use the `@view` macro when applying it locally than to use `@views` and then awkwardly limit its scope with `(@views A[i, j])`.

Is there a practical difference between `@view(A[i, j])` and `@views(A[i, j])`?

I don’t think so, it’s about being idiomatic, mostly.
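For a single slice they behave the same; the difference is scope. `@view` applies to exactly one indexing expression, while `@views` can wrap an arbitrary expression or block and converts every slice inside it (a quick sketch):

```julia
A = rand(6, 6)

v = @view A[1:3, 1:3]   # @view: one indexing expression only

@views begin            # @views: every slice in the block becomes a view
    top = A[1:3, :]
    bottom = A[4:6, :]
end

@assert v isa SubArray
@assert top isa SubArray && bottom isa SubArray
```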

1 Like

The quoted code calls `rand()` for each element. Generating random numbers takes a bit of effort and has specialized code for generating many numbers at once, hence:

```julia
rand(nrows, ncols)
```

is significantly faster for larger matrices. In this case:

```julia
A = rand(n*m, n*m)
```

would accomplish the operation of the whole loop.

2 Likes

In this specific case, no. But for multi-argument macro calls, the arguments need to be separated by commas in the parenthesized form, and by spaces otherwise:

```julia
(@macroname arg1 arg2 arg3)    # without call parentheses: space-separated
@macroname(arg1, arg2, arg3)   # with call parentheses: comma-separated
```

Firstly, the actual `ker` does not return random numbers, so this is slightly beside the point. But, accepting `rand` for the sake of argument:

The above re-assigns the variable `A`, while the loop you quoted works in-place. Of course, you can write `A .= rand(n*m, n*m)`. So this is faster in terms of the `rand` part, but you need to allocate a temporary array, and then copy over the data. It’s not clear to me that this is always faster.

In terms of the sizes in question, `ker` returns a 5x5 matrix, so:

```julia
julia> @btime A[301:305, 301:305] .= rand(5, 5) setup=(A=rand(500,500));
  145.933 ns (1 allocation: 256 bytes)

julia> @btime A[301:305, 301:305] .= rand.() setup=(A=rand(500,500));
  41.616 ns (0 allocations: 0 bytes)
```

Even for larger sizes, this holds up.

The comment was about the general performance difference between generating many random values at once vs. one-by-one, which exists. The kernel would obviously not be random in real applications.

`A = rand(n*m, n*m)` replaces the whole loop as well as the allocation of `A` with `zeros`, so there is no need for the `.=` syntax for an in-place operation. The initial zero-filling is unnecessary in this case as well.

Finally, regarding the `5x5` size, the whole `A` is larger, and my comment specifically says it applies to larger matrices:

```julia
julia> function test()
           A = Matrix{Float64}(undef, 100, 100)
           for i in 1:100
               for j in 1:100
                   A[j, i] = rand()
               end
           end
           A
       end
test (generic function with 1 method)

julia> @btime test();
  32.231 μs (2 allocations: 78.17 KiB)

julia> @btime rand(100,100);
  11.287 μs (2 allocations: 78.17 KiB)
```

This 3x difference is substantial (even though most of it is due to the missing `@inbounds` - there is still a 50% difference with `@inbounds`).
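For reference, a sketch of the `@inbounds` variant mentioned above:

```julia
function test_inbounds()
    A = Matrix{Float64}(undef, 100, 100)
    @inbounds for i in 1:100, j in 1:100
        A[j, i] = rand()   # bounds checks elided in this loop
    end
    return A
end
```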

In any case, this intuition regarding batching random generation for efficiency is useful to have. But it may not be pertinent to this case.