Improving performance for matrix assembly

I have a working code for a large project in scientific computation. Compared to similar existing codes in C, the Julia code is around 10 times slower, which is most likely due to me being new to Julia. I’ve tried profiling the code, but frankly the output is quite overwhelming and I can’t seem to gain much traction there.

From what I could distill from the various other posts on this forum and on SO, here is my question, with an MWE below: how would you make this code for matrix assembly faster? Please note that this is (obviously) a dummy scaffold; the real ker function is much more complicated.

# used to set a matrix block
function ker(n)
    return rand(n,n)
end

# main body that assembles a matrix block by block
# each block is of size n x n
function mbody()
    m = 100
    n = 5
    A = zeros(n*m,n*m)
    b = zeros(n*m)
    for i=1:m
        i_idx = 1+(i-1)*n:i*n
        for j=1:m
            j_idx = 1+(j-1)*n:j*n
            A[i_idx,j_idx] = ker(n)
        end
        b[i_idx] = rand(n)
    end
end

The slow thing here is that you’re allocating a ton of small arrays and throwing them away: every call to ker creates a new matrix, which then gets copied into A. Instead, pass a view into ker and fill it in place.

1 Like

Concretely,

julia> function ker!(A)
           for i in eachindex(A)
               A[i] = rand()
           end
       end
ker! (generic function with 1 method)

julia> function mbody2()
           m = 100
           n = 5
           A = zeros(n*m,n*m)
           b = zeros(n*m)
           for i=1:m
               i_idx = 1+(i-1)*n:i*n
               for j=1:m
                   j_idx = 1+(j-1)*n:j*n
                   ker!(@views A[i_idx,j_idx])
               end
               b[i_idx] = rand(n)
           end
       end
mbody2 (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime mbody()
  2.373 ms (10103 allocations: 4.36 MiB)

julia> @btime mbody2()
  1.089 ms (103 allocations: 1.92 MiB)
1 Like

Thanks a lot @stillyslalom! The improvement in memory and time is great. When I tried to implement this idea, I ran into some trouble because in my case the function ker accepts multiple arguments. Concretely:

(1) If I replace ker!(@views A[i_idx,j_idx]) by ker!(@views A[i_idx,j_idx],4) where 4 is an extra argument (as an example) and modify the function definition to be ker!(A,z) I get an error of the sort: ERROR: MethodError: no method matching setindex!(::Tuple{SubArray{Float64, 2, Matrix{Float64}, Tuple{UnitRange{Int64}, UnitRange{Int64}}, false}, Int64}, ::Float64, ::Int64).

(2) On the other hand, if I replace ker!(@views A[i_idx,j_idx]) by ker!(A,i_idx,j_idx,4) and within the body of ker!(A,idx1,idx2,z) do A[idx1,idx2] .= rand()+z, I get no errors and the same improvement in performance as you pointed out in your post.
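For reference, a minimal sketch of approach (2), with the dummy rand()+z kernel standing in for the real one:

# approach (2): pass the parent array plus the index ranges
function ker!(A, idx1, idx2, z)
    A[idx1, idx2] .= rand() + z   # one random value, broadcast over the block
end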

Can you help me understand why (1) failed, and if (2) is just as good as your original solution? Thanks!

@views is grabbing the other argument as well: the macro swallows everything up to the closing parenthesis, so ker! ends up being called with a single tuple (view, 4), which then hits setindex! on a Tuple, as in your error message.
You could try to just use @view (which only converts the next indexing expression into a view) or assign the view to a variable beforehand, which usually makes things more readable anyway:

Aview = @view A[...]
ker!(Aview,4)
2 Likes

Thanks @Salmon. This does work, but doesn’t seem optimal. If I follow the approach I suggested in (2) above, i.e. pass in A along with the index ranges, then I observe the following (using @benchmark):
a) The memory and allocation estimates are identical.
b) The approach using Aview is slower by about 15-20%.
Why would this be happening?

Update: the solution was as simple as wrapping @view .. inside () brackets. It has comparable resource utilization to the approach using Aview. Thanks to this link.

1 Like

That is weird; afaik the performance should be nearly identical, assuming you are iterating over the correct memory layout in the ker! function.

Do you mean that if you wrap the @view in brackets the performance loss is gone?
I’m pretty sure

ker!((@view A[...]), 4)

should be lowered to

Aview = @view A[...]
ker!(Aview, 4)

so I’d expect these two to do exactly the same thing.
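As a quick sanity check (a sketch; both expansions should contain the same call to view):

@macroexpand ker!((@view A[i_idx, j_idx]), 4)
@macroexpand (Aview = @view A[i_idx, j_idx]; ker!(Aview, 4))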

Building on @stillyslalom’s implementation, I have changed the indexing order (and put everything inside the function), but I don’t think it makes much of a difference:

julia> function mbody3()
           m = 100
           n = 5
           A = zeros(n*m,n*m)
           b = rand(n*m)
           @inbounds for j=1:m
               j_idx = 1+(j-1)*n:j*n
               for i=1:m
                   i_idx = 1+(i-1)*n:i*n
                   for jj in j_idx
                       for ii in i_idx
                           A[ii, jj] = rand()
                       end
                   end
               end
           end
           nothing
       end

As Julia is column-major, it’s best to get in the habit of having the left-most index vary fastest, i.e. iterating over it in the innermost loop (so swap the i and j from the example). This can sometimes make a difference in performance, but I haven’t run this code on a desktop machine to check.

Also, if you need to set parts of b in the loop, you can get an in-place rand! from the Random standard library.
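For example, a minimal sketch (rand! fills an existing array in place and also works on views):

using Random

rand!(@view b[i_idx])   # fills b[i_idx] in place, no temporary array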

1 Like

Sorry, I should have been clearer. What I meant was that ker!(@views A[i_idx,j_idx],4) gave an error, but doing ker!((@views A[i_idx,j_idx]),4) resolved the error. The performance gap is still there.

1 Like

Ah, I think I understand:
The following two functions do not do the same:

function ker!(A,z)
    for i in eachindex(A)
        A[i] = rand() + z
    end
end
function ker!(A,z)
    A .= rand()+z
end

The first function generates a random number for each index of A and writes it to it.
The second function generates one random number and fills A with it.
Since the second function only has to generate one random number, it is likely faster.
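A quick way to see the difference (a small sketch):

A = zeros(3, 3)
A .= rand() + 1.0    # one value, broadcast everywhere
all(==(A[1]), A)     # true: every entry is identical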

1 Like

You should use @view, not @views, here. You can also use parens, the same way as with function calls:

ker!(@view(A[i_idx,j_idx]), 4)

1 Like

Sorry for the confusion: I accidentally copy-pasted that broken definition from the REPL. The version I actually ran for the timings includes parentheses around @views.

Or just put @views in front of function to make every slice in the function use views. Or, if you want to annotate the use of views at a finer-grained level, I think it is clearer to write:

@views ker!(A[i_idx,j_idx],4)

(I think a lot of people still don’t grok how the @views macro lets you opt in to views for an arbitrary block of code.)
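A sketch of the whole-function form on the running example (mbody4! and its argument list are made up for illustration, and ker! is the two-argument version from above):

# every slicing expression in this function body becomes a view
@views function mbody4!(A, b, n, m)
    for i = 1:m
        i_idx = 1+(i-1)*n:i*n
        for j = 1:m
            j_idx = 1+(j-1)*n:j*n
            ker!(A[i_idx, j_idx], 4)   # a view, not a copy
        end
        b[i_idx] .= rand(n)
    end
end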

Tip: If you change this to

@. A = rand() + z

which is shorthand for A .= rand.() .+ z, then it will call rand() separately for each element of A, just as for your explicit loop.

3 Likes

Yes, but it is better to use the @view macro when applying it locally than to use @views and then awkwardly limit its scope with (@views A[i, j]).

Is there a practical difference between @view(A[i, j]) and @views(A[i, j])?

I don’t think so for slices like these; it’s mostly about being idiomatic. (With all-scalar indices, @views falls back to ordinary indexing while @view makes a zero-dimensional view, but that doesn’t matter here.)

1 Like

The quoted code calls rand() for each element. Generating random numbers takes a bit of effort, and there is specialized code for generating many numbers at once, hence:

rand(nrows, ncols)

is significantly faster for larger matrices. In this case:

A = rand(n*m,n*m)

would accomplish the operation of the whole loop.

2 Likes

In this specific case no, but note the two calling conventions: in the space-separated form, macro arguments are separated by spaces with no commas, while in the parenthesized form they are separated by commas, like a function call:

@macroname arg1 arg2 arg3
@macroname(arg1, arg2, arg3)

Firstly, the actual ker does not return random numbers, so this is slightly beside the point. But, accepting rand for the sake of argument:

The above re-assigns the variable A, while the loop you quoted works in place. Of course, you can write A .= rand(n*m, n*m) instead. That is faster in terms of the rand part, but you have to allocate a temporary array and then copy the data over, so it’s not clear to me that this is always faster.

In terms of the sizes in question, ker returns a 5x5 matrix, so:

julia> @btime A[301:305, 301:305] .= rand(5, 5) setup=(A=rand(500,500));
  145.933 ns (1 allocation: 256 bytes)

julia> @btime A[301:305, 301:305] .= rand.() setup=(A=rand(500,500));
  41.616 ns (0 allocations: 0 bytes)

Even for larger sizes, this holds up.

The comment was about the general performance difference between generating many random values at once vs. one by one, which does exist. The kernel would obviously not be random in real applications.

A = rand(n*m, n*m) replaces the whole loop as well as the allocation of A with zeros, so there is no need for the .= in-place syntax, and the initial zero-filling is unnecessary in this case as well.

Finally, regarding the 5x5 size: the whole A is larger, and my comment specifically says it applies to larger matrices:

julia> function test()
           A = Matrix{Float64}(undef, 100, 100)
           for i in 1:100
               for j in 1:100
                   A[j,i] = rand()
               end
           end
           A
       end
test (generic function with 1 method)

julia> @btime test();
  32.231 μs (2 allocations: 78.17 KiB)

julia> @btime rand(100,100);
  11.287 μs (2 allocations: 78.17 KiB)

This 3x difference is substantial (even though most of it is due to the missing @inbounds; there is still a 50% difference with @inbounds).

In any case, this intuition about batching random-number generation for efficiency is useful to have. But it may not be pertinent to this case.
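If bulk generation does matter in the real application, one compromise (a sketch, assuming A is preallocated as in the loop version) is to fill the existing matrix in one call with Random.rand!, which gets the batched generation without a temporary copy:

using Random

A = Matrix{Float64}(undef, n*m, n*m)
rand!(A)   # batched generation, in place, no temporary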