I have working code for a large project in scientific computing. Compared to similar existing codes in C, the Julia code is around 10 times slower, which is most likely due to me being new to Julia. I’ve tried profiling the code, but frankly the output is quite overwhelming and I can’t seem to gain much traction there.

From what I could distill from various other posts here and on SO, here is my question, with an MWE below: how would you make this code for matrix assembly faster? Please note that this is (obviously) a dummy scaffold; the real ker function is much more complicated.

# used to set a matrix block
function ker(n)
    return rand(n, n)
end

# main body that assembles a matrix block by block;
# each block is of size n x n
function mbody()
    m = 100
    n = 5
    A = zeros(n*m, n*m)
    b = zeros(n*m)
    for i = 1:m
        i_idx = 1+(i-1)*n : i*n
        for j = 1:m
            j_idx = 1+(j-1)*n : j*n
            A[i_idx, j_idx] = ker(n)
        end
        b[i_idx] = rand(n)
    end
end

julia> function ker!(A)
           for i in eachindex(A)
               A[i] = rand()
           end
       end
ker! (generic function with 2 methods)

julia> function mbody2()
           m = 100
           n = 5
           A = zeros(n*m, n*m)
           b = zeros(n*m)
           for i = 1:m
               i_idx = 1+(i-1)*n : i*n
               for j = 1:m
                   j_idx = 1+(j-1)*n : j*n
                   ker!(@views A[i_idx, j_idx])
               end
               b[i_idx] = rand(n)
           end
       end
mbody2 (generic function with 1 method)

julia> @btime mbody()
  2.373 ms (10103 allocations: 4.36 MiB)

julia> @btime mbody2()
  1.089 ms (103 allocations: 1.92 MiB)

Thanks a lot @stillyslalom! The improvement in memory and time is great. When I tried to implement this idea, I ran into some trouble because in my case, the function ker accepts multiple arguments. Concretely:

(1) If I replace ker!(@views A[i_idx,j_idx]) by ker!(@views A[i_idx,j_idx],4) where 4 is an extra argument (as an example) and modify the function definition to be ker!(A,z) I get an error of the sort: ERROR: MethodError: no method matching setindex!(::Tuple{SubArray{Float64, 2, Matrix{Float64}, Tuple{UnitRange{Int64}, UnitRange{Int64}}, false}, Int64}, ::Float64, ::Int64).

(2) On the other hand, if I replace ker!(@views A[i_idx,j_idx]) by ker!(A,i_idx,j_idx,4) and within the body of ker!(A,idx1,idx2,z) do A[idx1,idx2] .= rand()+z, I get no errors and the same improvement in performance as you pointed out in your post.

Can you help me understand why (1) failed, and if (2) is just as good as your original solution? Thanks!

@views seems to be doing something with the other argument as well.
You could try to just use @view (which should only convert the next expression into a view), or assign the view to a variable first, which usually makes things more readable anyway:
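Both suggestions look roughly like this (a minimal sketch; ker! here is the fill-in kernel from earlier in the thread, extended with the extra argument z):

```julia
function ker!(A, z)
    for i in eachindex(A)
        A[i] = rand() + z
    end
end

A = zeros(10, 10)
i_idx, j_idx = 1:5, 6:10

# Option 1: @view converts only the next indexing expression into a view
ker!(@view(A[i_idx, j_idx]), 4)

# Option 2: assign the view to a variable first, then pass it along
Aview = view(A, i_idx, j_idx)
ker!(Aview, 4)
```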

Thanks @Salmon. This does work, but doesn’t seem optimal. If I follow the approach I suggested in (2) above, i.e. pass in A along with the indexing ranges, then I observe the following (using @benchmark):
a) The memory and allocation estimates are identical.
b) The approach using Aview is slower by about 15-20%.
Why would this be happening?

Update: The solution was as simple as wrapping @view ... inside () brackets. It has resource utilization comparable to the approach using Aview. Thanks to this link.
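For reference, the parenthesized form looks like this (a minimal sketch with a dummy ker!):

```julia
function ker!(A, z)
    for i in eachindex(A)
        A[i] = rand() + z
    end
end

A = zeros(4, 4)
# The parentheses stop the macro from also swallowing the second argument:
ker!((@view A[1:2, 1:2]), 4)
```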

Building on @stillyslalom’s implementation, I have changed the indexing order (and also put everything in the function), but I don’t think it makes much of a difference:

julia> function mbody3()
           m = 100
           n = 5
           A = zeros(n*m, n*m)
           b = rand(n*m)
           @inbounds for j = 1:m
               j_idx = 1+(j-1)*n : j*n
               for i = 1:m
                   i_idx = 1+(i-1)*n : i*n
                   for jj in j_idx
                       for ii in i_idx
                           A[ii, jj] = rand()
                       end
                   end
               end
           end
           nothing
       end

As Julia is column-major, it’s best to get in the habit of iterating over the left-most index in the innermost loop (i.e. swap the i and j from the example). This can sometimes make a difference in performance, but I haven’t run this code on a desktop machine to check.

Also, if you need to set parts of b in the loop, you can get an in-place rand! from the Random standard library.
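For example (a sketch; rand! fills an existing array, or a view of one, in place):

```julia
using Random

n, m = 5, 4
b = zeros(n*m)
for i in 1:m
    i_idx = 1+(i-1)*n : i*n
    rand!(@view b[i_idx])  # fills the slice in place, no temporary vector
end
```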

Sorry, I should have been clearer. What I meant was that ker!(@views A[i_idx,j_idx],4) gave an error, but doing ker!((@views A[i_idx,j_idx]),4) resolved the error. The performance gap is still there.

Ah, I think I understand:
The following two functions do not do the same thing:

function ker!(A, z)
    for i in eachindex(A)
        A[i] = rand() + z
    end
end

function ker!(A, z)
    A .= rand() + z
end

The first function generates a random number for each index of A and writes it there.
The second function generates one random number and fills A with it.
Since the second function only has to generate one random number, it is likely faster.
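A quick way to see the difference (a sketch, shifting by 1.0 so the filled entries are nonzero):

```julia
# Broadcast semantics: rand() is evaluated once, then copied everywhere
B = zeros(3, 3)
B .= rand() + 1.0        # every entry of B is the same number

# Loop semantics: one fresh draw per element
C = zeros(3, 3)
for i in eachindex(C)
    C[i] = rand() + 1.0  # entries of C differ from one another
end
```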

Sorry for the confusion–I accidentally copy-pasted that broken definition from the REPL. The version I actually ran for timings includes parentheses around @views.

Or just put @views in front of function to make every slice in the function use views. Or, if you want to annotate use of views at a finer-grained level, I think it is clearer to write:

@views ker!(A[i_idx,j_idx],4)

(I think a lot of people still don’t grok how the @views macro allows you to opt-in to views for arbitrary block of code.)
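For instance, a sketch of the whole-function form:

```julia
# `@views` in front of the function definition turns every slice
# in the body into a view:
@views function fill_block!(A, i_idx, j_idx, z)
    block = A[i_idx, j_idx]  # a view, not a copy
    block .= z
end

A = zeros(6, 6)
fill_block!(A, 1:3, 4:6, 2.5)
```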

Tip: If you change this to

@. A = rand() + z

which is shorthand for A .= rand.() .+ z, then it will call rand() separately for each element of A, just as for your explicit loop.

The quoted code calls rand() for each element. Generating random numbers takes some effort, and there is specialized code for generating many numbers at once, hence:

rand(nrows, ncols)

is significantly faster for larger matrices. In this case:

Firstly, the actual ker does not return random numbers, so this is slightly beside the point. But, accepting rand for the sake of argument:

The above re-assigns the variable A, while the loop you quoted works in place. Of course, you can write A .= rand(n*m, n*m). That makes the rand part faster, but you need to allocate a temporary array and then copy the data over, so it’s not clear to me that this is always faster overall.
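The Random standard library’s rand! avoids that temporary while keeping the bulk-generation path (a sketch comparing the two):

```julia
using Random

n, m = 5, 100
A = zeros(n*m, n*m)

A .= rand(n*m, n*m)  # bulk generation, but allocates a temporary matrix

rand!(A)             # bulk generation written directly into A, no temporary
```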

In terms of the sizes in question, ker returns a 5x5 matrix, so:

The comment was about the general performance difference between generating many random values at once vs. one by one, which exists. The kernel would obviously not be random in real applications.

rand(nrows, ncols) replaces the whole loop as well as the allocation of A with zeros, so there is no need for the .= syntax for an in-place operation; the initial zero-filling is unnecessary in this case as well.

Finally, regarding the 5x5 size: the whole of A is larger, and my comment specifically says it applies to larger matrices:

julia> function test()
           A = Matrix{Float64}(undef, 100, 100)
           for i in 1:100
               for j in 1:100
                   A[j, i] = rand()
               end
           end
           A
       end
test (generic function with 1 method)

julia> @btime test();
  32.231 μs (2 allocations: 78.17 KiB)

julia> @btime rand(100,100);
  11.287 μs (2 allocations: 78.17 KiB)

This 3x difference is substantial (even though most of it is due to the missing @inbounds; there is still a 50% difference with @inbounds).

Finally, this intuition about batching random-number generation for efficiency is useful to have, but it may not be pertinent to this case.
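Still, if the kernel really were just random fill, one way to get both the in-place blocks and the batched generation is rand! on each view (a sketch building on mbody2/mbody3; mbody4 is a name made up for this example):

```julia
using Random

function mbody4()
    m, n = 100, 5
    A = Matrix{Float64}(undef, n*m, n*m)
    b = Vector{Float64}(undef, n*m)
    for j in 1:m
        j_idx = 1+(j-1)*n : j*n
        for i in 1:m
            i_idx = 1+(i-1)*n : i*n
            rand!(@view A[i_idx, j_idx])  # batched fill of one n x n block, in place
        end
    end
    rand!(b)
    return A, b
end
```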