Populating matrix with elements - improve performance

Does anybody have suggestions on how to speed up the following MWE?

using BenchmarkTools

function kernel0(t,r) # MWE
    K = Array{Float64}(undef,length(t),length(r))
    for (it,t_) in enumerate(t)
        if t_ == 0.0  # happens typically for 1 element of t
            K[it,:] .= 1.0
        else
            ϕ = 2π*52.0*t_./r.^3
            for (ir,ϕ_) in enumerate(ϕ)
                z = sqrt(6ϕ_/π)
                FS,FC = sincos(z) # sincos() here is a placeholder for a function of similar cost
                s,c = sincos(ϕ_)  # this one is actually sincos()
                K[it,ir] = (FC*c + FS*s)/z
            end
        end
    end
    return K
end
# typical length of t and r is 100 to 500, not necessarily the same length
t = range(0,stop=3,length=200)
r = range(1,stop=7,length=200) 

@benchmark kernel0($t,$r)

@code_warntype indicates the function is type stable, but there are still some allocations happening, as indicated by @benchmark. The function allocates K on every call, since it is called with arguments of varying length.

Any help would be appreciated - thank you!

The first thing that comes to mind is accessing elements in memory order. Julia arrays are column-major, so you’ll get better performance if you transpose K so that your innermost loop runs down its columns rather than across its rows.
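For example (a sketch based on the MWE, with K stored transposed as K[ir,it] so the innermost loop walks down a column; kernel1 is just an illustrative name):

```julia
function kernel1(t, r)
    K = Array{Float64}(undef, length(r), length(t))  # transposed: r runs down columns
    for (it, t_) in enumerate(t)          # outer loop over columns
        if t_ == 0.0
            K[:, it] .= 1.0
        else
            for (ir, r_) in enumerate(r)  # inner loop walks down one column
                ϕ_ = 2π*52.0*t_/r_^3
                z = sqrt(6ϕ_/π)
                FS, FC = sincos(z)        # placeholder, as in the MWE
                s, c = sincos(ϕ_)
                K[ir, it] = (FC*c + FS*s)/z
            end
        end
    end
    return K
end
```

Downstream code then indexes K[ir,it] instead of K[it,ir], but the assembly itself touches memory contiguously.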


You can also initialize phi only once and overwrite it, otherwise every iteration it will allocate a new array.
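Concretely, that looks like the following sketch (kernel2 is an illustrative name): allocate ϕ once with similar, then overwrite it in place with a dotted assignment each iteration instead of binding a fresh array.

```julia
function kernel2(t, r)
    K = Array{Float64}(undef, length(t), length(r))
    ϕ = similar(r, Float64)              # allocated once, reused every iteration
    for (it, t_) in enumerate(t)
        if t_ == 0.0
            K[it, :] .= 1.0
        else
            ϕ .= 2π*52.0*t_ ./ r.^3      # in-place broadcast, no new array
            for (ir, ϕ_) in enumerate(ϕ)
                z = sqrt(6ϕ_/π)
                FS, FC = sincos(z)
                s, c = sincos(ϕ_)
                K[it, ir] = (FC*c + FS*s)/z
            end
        end
    end
    return K
end
```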


The above comments are great, and to add on: I personally also like to move the per-element work into its own kernel function and keep the function that populates the array separate:

kernel(x, y, other_args...) = ...  # per-element computation goes here
function kernelmatrix(kernelfunction, x_list, y_list, other_args...)
  buf = Array{...}(undef, length(x_list), length(y_list))
  for k in 1:length(y_list) # Threads.@threads?
    yk = y_list[k]
    for j in 1:length(x_list) # LoopVectorization.@turbo?
      @inbounds buf[j,k] = kernelfunction(x_list[j], yk, other_args...)
    end
  end
  return buf
end
To be clear, I wouldn’t actually splat/slurp arguments like that in something that you want to be max performance (although maybe the compiler is actually good enough that it wouldn’t really hurt you), but my point is primarily to refactor a bit and move the kernel evaluations into its own function. That may not make much of a performance difference, but at the very least it de-clutters the matrix assembly code and makes optimizations a bit easier to see. I’ve put a few in the comments in the above snippet as an example. And beyond making the code cleaner and more modular, I have seen some cases where it can help performance. Particularly if it forces you to work harder to make no allocations in the calculations for individual entries, because that will really help with getting a good speedup with multithreading.
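To make the threading remark concrete, here is a minimal sketch of that pattern with Threads.@threads over the columns (the entry function and all names are illustrative, not from the original post; it assumes the per-entry kernel allocates nothing):

```julia
using Base.Threads

# Hypothetical allocation-free per-entry kernel, purely for illustration.
entry(x, y) = sin(x*y)/(1 + x^2 + y^2)

function kernelmatrix_threaded(kernelfunction, x_list, y_list)
    buf = Matrix{Float64}(undef, length(x_list), length(y_list))
    Threads.@threads for k in eachindex(y_list)   # parallelize over columns
        yk = y_list[k]
        for j in eachindex(x_list)                # contiguous column fill
            @inbounds buf[j, k] = kernelfunction(x_list[j], yk)
        end
    end
    return buf
end
```

Because each task writes to disjoint columns, no locking is needed; start Julia with e.g. `julia -t 4` for this to use multiple threads.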

EDIT: oh, also: I would do a bit more pre-processing. For example, the row where you expect all ones can presumably just be handled manually, outside the main loop.
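A sketch of that idea (assuming, as in the MWE, that the rows with t == 0 are exactly the all-ones rows; kernel_pre is an illustrative name):

```julia
function kernel_pre(t, r)
    K = Array{Float64}(undef, length(t), length(r))
    for it in findall(iszero, t)   # pre-process: fill the known all-ones rows up front
        K[it, :] .= 1.0
    end
    for (it, t_) in enumerate(t)
        t_ == 0.0 && continue      # already handled above, so the hot loop is branch-free
        for (ir, r_) in enumerate(r)
            ϕ = 2π*52.0*t_/r_^3
            z = sqrt(6ϕ/π)
            FS, FC = sincos(z)
            s, c = sincos(ϕ)
            K[it, ir] = (FC*c + FS*s)/z
        end
    end
    return K
end
```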