Populating matrix with elements - improve performance

stst · August 2, 2021, 9:22pm

Does anybody have suggestions on how to speed up the following MWE?

using BenchmarkTools

function kernel0(t,r) # MWE
    K = Array{Float64}(undef,length(t),length(r))
    for (it,t_) in enumerate(t)
        if t_ == 0.0  # happens typically for 1 element of t
            K[it,:] .= 1.0
        else
            ϕ = 2π*52.0*t_./r.^3
            for (ir,ϕ_) in enumerate(ϕ)
                z = sqrt(6ϕ_/π)
                FS,FC = sincos(z) # sincos() here is a placeholder for function of similar cost
                s,c = sincos(ϕ_)  # this is actually sincos()
                K[it,ir] = (FC*c + FS*s)/z
            end
        end
    end
    return K  # without this the function returns `nothing` (the value of the for loop)
end

# typical length of t and r is 100 to 500, not necessarily the same length
t = range(0,stop=3,length=200)
r = range(1,stop=7,length=200) 

@benchmark kernel0($t,$r)

@code_warntype indicates the function is type stable, but @benchmark still reports some allocations. The function allocates K itself, since it is called with arguments of varying length.

Any help would be appreciated - thank you!

rdeits · August 2, 2021, 9:39pm

The first thing that comes to mind is accessing elements in memory order. In particular, you’ll get better performance if you transpose K so that your innermost loop goes along its columns rather than rows.
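To illustrate the memory-order point: Julia arrays are column-major, so the first index should change fastest in the innermost loop. The sketch below is a generic demonstration, not the original kernel; the function names `fill_rowwise!` and `fill_colwise!` are made up for illustration. Both produce the same matrix, but only the second walks memory contiguously.

```julia
# Inner loop over the second index: consecutive writes are a full
# column length apart in memory (strided access).
function fill_rowwise!(A)
    for i in axes(A, 1), j in axes(A, 2)   # the last range listed is the innermost loop
        A[i, j] = i + j
    end
    return A
end

# Inner loop over the first index: consecutive writes are adjacent
# in memory, which matches Julia's column-major layout.
function fill_colwise!(A)
    for j in axes(A, 2), i in axes(A, 1)
        A[i, j] = i + j
    end
    return A
end

A = fill_colwise!(zeros(3, 4))
B = fill_rowwise!(zeros(3, 4))
```

Timing both with @benchmark on a matrix of realistic size should show whether the ordering matters for a given problem.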

gbaraldi · August 2, 2021, 9:59pm

You can also allocate ϕ only once and overwrite it in place; otherwise every iteration allocates a new array.
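A minimal sketch of that idea, assuming the rest of the MWE stays as posted: the buffer ϕ is created once outside the loop and refilled with an in-place broadcast. The name `kernel_prealloc` is hypothetical.

```julia
function kernel_prealloc(t, r)
    K = Array{Float64}(undef, length(t), length(r))
    ϕ = Vector{Float64}(undef, length(r))   # allocated once, reused for every t_
    for (it, t_) in enumerate(t)
        if t_ == 0.0
            K[it, :] .= 1.0
        else
            @. ϕ = 2π * 52.0 * t_ / r^3     # fused in-place broadcast, no new array
            for (ir, ϕ_) in enumerate(ϕ)
                z = sqrt(6ϕ_ / π)
                FS, FC = sincos(z)          # placeholder, as in the original post
                s, c = sincos(ϕ_)
                K[it, ir] = (FC * c + FS * s) / z
            end
        end
    end
    return K
end

t = range(0, stop=3, length=5)
r = range(1, stop=7, length=4)
K = kernel_prealloc(t, r)
```

With this change the only remaining allocations should be K and the one-time buffer, which @benchmark can confirm.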

cgeoga · August 2, 2021, 10:27pm

The above comments are great, and to add on: I personally also like to move as much as possible into a separate kernel function and then make the function that populates an array separate:

kernel(x, y, other_args...) = ...
function kernelmatrix(kernelfunction, x_list, y_list, other_args...)
  buf = Array{...}(undef, length(x_list), length(y_list))
  for k in 1:length(y_list) # Threads.@threads?
    yk = y_list[k]
    for j in 1:length(x_list) # LoopVectorization.@turbo?
      @inbounds buf[j,k] = kernelfunction(x_list[j], yk, other_args...)
    end
  end
  buf
end

To be clear, I wouldn’t actually splat/slurp arguments like that in something that you want to be max performance (although maybe the compiler is actually good enough that it wouldn’t really hurt you), but my point is primarily to refactor a bit and move the kernel evaluations into their own function. That may not make much of a performance difference, but at the very least it de-clutters the matrix assembly code and makes optimizations a bit easier to see. I’ve put a few in the comments in the above snippet as an example. And beyond making the code cleaner and more modular, I have seen some cases where it can help performance, particularly if it forces you to work harder to make no allocations in the calculations for individual entries, because that will really help with getting a good speedup with multithreading.

EDIT: oh, also: I would just do a bit more pre-processing. For example, that row where you expect it to be all ones can presumably just be handled manually.
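Applied to the MWE from the first post, the refactoring might look roughly like the sketch below. The names `kernel_entry` and the Float64 element type are assumptions for illustration, and the zero case is handled as an early return in the scalar kernel, which is only one possible reading of the pre-processing remark.

```julia
# Hypothetical scalar kernel for a single (t, r) pair; allocates nothing.
function kernel_entry(t_, r_)
    t_ == 0.0 && return 1.0          # the all-ones row, handled up front
    ϕ_ = 2π * 52.0 * t_ / r_^3
    z = sqrt(6ϕ_ / π)
    FS, FC = sincos(z)               # placeholder, as in the original post
    s, c = sincos(ϕ_)
    return (FC * c + FS * s) / z
end

# Matrix assembly, kept separate from the per-entry math.
function kernelmatrix(f, x_list, y_list)
    buf = Array{Float64}(undef, length(x_list), length(y_list))
    for k in eachindex(y_list)
        yk = y_list[k]
        for j in eachindex(x_list)   # inner loop runs down a column of buf
            buf[j, k] = f(x_list[j], yk)
        end
    end
    return buf
end

t = range(0, stop=3, length=200)
r = range(1, stop=7, length=200)
K = kernelmatrix(kernel_entry, t, r)
```

Because t is passed as the first list, K keeps the original (length(t), length(r)) layout while the inner loop still follows memory order. Whether threading or SIMD macros help on top of this would need to be measured.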
