Improving speed of convolutions of same sized arrays

Hi everyone,

I have been trying to improve the performance of a model that I am using to fit experimental data with LsqFit. Unfortunately, after calculating the theoretical solution it must be convolved with a same sized array instrument response (experimentally measured) and then fit with LsqFit. (which has to iteratively call it multiple times) The convolution step is about 10x longer than producing the theoretical curve (benchmarks shown below) and then LsqFit takes about 100x longer to converge.

Code

using DSP, BenchmarkTools

## theoretical model
function model(t, β::Array{Float64,1}, ρ::Float64)

    Rt1 = Array{Float64}(undef, length(t))
    Rt2 = Array{Float64}(undef, length(t))
	A::Float64 = - 1/β[2]
    B::Float64 = 2/β[2] + 1/β[2]

	Threads.@threads for n in eachindex(t)
		
   		Rt1[n] = -exp(-(ρ^2/(t[n])) - β[1]*t[n])
   		Rt1[n] = Rt1[n]/(t[n]^(5/2))

   		Rt2[n] = A*exp(-(A^2/(t[n]))) - B*exp(-(B^2/(t[n])))
		
	end
	
    Rt = replace!(Rt1.*Rt2, NaN => 0)
    return Rt./maximum(Rt)
 
 end

## function that is fit in LsqFit that convolves the model with measurement
 function conv_Rt(t, β::Array{Float64,1},  ρ::Float64, It)

	Rt = Array{Float64}(undef, length(t))
	convRt = Array{Float64}(undef, 2*length(t) - 1)

	Rt = model(t, β, ρ)

	convRT = conv(It, Rt) # slowest line
    convRt = convRt./maximum(convRt)
    
end
Things I’ve tried
  1. Take advantage that one array is always a constant (and that both are always the same length) by using a planned FFT and feeding that into the conv_RT function. I could then use mul! to compute the product with the plan and vector which is faster for a single fft calculation, though I have to make a total of two products and then the inverse transform for the convolution which ended up being slower than just the in house conv function. I’d imagine conv is calling the highly optimized FFTW library…
  2. Try to decrease the array size in the convolution as much as possible. This can be done but trickier because it would have to be a different time window than what we are fitting with LsqFit.

Benchmarks and Test Usage

t = 0:0.01:10
It = [Vector(1:length(t)/2); length(t)/2+1:-1:1]

@benchmark model(t, [0.1,10.0], 1.0)
#=BenchmarkTools.Trial: 
  memory estimate:  40.97 KiB
  allocs estimate:  66
  --------------
  minimum time:     15.443 μs (0.00% GC)
  median time:      17.000 μs (0.00% GC)
  mean time:        18.842 μs (4.91% GC)
  maximum time:     1.497 ms (95.80% GC)
  --------------
  samples:          10000
  evals/sample:     1
=#

@benchmark conv_Rt(t, [0.1,10.0],1.0,  It)
#=BenchmarkTools.Trial: 
  memory estimate:  166.73 KiB
  allocs estimate:  167
  --------------
  minimum time:     149.688 μs (0.00% GC)
  median time:      155.159 μs (0.00% GC)
  mean time:        183.873 μs (2.23% GC)
  maximum time:     6.121 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
=#

Thanks for all the help!

-Michael

1 Like

I’d imagine conv is calling the highly optimized FFTW library…

Yes but nothing fancy:

2 Likes

You should be able to do much better than the conv function if this is performance critical, by calling FFTW yourself. The reason is that you can precompute the FFT “plans” (and spend more time on the planning, and play with in-place vs. out-of-place, and use r2c transforms to exploit real data) and preallocate intermediate output arrays.

5 Likes

Excellent. Thank you for linking this.

-Michael

I thought so! My implementation trying to do this must have not been ideal. I will keep pushing on the FFT plans to see if I can improve this any. Thanks for the advice!

You may also try LoopVectorization on the loop instead of threads, chances are that the overhead of threading is high and the SIMD acceleration from LoopVectorization might end up faster. Also, if you do use LoopVectorization, you might be able to get another factor 2 speedup by using Float32.

4 Likes

Thanks everyone for the suggestions! I went ahead and used planned_rfft and preallocated all of the arrays before the fitting function as was suggested and it definitely helped!

##before benchmark of the convolution of the model
julia> @benchmark conv_DT(data.t[ind1:ind2], [0.1,10.], data)
BenchmarkTools.Trial: 
  memory estimate:  628.78 KiB
  allocs estimate:  183
  --------------
  minimum time:     296.264 μs (0.00% GC)
  median time:      306.788 μs (0.00% GC)
  mean time:        376.128 μs (4.19% GC)
  maximum time:     6.130 ms (18.48% GC)
  --------------
  samples:          10000
  evals/sample:     1

# after utilizing planned real fft
julia> @benchmark conv_DT(data.t[ind1:ind2], [0.1,10.], data, vpad, RtDT, pl, vpadRfft, upadRfft, u_vpadRfft, convpad)
BenchmarkTools.Trial: 
  memory estimate:  401.19 KiB
  allocs estimate:  133
  --------------
  minimum time:     193.093 μs (0.00% GC)
  median time:      199.485 μs (0.00% GC)
  mean time:        234.795 μs (4.33% GC)
  maximum time:     18.056 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

This resulted in about 7ms speed up when utilizing the LsqFit function!!

#old fitting method
julia> @benchmark getfit1(data,modelp)
BenchmarkTools.Trial: 
  memory estimate:  44.28 MiB
  allocs estimate:  13148
  --------------
  minimum time:     31.475 ms (3.62% GC)
  median time:      43.797 ms (2.56% GC)
  mean time:        47.030 ms (2.42% GC)
  maximum time:     106.906 ms (0.99% GC)
  --------------
  samples:          107
  evals/sample:     1
#with the pre planned
julia> @benchmark getfit(data,modelp)
BenchmarkTools.Trial: 
  memory estimate:  28.76 MiB
  allocs estimate:  9720
  --------------
  minimum time:     20.463 ms (0.00% GC)
  median time:      36.942 ms (3.08% GC)
  mean time:        39.469 ms (2.01% GC)
  maximum time:     123.894 ms (1.15% GC)
  --------------
  samples:          127
  evals/sample:     1

I’ll keep pressing on to reduce further. Thanks for all the help everyone!

-Michael

1 Like