What’s the optimal way to compute repeated convolutions with large 1D arrays? the filters, (g_x,g_y, & g_z) stay the same each iteration, but the “signals” change. To be clear, the computation of the f’s is fast, the convolutions are what is slowing me down.
for i in 1:iters
#input function for x
f_x .= x_fun(s,ẑ, a_xx,b_zx) # 2500001-element Vector{Float64}
#input function for y
f_y .= y_fun(x̂,new_z, a_xy, b_zy) # 2500001-element Vector{Float64}
#input function for z
f_z .= z_fun(ŷ, a,b, z_p) # 2500001-element Vector{Float64}
#Get estimates
x̂ .= fftfilt(g_x,δ*f_x)[1:end] # g_x: 1001-element Vector{Float64}
ŷ .= fftfilt(g_y,δ*f_y)[1:end] # g_y: 1001-element Vector{Float64}
ẑ .= fftfilt(g_z,δ*f_z)[1:end] # g_z: 100001-element Vector{Float64}
end
Thanks,
DS
This isn’t going to be the improvement you need, but you can omit [1:end]
because it is completely unnecessary. It’s making a copy of the filter result before the element-wise assignment to the left-hand side. It costs time and increases memory pressure and GC overhead.
You could use fftfilt!
to put the filter results directly into your left-hand variables and avoid fftfilt
’s internal result allocation.
There are also small bits of redundant work in fftfilt
. Each call redundantly computes the optimal FFT length, plans FFTs, and applies the FFT to the g
filters. These could be extracted, but they are still small compared to the multitude of FFT/inverse FFT applications on the overlapped windows of the signals. Likely the biggest possible gain would be to parallelize those.
Ahh. Thx for the tips, Jeff_Emanuel! Will implement. Didn’t realize the [1:end] was an issue. Using threads seems like an obvious choice - will give it a shot.
What about precomputing the FFT’s for the filters outside of the loop with FFTW.jl? Or using Metal.jl (I have a mac)?
Thanks again,
DS
Yes, I would definitely recommend this — that will save you one of the 3 FFTs (i.e. ≈ 33%). Also precomputing an FFTW “plan” (and adding the FFTW.MEASURE
or FFTW.PATIENT
option, which takes longer to create but runs faster). Use real-to-complex and complex-to-real FFTs if your data are real.
(And if you have some freedom to choose the transform length, choose a highly composite size, i.e. factors of 2,3,5. Using a length 2500001
, which has a large prime factor of 18797
, will be many times slower for FFTs. If you are doing a linear convolution you should be zero-padding anyway, which lets you choose the length to pad to.)
Note that δ*f_x
allocates a new array, so I would include δ .* …
in the definition of your arrays. And x̂ .= fftfilt(...)
still allocates a new array for the rhs. But if you use a precomputed-FFT plan, you can act it in-place on precomputed outputs with mul!
.
1 Like