Repeated Convolutions With Large 1D Arrays

What’s the optimal way to compute repeated convolutions with large 1D arrays? the filters, (g_x,g_y, & g_z) stay the same each iteration, but the “signals” change. To be clear, the computation of the f’s is fast, the convolutions are what is slowing me down.

for i in 1:iters

    #input function for x
    f_x .= x_fun(s,ẑ, a_xx,b_zx) # 2500001-element Vector{Float64}
    #input function for y
    f_y .= y_fun(x̂,new_z, a_xy, b_zy) # 2500001-element Vector{Float64} 
    #input function for z 
    f_z .= z_fun(ŷ, a,b, z_p) # 2500001-element Vector{Float64}

    #Get estimates
    x̂ .= fftfilt(g_x,δ*f_x)[1:end] # g_x: 1001-element Vector{Float64} 
    ŷ .= fftfilt(g_y,δ*f_y)[1:end] # g_y: 1001-element Vector{Float64}
    ẑ .= fftfilt(g_z,δ*f_z)[1:end] # g_z: 100001-element Vector{Float64}
    
end

Thanks,

DS

This isn’t going to be the improvement you need, but you can omit [1:end] because it is completely unnecessary. It’s making a copy of the filter result before the element-wise assignment to the left-hand side. It costs time and increases memory pressure and GC overhead.

You could use fftfilt! to put the filter results directly into your left-hand variables and avoid fftfilt’s internal result allocation.

There are also small bits of redundant work in fftfilt. Each call redundantly computes the optimal FFT length, plans FFTs, and applies the FFT to the g filters. These could be extracted, but they are still small compared to the multitude of FFT/inverse FFT applications on the overlapped windows of the signals. Likely the biggest possible gain would be to parallelize those.

Ahh. Thx for the tips, Jeff_Emanuel! Will implement. Didn’t realize the [1:end] was an issue. Using threads seems like an obvious choice - will give it a shot.

What about precomputing the FFT’s for the filters outside of the loop with FFTW.jl? Or using Metal.jl (I have a mac)?

Thanks again,

DS

Yes, I would definitely recommend this — that will save you one of the 3 FFTs (i.e. ≈ 33%). Also precomputing an FFTW “plan” (and adding the FFTW.MEASURE or FFTW.PATIENT option, which takes longer to create but runs faster). Use real-to-complex and complex-to-real FFTs if your data are real.

(And if you have some freedom to choose the transform length, choose a highly composite size, i.e. factors of 2,3,5. Using a length 2500001, which has a large prime factor of 18797, will be many times slower for FFTs. If you are doing a linear convolution you should be zero-padding anyway, which lets you choose the length to pad to.)

Note that δ*f_x allocates a new array, so I would include δ .* … in the definition of your arrays. And x̂ .= fftfilt(...) still allocates a new array for the rhs. But if you use a precomputed-FFT plan, you can act it in-place on precomputed outputs with mul!.

1 Like