Repeated Convolutions With Large 1D Arrays

What’s the optimal way to compute repeated convolutions with large 1D arrays? The filters (g_x, g_y, and g_z) stay the same each iteration, but the “signals” change. To be clear, computing the f’s is fast; the convolutions are what is slowing me down.

for i in 1:iters

    #input function for x
    f_x .= x_fun(s,ẑ, a_xx,b_zx) # 2500001-element Vector{Float64}
    #input function for y
    f_y .= y_fun(x̂,new_z, a_xy, b_zy) # 2500001-element Vector{Float64} 
    #input function for z 
    f_z .= z_fun(ŷ, a,b, z_p) # 2500001-element Vector{Float64}

    #Get estimates
    x̂ .= fftfilt(g_x,δ*f_x)[1:end] # g_x: 1001-element Vector{Float64} 
    ŷ .= fftfilt(g_y,δ*f_y)[1:end] # g_y: 1001-element Vector{Float64}
    ẑ .= fftfilt(g_z,δ*f_z)[1:end] # g_z: 100001-element Vector{Float64}
    
end

Thanks,

DS

This isn’t going to be the improvement you need, but you can omit the [1:end], which is completely unnecessary: it makes a copy of the filter result before the element-wise assignment to the left-hand side, costing time and increasing memory pressure and GC overhead.

You could use fftfilt! to put the filter results directly into your left-hand variables and avoid fftfilt’s internal result allocation.
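
For example, something along these lines (a minimal sketch; scaling the fixed filter by δ once is an extra step I’ve added, since g_x never changes across iterations):

# outside the loop: fold the constant δ into the fixed filter once
g_xδ = δ .* g_x

# inside the loop: write the filtered result directly into x̂,
# with no [1:end] copy and no internal result allocation
fftfilt!(x̂, g_xδ, f_x)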

There is also some redundant work inside fftfilt: each call recomputes the optimal FFT length, plans the FFTs, and transforms the g filters. These steps could be hoisted out, but they are still small compared to the many FFT/inverse-FFT applications on the overlapped windows of the signals. Likely the biggest possible gain would be to parallelize those.


Ahh. Thx for the tips, Jeff_Emanuel! Will implement. Didn’t realize the [1:end] was an issue. Using threads seems like an obvious choice - will give it a shot.

What about precomputing the FFTs for the filters outside of the loop with FFTW.jl? Or using Metal.jl (I have a Mac)?

Thanks again,

DS

Yes, I would definitely recommend this — it will save you one of the 3 FFTs per convolution (i.e. ≈ 33%). I’d also recommend precomputing an FFTW “plan” (with the FFTW.MEASURE or FFTW.PATIENT option, which makes the plan take longer to create but run faster). Use real-to-complex and complex-to-real FFTs if your data are real.

(And if you have some freedom to choose the transform length, choose a highly composite size, i.e. one whose only factors are 2, 3, and 5. Using a length of 2500001, which has a large prime factor of 18797, will be many times slower for FFTs. If you are doing a linear convolution you should be zero-padding anyway, which lets you choose the length to pad to.)
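
For instance, a sketch of the padding step using your f_x and g_x (nextprod is in Base):

n_min = length(f_x) + length(g_x) - 1  # minimum length for the full linear convolution
n_pad = nextprod((2, 3, 5), n_min)     # next size ≥ n_min whose only factors are 2, 3, 5
xp = zeros(n_pad)
xp[1:length(f_x)] .= f_x               # zero-padded copy of the signal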

Note that δ*f_x allocates a new array, so I would include the δ .* … in the definition of your arrays. And x̂ .= fftfilt(...) still allocates a new array for the right-hand side. But if you use a precomputed FFT plan, you can apply it in-place to preallocated outputs with mul!.
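
A minimal sketch of that pattern (plan and buffer names are placeholders):

using FFTW, LinearAlgebra

plan = plan_rfft(xp; flags=FFTW.MEASURE)  # plan once, outside the loop, for the padded signal xp
X = plan * xp                             # allocate the output buffer once

# ... then inside the hot loop, reuse both:
mul!(X, plan, xp)                         # in-place transform, no new allocation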


Thank you, @stevengj and @Jeff_Emanuel, for the thoughtful responses. Incredibly helpful. Really appreciate it. I was able to speed up the code considerably by following your advice and using threads.

This code is considerably faster:

FFTW.set_num_threads(Threads.nthreads() - 3) # 3 threads will be used for inner loop below

    for i in 1:iters        
        #Get estimates using threads to multiprocess
        Threads.@threads for j in 1:3 
            if j == 1 
                #input function for x
                f_x .= x_fun(s,ẑ,a_xx,b_zx) # f_x: 2_500_000-element Vector{Float64}                
                fftfilt!(x̂,g_x,f_x) # g_x: 1000-element Vector{Float64}
            elseif j == 2
                #input function for y
                f_y .= y_fun(x̂,ẑ, a_xy,b_zy) # 2_500_000-element Vector{Float64} 
                fftfilt!(ŷ,g_y,f_y) # g_y: 1000-element Vector{Float64}                
            else 
                #input function for z
                f_z .= z_fun(ŷ,a,b,z_p) # 2_500_000-element Vector{Float64}
                fftfilt!(ẑ,g_z,f_z) # g_z: 100000-element Vector{Float64}
            end
        end
    end

Interestingly, I didn’t notice any difference when I changed the vector lengths. I also haven’t been able to see much of an improvement from computing the FFTs for the filters outside of the loops. I am sure I am doing something (many things…) wrong, but I have no idea what. Any advice would be greatly appreciated.

I have included a comparison example where, despite using FFTW.PATIENT and precomputing the FFT for the filter outside the loop, I am unable to beat fftfilt:

using BenchmarkTools, DSP, FFTW, LinearAlgebra

end_time = 25000 - .01 # end time for signal
filt_end = 1000 - .01 # end time for filter
δ = .01 # granularity
g = 100.0*exp.(-100.0*(0.0:δ:filt_end)); # filter, length 100_000

# Method 1 Precompute Patient Plan:
x = sin.(0.0:δ:end_time).^2; #signal, length 2_500_000

lx = length(x) # length of signal: 2_500_000

# Compute the full convolution length
N = lx + length(g) - 1; # full linear-convolution length

# Create zero-padded arrays:
xp = zeros(Float64, N); 
gp = zeros(Float64, N);

#populate padded arrays
xp[1:lx] .= x;
gp[1:length(g)] .= g;


# Create FFT plans with PATIENT flag:
plan_x = plan_rfft(xp, flags=FFTW.PATIENT);
plan_g = plan_rfft(gp, flags=FFTW.PATIENT);

# Precompute FFTs:
G_fft = plan_g * gp;
X_fft = plan_x * xp;

conv_full_x = irfft(X_fft .* G_fft, N)[1:lx]; # compute convolution

@benchmark begin
    for i in 1:10
        #update x
        xp[1:lx] .+= sin.(0.0:δ:end_time).^2

        #calculate x fft
        mul!(X_fft,plan_x, xp)
        
        #inverse fft
        conv_full_x .= irfft(X_fft .* G_fft, N)[1:lx];
    end
end


#Method 2 fftfilt!:
x_filt = sin.(0.0:δ:end_time).^2 #signal array length 2_500_000

conv_full_x_filt = fftfilt(g,x_filt); # compute convolution

@benchmark begin
    for i in 1:10   
        #update x
        x_filt .+= sin.(0.0:δ:end_time).^2
        
        #convolve 
        fftfilt!(conv_full_x_filt,g,x_filt)
    end
end

Thanks,

DS

Don’t benchmark loops in global scope. Performance-sensitive code should be in a function. This is the number one rule when using Julia — it is literally the first performance tip.

You are using N = 2599999, which is a prime number. You should zero-pad to the next highly composite size ≥ this.

You should also precompute the irfft plan. (Also, I would use a brfft instead, and include the 1/N normalization factor in the precomputation of G_fft.)

The .= is not helping you here — there’s nothing for it to fuse with on the right-hand side. You’re still allocating lots of temporary arrays (for X_fft .* G_fft, and for the slice [1:lx], since you’re not using views). Read the Julia performance tips.
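
Concretely, the inner loop could look something like this (a sketch; conv_buf and plan_inv are placeholders, with plan_inv a precomputed plan_brfft and the 1/N factor folded into G_fft):

mul!(X_fft, plan_x, xp)               # forward transform into a preallocated buffer
X_fft .*= G_fft                       # in-place frequency-domain multiply, no temporary
mul!(conv_buf, plan_inv, X_fft)       # unnormalized inverse transform, also in place
conv_full_x .= @view conv_buf[1:lx]   # a view avoids copying the slice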


@stevengj thank you for the lesson(s)! Code is significantly faster. Appreciate your time and patience. Incredibly helpful.

DS

Code for example discussed:

using AbstractFFTs, BenchmarkTools, DSP, FFTW, LinearAlgebra

function fftw_test()
    end_time = 25000 - .01 
    filt_end = 1000 - .01 
    δ = .01
    g = 100.0*exp.(-100.0*(0.0:δ:filt_end)); #Filter array, length 100_000

    x = sin.(0.0:δ:end_time).^2; #Input signal array, length 2_500_000
    lx = length(x) #2_500_000

    #Pad size, greater than min required size, but factor of small primes 
    N = lx + length(g); 

    #Create zero-padded vectors:
    xp = zeros(N); 
    gp = zeros(N);

    xp[1:lx] .= x;
    gp[1:length(g)] .= g;

    #Create FFT plans with PATIENT flag:
    plan_x = plan_rfft(xp, flags=FFTW.PATIENT);
    plan_g = plan_rfft(gp, flags=FFTW.PATIENT);

    fft_length = Int(N/2+1) #Length of FFT vectors
    
    #Vectors for FFTs
    G_fft = Vector{ComplexF64}(undef,fft_length) 
    X_fft = Vector{ComplexF64}(undef,fft_length)

    #Compute FFTs:
    mul!(G_fft,plan_g,gp/N)
    mul!(X_fft,plan_x,xp)

    #Create inverse FFT plan with PATIENT flag:
    plan_xi = plan_brfft(X_fft,N, flags=FFTW.PATIENT)
   
    #Vector to store solutions
    conv_full_xb = Vector{Float64}(undef,N)

    #Convolution 
    X_fft .*= G_fft
    
    #inverse FFT
    mul!(conv_full_xb,plan_xi,X_fft)

    for i in 1:10
        #Update 
        xp[1:lx] .+= x.^2

        #Perform FFT
        mul!(X_fft,plan_x,xp)
        
        #Convolve
        X_fft .*= G_fft
        
        #Inverse FFT
        mul!(conv_full_xb,plan_xi,X_fft)
    end

    return conv_full_xb[1:lx]

end

This (fft_length = Int(N/2+1)) is not a safe way to calculate the (integer) length: it does a floating-point division followed by conversion to Int, which will throw an InexactError for odd N.

Julia supports native integer arithmetic, which is faster and better for this use case. Here you should use div to perform integer division:

fft_length = div(N, 2) + 1

or

fft_length = (N >> 1) + 1

This gives a valid FFT length for both even and odd N (though even-length rfft is faster anyway).


Alternatively, you could do

resize!(conv_full_xb, lx)

to avoid allocating new memory.


DNF, thank you! This makes sense.

Would @view conv_full_xb[1:lx] also be acceptable?

I guess… But then you are leaving a ‘hidden’ chunk of memory around under the hood, which seems a bit less pleasing to me. With resize! you free up that piece of memory, and you have a plain Vector, which is perhaps ‘cleaner’. What if you pass the vector on to some code that only accepts Vector?

It’s possibly not a very big deal, though.


Makes sense! Thanks!

Regarding fft_length: The nextprod function is perfect for finding a good FFT length. E.g.

fft_length = nextprod((2, 3, 5), N ÷ 2 + 1)

@PeterSimon, wouldn’t it be N = nextprod((2, 3, 5), N ÷ 2 + 1)? If so, it actually seems to increase the time slightly. I assume the added “efficiency” is outweighed by the increased amount of computation (N is already divisible by 2 and 5). But that’s just a guess; as you can see from this thread, I really don’t know what the hell I am talking about…

For specific filters, like X and G in the example code, there are tailored algorithms which would be closer to O(N) instead of the FFT-based O(N log N), possibly with better constants.
So if the filters are general, FFT is definitely the way to go; otherwise this lemon might be squeezed more.

Sorry, you’re correct.

Your original value of N is 2_600_000, so assuming that your expression N ÷ 2 + 1 is the minimum allowable length of the padded arrays, this is 1_300_001, and N = nextprod((2, 3, 5), N ÷ 2 + 1) yields a new value of N = 1_310_720. I am surprised to hear that using this length instead of 2_600_000 is slower, since it’s about half as large. And checking the factors:

julia> using Primes

julia> factor(2_600_000)
2^6 * 5^5 * 13

julia> factor(1_310_720)
2^18 * 5

it still seems like the result of nextprod should be faster, since it only involves 2 prime factors, rather than 3. Maybe others can explain this effect.
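
One way to isolate the transform cost at the two lengths (untested sketch):

using FFTW, BenchmarkTools

for n in (2_600_000, 1_310_720)
    x = rand(n)
    p = plan_rfft(x; flags=FFTW.MEASURE)
    @btime $p * $x   # time a single forward transform at this length
end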


@Dan cool! Where might I find these tailored algos?

With a fixed kernel it’s trivial to make a direct implementation of the convolution that runs in time linear in the signal length, but at these sizes it won’t be competitive with an FFT implementation. To improve the speed you need to look into IIR filters. Those are generally more difficult to design, but if g = 100.0*exp.(-100.0*(0.0:δ:filt_end)) is a representative filter you are in luck, as this is the simplest of all IIR filters.
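
For example, something like this (a sketch; the coefficients follow from the recurrence y[n] = 100*x[n] + exp(-100*δ)*y[n-1], which convolution with this exponential kernel satisfies):

using DSP

δ = 0.01
g = 100.0 .* exp.(-100.0 .* (0.0:δ:1000 - δ))  # the FIR kernel from above
x = randn(2_500_000)

b = [100.0]                  # feedforward coefficient
a = [1.0, -exp(-100.0*δ)]    # feedback: y[n] = 100*x[n] + exp(-100δ)*y[n-1]

y_iir = filt(b, a, x)        # one-pole recursion: O(N) in the signal length
y_fir = fftfilt(g, x)        # FFT-based FIR convolution, for comparison
maximum(abs, y_iir .- y_fir) # expected to be tiny, since the truncated tail is ≈ 0

The cost per output sample is constant, independent of the kernel length.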


The IIR approach is likely still best, but it seems excessive to use a kernel of length 100000 when 99254 of the coefficients are exactly zero and all but the first 29 are smaller than 1e-10. Maybe this isn’t a real-world example.

Just to make things concrete regarding a faster filter, here is an example:

N = 1000   # length of vector
n = 100   # length of kernel

v = rand(N)  # a bit of random data to process

# naive implementation of convolution (delta = 0.1) 
ema = [sum(v[i-j]*exp(-0.1*j) for j in 0:(n-1)) for i in n:N];

@show ema[1]
@show length(ema)

# alternative implementation without FFT, using O(N) time
ema2 = similar(ema);
ma = sum(v[n-j]*exp(-0.1*j) for j in 0:(n-1))  # initial window sum, same as ema[1]

ema2[1] = ma

for i in 2:(N-n+1)
    global ma
    ma -= v[i-1]*exp(-0.1*(n-1))  # drop the oldest sample (it carries the largest decay)
    ma *= exp(-0.1)               # age the remaining weights by one step
    ma += v[i+n-1]                # add the newest sample with weight 1
    ema2[i] = ma
end

@show ema2[1]
@show ema2[2]
@show ema[2]
@show ema ≈ ema2

BTW EMA stands for exponential moving average, which is a common name for this kernel.

The other kernel in the OP can also have a tailored algorithm.