Performance Help on Large Matrix Manipulation

jleman · September 21, 2020, 6:20pm

I have a large array of matricies and need to apply inverse FFT as shown in the attached diagram. I do this by restructuring the data using getindex . After completing the inverse FFT I then recreating the original matrix structure using getindex and reshape. Any suggestions for performance improvements? Its currently way too slow. I have used StaticArrays.jl in the past, but these matrices are too large.

using LinearAlgebra
using FFTW
using BenchmarkTools

function computeBlock(A)
    rows = size(A[1],1)  #number of rows of A for later use in reshape     
    
    B = [zeros(4*length(A)) for _ in 1:length(A[1])]  #preallocate B array (including zero pad)
    
    pad = zeros(length(A)+Int(floor(length(A)/2)))  #create arrays for zero padding
    
    for i=1:length(A[1])  #add zero padding then shift for ifft
       B[i] = [pad;getindex.(A,i);pad]|>fftshift
    end

    C = real.(ifft.(B))  #take ifft and extract real part
    
    D = [zeros(size(first(A))) for _ in 1:4*length(A)] #preallocate D array
    
    for i = 1:length(C[1])  #reorganize time domain data into stacked matrices
          D[i] = reshape(getindex.(C,i),(rows,:))
    end

    return D
end

dim = 10 #would like this to be as high as 3000
frequencyDomainData = [rand(18,dim) for _ in 1:2^11]
@btime timeDomainData = computeBlock(frequencyDomainData)

stevengj · September 21, 2020, 6:50pm

As I understand it, you have an LxMxN dataset of Fourier components A, that you want to zero-pad into L’xMxN (with L’=4L) and then ifft into an L’xMxN array C (doing trigonometric interpolation)?

Realize that you are making tons of intermediate copies of your data, which is not very efficient. e.g. B[i] = [pad;getindex.(A,i);pad]|>fftshift allocates at least 2 temporary arrays and discards your previously preallocated B[i] array in favor of the new array on the right-hand side.

I would tend to drop all of the reshaping, and do something like:

use a 3d array rather than an array of arrays
use a precomputed plan with plan_fft, or probably plan_fft! to work in-place
copy into a preallocated zero-padded 3d array C using a single in-place operation on A
use the dims keyword of plan_fft to transform along 1 direction of your 3d array C all in a single call.

I notice that you are doing real(ifft(x)), which is discarding data unless the input frequency-domain data is somewhat special (i.e. the inputs equivalent to an FFT of real data). Either way, you can take advantage of the fact that you only want the real parts of the output by choosing a more specialized transform (e.g. an irfft or an discrete cosine transform via FFTW.r2r.)

jleman · September 21, 2020, 9:14pm

Thank you for the response! The only thing I am doing after the ifft is rearranging the data back into an array of matrices. A couple of follow-up questions

I’m struggling to find examples of plan_fft(usage), but if I understand the documentation correctly I can do something like the code below - does this look right (neglecting the zero padding for the moment)?

A = rand(ComplexF64,(4,4,10))  #created a 3D array of complex numbers
p = plan_ifft(A,3) #apply fft to arrays along the 3rd dimension
D = p*A #result is another 3D array

After the ifft, I need to do more operations on each matrix in the stack (transposition, inversion etc.). In the past it has seemed faster to iterate over an array of matrices to avoid slicing. Is there a way to do that on submatrices within the 3D dataset without incurring significant allocations?
The actual complex input data of the of the frequency domain data is such that the imaginary part of the ifft result is effectively zero (i.e. odd symmetry in the imaginary component). I don’t see a plan_irfft() option - are you referring to a different method or package?

Oscar_Smith · September 21, 2020, 9:30pm

The problem your are seeing with slices comes because they allocate. Check out @views. As of 1.5, this should be as fast (or faster) than using an array of arrays.

stevengj · September 21, 2020, 10:24pm

If your complex input data has conjugate symmetry (real part is symmetric, imaginary part is antisymmetric), then you only need to store half of it. Zero-padding also becomes easier because you only need to zero-pad one side, and there is no point in an fftshift.

FFTW exports a plan_irfft function (or plan_brfft if you don’t need the 1/N normalization); I’m not sure why you don’t see it?

Note that you can use the FFTW.mul! function (equivalent to mul! from the LinearAlgebra package to apply a plan with a precomputed output array.

jleman · September 19, 2020, 1:48am

I have a frequency domain data set organized as shown in the attached image (set up as an array of matrices like: [zeros(40,4000) for _ in 1:4096] )

I need to create the time domain data set at right by applying the inverse FFT to arrays consisting of elements in the same position of each of the matrices in the original array. For example, the red arrow shows the array grouping for the [1,1] matrix position of the frequency domain data set. This operation is performed for each matrix element position.

I prefer to have the data organized in this way for other reasons that call for iterating over each matrix and performing operations such as inversion. I could use the getindex function to group the arrays:

array = getindex.(Matrix,position)

However, it seems like this would be cumbersome since I would then have to reassemble the data set. Does anyone know of a better approach (prioritizing speed)?

jleman · September 21, 2020, 11:09pm

See this thread: Performance Help on Large Matrix Manipulation

stevengj · September 22, 2020, 12:35am

If you used a 3d array, you could still iterate over 2d views to do linear algebra on slices.

Otherwise, you don’t have much choice but to make a copy as you transform each slice, e.g. by slicevector .= getindex.(A, index) (using .= to write in-place into a preallocated slicevector, since FFTW requires a strided-vector memory layout.

jleman · September 22, 2020, 3:04am

So here is an example code block that creates a frequency domain array with real part symmetric and imaginary part antisymmetric for frequencies ranging from -3 to 3.

When you say I only need to store half the array, are you saying I can just compute the frequency domain array from 0 to 3 and supply that to the plan_irfft() function? What would the syntax look like for that?
Also, Is the output array likewise only half the original result? If so, how do I get the full set back (the set I would get if I used the full frequency range)?
Finally, how does the FFTW.mul! function you mention work? This was the only thread on Discourse that even mentions this function.

Thanks

using Plots
using FFTW
using SpecialFunctions
using BenchmarkTools

#create the complex frequency domain input array, fd. 
N = 2^8 #set to power of 2 for improved ifft efficiency
ω = range(-3,3, length = N)  #note: no ω=0 term (intentional) - could set to zero
fd =zeros(Complex{Float64},length(ω)) 
fd .= -10e-6*exp.(1im .* ω * 5) .* besselk.(0,abs.(ω)*5.0) ./ besselk.(0,abs.(ω)*0.05)

#create ifft plan and execute ifft
p = plan_ifft(fd,1)
td = p*(fd|>fftshift)

#plot output=====================================================
fdplot = plot(ω,real.(fd),label="real",lw=2)
plot!(ω,imag.(fd),label="imag",lw=2)
plot!(title="Frequncy Domain Input Data (pre-shift)")

tdplot = plot(1:N,real.(td|>fftshift),lw=2,label="real",legend=:right)
plot!(1:N,imag.(td|>fftshift),lw=2,label="imag")
plot!(title="Time Domain Output Data (post-shift)")

#imaginary component becomes negligible as # of samples increases

display(fdplot)
display(tdplot)

Plots should look like those below. Notice the imaginary part of the output time domain plot. This is an artifact of the sampling. This becomes negligible at higher sampling rates and I will discard the imaginary content.

stevengj · September 22, 2020, 3:10am

Yes.

You get the full real array back. See also The 1d Real-data DFT (FFTW 3.3.10)

See Linear Algebra · The Julia Language! — the plan (which acts like a linear operator, takes the place of the matrix A.

jleman · September 23, 2020, 3:54am

The above suggestions were very helpful. The approach of operating on the 3D array and using plan_ifft() sped things up a lot. I now have the following code block:

using LinearAlgebra
using FFTW
using BenchmarkTools
using SpecialFunctions

#function used to populate frequency domain 3D matrix "A"
bessk0(ω) = -10e-6*exp(1im*ω*rand(-1:0.3:1))*besselk(0,abs(ω)*rand(3:0.1:5))/besselk(0,abs(ω)*0.05)

function dim3ifft(ω,N)
    dim2 = 3000
    @time pad = zeros(18,dim2,3072)    
    @time A = zeros(Complex{Float64},(18,dim2,length(ω)))
    @time for i = 1:18, j = 1:dim2, k = 1:length(ω)
              A[i,j,k] = bessk0(ω[k])
          end
    @time conj!(A)
    @time B = reverse(A,dims=3)
    @time C = cat(pad,B,A,pad,dims=3)

    @time p = plan_ifft(C,3)
    @time D = p*C
    
    return D
end

N = 2^11
Ndiv2 = Int(floor(N/2))
ω = range(-3,3, length = N)[1:Ndiv2]
D = dim3ifft(ω,N)
println("done")

When I run the above I get the following times (I’m aware @btime is better, but this is indicative):

  0.466709 seconds (2 allocations: 1.236 GiB, 5.10% gc time)
  0.298569 seconds (2 allocations: 843.750 MiB, 0.17% gc time)
 63.657388 seconds
  0.070921 seconds
  0.329374 seconds (8.19 k allocations: 843.906 MiB, 5.34% gc time)
  2.634617 seconds (101 allocations: 6.592 GiB, 5.34% gc time)
  0.000218 seconds (38 allocations: 3.156 KiB)
 19.107431 seconds (2 allocations: 6.592 GiB, 0.72% gc time)

The ifft() function takes about 20 seconds. I’m good with that - I’m not sure I can do much better. I tried using irfft() but aliasing in the result was too high unless I used a much larger array. However, there are a couple of other areas where I wonder if I could shave a little time:

The call to bessk0() to populate Array A clearly takes the longest. Any suggestions for how to improve that function call?
Pre-allocating the arrays only takes about half a second, but still longer than I expected. Suggestions here?
Finally, the concatenate operation, cat(), takes longer than I would expect. Suggestions here?

johnh · September 23, 2020, 8:51am

Out of interest, what field are you working in. Seismics processing?

rafael.guerra · September 23, 2020, 1:14pm

To populate the array A faster in your example, I think you could do with a single loop:

@time for k = 1:length(ω)
          A[:,:,k] .= bessk0(ω[k])
      end

stevengj · September 23, 2020, 1:50pm

Then you are misunderstanding the function. irfft is exactly equivalent (up to roundoff errors) to real(ifft) if you have conjugate-symmetric data:

julia> z = [4, 3+15im, 4+7im, -2+3im, 6+1im, -5, 6-1im, -2-3im, 4-7im, 3-15im]
10-element Array{Complex{Int64},1}:
  4 + 0im
  3 + 15im
  4 + 7im
 -2 + 3im
  6 + 1im
 -5 + 0im
  6 - 1im
 -2 - 3im
  4 - 7im
  3 - 15im

julia> ifft(z)
10-element Array{Complex{Float64},1}:
                 2.1 + 0.0im
  -2.997615643301253 + 0.0im
 -3.0005626553354823 + 0.0im
 -1.7532205441845137 + 0.0im
 -2.3175772855077286 + 0.0im
                 2.7 + 0.0im
 -0.5476702987421247 + 0.0im
  1.9824001509345763 + 0.0im
   3.265810239585335 + 0.0im
    4.56843603655119 + 0.0im

julia> irfft(z[1:length(z)÷2+1], 10)
10-element Array{Float64,1}:
  2.1
 -2.997615643301252
 -3.000562655335482
 -1.7532205441845137
 -2.3175772855077286
  2.7
 -0.5476702987421247
  1.9824001509345763
  3.265810239585335
  4.56843603655119

stevengj · September 23, 2020, 1:56pm

You don’t need to use cat. If you use irfft correctly (so that you don’t need to append the conjugate), just allocate C = zeros(...) and assign a subarray with

C[...] = @view A[...]

Note also that

conj!(A)
B = reverse(A,dims=3)
C = cat(pad,B,A,pad,dims=3)

seems wrong. It doesn’t have conjugate symmetry at all because you conjugate A and then assign it (flipped) to B, rather than conjugating only B or only A. Furthermore, as shown in my example above, you shouldn’t duplicate the DC (0th) or Nyquist (middle) frequencies for the DFT’s conjugate symmetry corresponding to real inputs.

jleman · September 23, 2020, 8:11pm

Electromagnetics

jleman · September 24, 2020, 6:08am

Ok, I think I see what you mean. I also had some issues in my simplified code that were causing problems in the ifft algorithms. Here is what I ended up with:

using FFTW
using SpecialFunctions
using BenchmarkTools

#function use to populate frequency domain array
function bessk0(ω)
    if ω == 0
        result = -1e-6*log(20.6/5)/log(20/0.05) + 0im
    else
        result = -1e-6*exp(1im*ω*2)*
            ((besselk(0,abs(ω)*5) - besselk(0,abs(ω)*20.6))/(besselk(0,abs(ω)*0.05) - besselk(0,abs(ω)*20)))
    end
    return result
end

function ifftApproach()
    N = 2^8
    Ndiv2 = Int(floor(N/2))
    pad = zeros(N+Ndiv2)
    ω = range(-4,4, length = N+1) #n+1 to include the zero term, for use with ifft
    fd1 = bessk0.(ω)[1:end-1]
    fd1pad = vcat(pad,fd1,pad)|>fftshift
    p = plan_ifft(fd1pad,1)
    sd1 = p*fd1pad
    #sd1 = zeros(ComplexF64,length(fd1pad))
    #FFTW.mul!(sd1,p,fd1pad) #negligible reduction using mul!
    return sd1
end

function irfftApproach()
    N = 2^8
    Ndiv2 = Int(floor(N/2))
    pad = zeros(N+Ndiv2)
    ωh = range(0,4, length = Ndiv2+1) #for use with irfft
    fd2 = bessk0.(ωh)
    fd2pad = vcat(fd2,pad)
    p = plan_irfft(fd2pad,1024,1)
    sd2 = p*fd2pad
    #sd2 = zeros(1024)
    #FFTW.mul!(sd2,p,fd2pad) #negligible reduction using mul!
    return sd2
end

@btime sd1 = ifftApproach()
@btime sd2 = irfftApproach()

The result of the timing comparison is:

  380.600 μs (48 allocations: 62.78 KiB)
  212.500 μs (50 allocations: 24.86 KiB)

Nearly cut in half. I tried the mul! function (commented out lines), but saw neglible impact.
The cases produced effectively identical results. Thanks again.