Optimizing a Spectral Proper Orthogonal Decomposition Method

I have written a Spectral Proper Orthogonal Decomposition (doi:10.1017/jfm.2018.283) algorithm in both Python and Julia, and I've found my Python code to be about 5x faster than the Julia code. I expected the performance to be at least comparable, since the Python code makes heavy use of NumPy and SciPy, but if anything I was expecting Julia to be faster. I was hoping to get some advice on where I could make some performance improvements in Julia.

I will try to give a brief explanation of the major steps as they appear in the code; section 3 of the linked paper gives an excellent description of the algorithm. The input is an array whose first dimension indexes temporal snapshots of some field and whose remaining dimensions are spatial. I've included my TimerOutputs sectioning as well.

using FFTW, Statistics, LinearAlgebra, TimerOutputs

function do_SPOD(inp, nBlocks, nOverlap, dt)
    to = TimerOutput()
    # Reshape the input into a matrix of dims (time, npts) to make things simpler,
    # but remember the inputs dims.
    # It should be given in shape (time, spatial dimensions).
    dims = size(inp)
    inp = reshape(inp, (dims[1], :))

    # I've cut out some preamble here that addresses some configuration options,
    # but the preamble consumes a very small fraction of the runtime. 
    # Assuming the input is real cuts half of the frequencies we need to consider.
    # Essentially a series of literals for the test case.
    window = 1; window_mean = 1; weights = 1; is_real = true;
    n_snapshots = fld((dims[1] - nOverlap), nBlocks) + nOverlap
    nDFT = fld(n_snapshots, 2) + 1
    kappa = dt / nBlocks
    freq = rfftfreq(n_snapshots, 1 / dt)  # rfft keeps only the non-negative frequencies

    # We want to perform a series of FFTs, use a FFT plan to speed up the process
    @timeit to "FFT_Plan" fft_plan = plan_rfft(inp[1:n_snapshots, :], 1)

    # Now we separate our input data into nBlocks overlapping ensembles,
    # each overlapping by nOverlap. Perform the FFT on each ensemble. 
    # The overlap means we can't operate on inp in-place. Initialise an array to store the data in.
    @timeit to "Assign Qhat" Qhat = Array{ComplexF64, 3}(undef, nBlocks, nDFT, size(inp, 2))

    # This for loop takes ~15% of the runtime according to TimerOutputs
    # For a test run with input data of shape (5000, 6875), nBlocks = 38, nOverlap = 128
    # Julia takes 1.1s, Python takes 0.5s on my machine
    @timeit to "Perform FFTs" for i = 0:nBlocks-1
        # Separate operations here for a bit of clarity
        # The indices which denote the start and end of the ensemble
        indx_1 = i * fld((size(inp, 1) - nOverlap), nBlocks) + 1
        indx_2 = (i + 1) * fld((size(inp, 1) - nOverlap), nBlocks) + nOverlap
        
        Qhat[i+1, :, :] = fft_plan * (inp[indx_1:indx_2, :] .* window ./ (n_snapshots ^ 0.5 * window_mean))
    end
    
    # Scale Qhat by κ and then do the permutation from (ensembles, frequencies, space) to (frequencies, ensembles, space)
    Qhat *= kappa ^ 0.5
    Qhat = permutedims(Qhat, (2, 1, 3))
    
    # Now perform the eigendecomposition and resolve the spatial representations of each mode
    # Expensive part: ~75% of the runtime
    # ~6s in Julia, ~1.5s in Python
    @timeit to "Eigendecomposition" eig_vals, spatial_rep = Do_EigenDecomposition.(eachslice(Qhat, dims = 1), weights, nBlocks)
    
    show(to)

    return freq, eig_vals, spatial_rep
end  
        
function Do_EigenDecomposition(Q_f_k, weights, nBlocks::Int)
    M_f_k = conj(Q_f_k) * transpose(Q_f_k .* weights) / nBlocks

    # Solve the eigendecomposition
    eigvals, eigfuns = eigen(M_f_k)

    # Arrange eigenvalues and functions in descending order of eigenvalue
    perm = sortperm(eigvals, rev = true)
    eig_vals = eigvals[perm]
    eigfuns = eigfuns[:, perm]  # eigenvectors are the columns of eigfuns

    # Resolve the spatial representations of the eigenmodes
    spatial_rep = permutedims(transpose(Q_f_k) * (eigfuns * Diagonal(eig_vals) ^ -0.5) / nBlocks ^ 0.5, (2, 1))
    return eig_vals, spatial_rep
end

function main()
    
    # Generate some dummy data, same shape as the real data I used
    p = (100 * rand(Float64, (4992, 6825)) .- 200);
    dt = 0.01;
   
    freq, eigvals, eigfuns = do_SPOD(p[1:4992, :, :], 38, 128, dt);

end

main()

I used TimerOutputs to determine where the expensive sections were, primarily the FFT and the eigendecomposition, which is hardly surprising. But the fact that the SciPy versions of these appear noticeably faster than the Julia versions surprised me. Am I missing anything in the way I'm setting up these operations? I have significant allocations in the FFT loop, and I was hoping I could avoid some intermediate allocations with the @. macro, but I couldn't get my head around how to massage the slices of the input matrix into a shape amenable to that.

Any advice is appreciated. Cheers.

Could you try using MKL.jl and checking if that helps with the eigen performance? It usually does on Intel processors.
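For example, loading it is as simple as this (assuming MKL.jl is already added to your environment; BLAS.get_config() requires Julia 1.7+):

using MKL            # swaps the default OpenBLAS backend for MKL; load it early
using LinearAlgebra
BLAS.get_config()    # check which BLAS backend is now active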

You can probably gain a little bit by passing flags = FFTW.MEASURE to plan_rfft, which is not the default.
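That is, something like:

fft_plan = plan_rfft(inp[1:n_snapshots, :], 1; flags = FFTW.MEASURE)

MEASURE spends extra time up front benchmarking candidate FFT algorithms for your exact array shape, so planning is slower but each subsequent transform is typically faster.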

However, after running your code, about 40% of the time seems to be spent on the computation of spatial_rep. The computation of M_f_k is about 30%, while eigen itself corresponds to a tiny 1%. So what I would recommend is to rethink the order of the dimensions in each array. You might also see some gains by computing these arrays using actual for loops instead of using permutedims / transpose / Diagonal.

Note that arrays in Julia are column-major, which is the opposite of numpy. You can have important gains if you take this into account. In particular, in the call to Do_EigenDecomposition, the optimal would be to perform eachslice along dimension 3:

# Before:
# @timeit to "permute Qhat" Qhat = permutedims(Qhat, (2, 1, 3))
# @timeit to "Eigendecomposition" eig_vals, spatial_rep =
#   Do_EigenDecomposition.(eachslice(Qhat, dims = 1), weights, nBlocks)

@timeit to "permute Qhat" Qhat = permutedims(Qhat, (1, 3, 2))
@timeit to "Eigendecomposition" eig_vals, spatial_rep =
    Do_EigenDecomposition.(eachslice(Qhat, dims = 3), weights, nBlocks)

Finally, in several places it would be possible to reduce allocations, but I think these changes should already get you quite far.
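For example, one such change (a sketch with illustrative names, and modulo your conjugation/weights convention): if Q_f_k is laid out (space, ensembles), M_f_k can be accumulated into a buffer allocated once outside the frequency loop:

M_buf = Matrix{ComplexF64}(undef, nBlocks, nBlocks)  # reused across frequencies
mul!(M_buf, Q_f_k', Q_f_k)                           # M_buf = Q_f_k' * Q_f_k, no temporary
M_buf ./= nBlocks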

Appreciate the advice, I forgot that Julia is column-major. Making those dimension-ordering adjustments halved the FFT time and reduced the eigendecomposition time by 33%. The transpose and permutedims operations on spatial_rep are also gone. Converting Do_EigenDecomposition to a for loop had negligible effect. Still about 2-3x slower than Python though, unfortunately.

Regarding the usage of MKL.jl, my machine runs AMD processors.

You should still try MKL. It’s often better even for AMD.

Keep in mind that array slices produce copies. Use views to avoid this.
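For example:

x = rand(1000)
y = x[1:500]        # slice: allocates a fresh 500-element copy
z = @view x[1:500]  # view: no copy, z reads and writes the same memory as x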

Things like

Qhat *= kappa ^ 0.5

also produce a full new allocation of the entire Qhat array. Do

Qhat .*= sqrt(kappa) 

to avoid this. I think you have a lot of stuff like that in there. (btw, you should get rid of the x^0.5 habit. It doesn’t matter here, but in tight loops this is much slower than sqrt(x)).
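For example, for scalars:

using BenchmarkTools
x = 2.0
@btime $x ^ 0.5   # generic floating-point pow
@btime sqrt($x)   # a single hardware instruction on most CPUs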


You can also use mul! with the fft_plan to write directly into your preallocated output array (rather than allocating a new array and copying into your output). Beware that in Julia, Qhat[i+1, :, :] is discontiguous; writing to a contiguous slice requires you to re-order your indices to Qhat[:, :, i+1].
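For example, a minimal standalone sketch (toy sizes and names, not your exact code):

using FFTW, LinearAlgebra
x = rand(Float64, 256, 8)                    # (time, space) toy block
p = plan_rfft(x, 1)                          # transform along the time dimension
Qout = Array{ComplexF64}(undef, 129, 8, 4)   # (freq, space, block); 129 = 256 ÷ 2 + 1
mul!(@view(Qout[:, :, 1]), p, x)             # fills the slice in place, no temporary array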

I’ll give the MKL library a go and see if that affects the timings. I tried using the MKL FFTW provider and saw negligible difference, but maybe it will improve the eigen() call.

I'll keep in mind the comment regarding ^. Is the best way to use @view to assign the view to a new name and then apply that? I tried using view() directly on a slice in the middle of an operation, which failed, but creating a new @view and then using that worked. For example,

indx_1 = i * fld((size(inp, 1) - nOverlap), nBlocks) + 1
indx_2 = (i + 1) * fld((size(inp, 1) - nOverlap), nBlocks) + nOverlap

# The original version, after shuffling of the dimensions for memory contiguity
Qhat[:, :, i+1] = fft_plan * (inp[:, indx_1:indx_2] .* window ./ (n_snapshots ^ 0.5 * window_mean))

# The first view version I tried; this gives a bounds error
Qhat[:, :, i+1] = fft_plan * (view(inp[:, indx_1:indx_2]) .* window ./ (sqrt(n_snapshots) * window_mean))

# Writing the view to a variable then applying works fine
ensemble_space = @view inp[:, indx_1:indx_2]
Qhat[:, :, i+1] = fft_plan * (ensemble_space .* window ./ (sqrt(n_snapshots) * window_mean))

I should probably do the final case regardless; I expect the cost of this is negligible, if any, and it improves clarity. The other place I would in theory use a view is in the calculation of M_f_k:

# Calculate the M_f_k matrix, initial form
M_f_k = Qhat[:, :, freq_indx]' * Qhat[:, :, freq_indx] * weights / nBlocks

# In @view form
freq_ensemble = @view Qhat[:, :, freq_indx]
M_f_k = freq_ensemble' * freq_ensemble * weights / nBlocks

The issue I thought I was going to hit was with the adjoint operator ': I thought that would modify the view 'in-place' and mess with the data that the view points to, but it doesn't seem to. Is it safest to assume that operations on arrays generate copies unless calling the in-place ! version? And is a copy of a view also a view of the same data?

Regarding the usage of in-place multiplication, that worked well in the eigendecomposition. But using mul!() with the FFT plan has given me some issues. Namely:

ERROR: LoadError: MethodError: no method matching mul!(::Matrix{ComplexF64}, ::FFTW.rFFTWPlan{Float32, -1, false, 2, Int64}, ::Matrix{ComplexF64}, ::Bool, ::Bool)

I know the signature I am passing is (Matrix, FFTWPlan, Matrix), so I'm assuming the function I call forwards to another method with these Bool arguments underneath. Is the best way to investigate these functions simply to read the source code? Or is there a REPL method to grab the source code for a function, given a particular signature?

Why are you using a Float32-precision plan_rfft with inputs that are a Matrix{ComplexF64}? Shouldn't the inputs be a Matrix{Float32}?

Regarding the fact that it is complex: I tried casting the input of the FFT operation to complex to match the element type of the destination array, but I think that was unnecessary. I saw the same behaviour leaving it as Float64.

It took a bit longer to work out why the FFT plan was using Float32 while mul!() was seeing Float64, but I realised the scalar I'm applying to the view is a Float64.

# Create the view- inp is now a Float64 array from the start.
ensemble_space = @view inp[:, indx_1:indx_2]
# This is the version I was playing with when I got the quoted error message.
# mul!(Qhat[:, :, i+1], fft_plan, (complex(ensemble_space) .* window ./ (sqrt(n_snapshots) * window_mean)))
# This is the version I have now, which works; I cast the inp Array to Float64 at the start of the fn
mul!(Qhat[:, :, i+1], fft_plan, (ensemble_space .* window ./ (sqrt(n_snapshots) * window_mean)))

I'm now hitting errors further down, when trying to sort the eigenvalues, which is not surprising. The only thing that surprises me is that the error didn't appear earlier; the eigenvalues should always have been complex. So something is different about the data in Qhat when assigning with mul!() compared to assigning with "Qhat .= fft_plan * …"?

# Assign two versions to compare
Qhat1 = Array{ComplexF64, 3}(undef, size(inp, 1), nDFT, nBlocks)
Qhat2 = Array{ComplexF64, 3}(undef, size(inp, 1), nDFT, nBlocks)

# More code here

# The mul! version
mul!(Qhat1[:, :, i+1], fft_plan, (ensemble_space .* window ./ (sqrt(n_snapshots) * window_mean)))
# The .= version
Qhat2[:, :, i+1] .= fft_plan * (ensemble_space * window) ./ (sqrt(n_snapshots) * window_mean)

Qhat1 == Qhat2 # False

Why is this the case? Am I missing something obvious here? The eigenvalues of the slices of the .= version are Float64s, while using mul!() they are ComplexF64s, despite both Qhats being arrays of ComplexF64. I expect the imaginary parts are all effectively 0 so it casts them to Reals.

I'll just address this single point right now:

mul!(Qhat1[:, :, i+1], fft_plan, (ensemble_space .* window ./ (sqrt(n_snapshots) * window_mean)))

Slices make copies, so here you create a copy, write the output to it, then throw it away(!), because you're not keeping any reference to it. You need a view.
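I.e., something like this (sketch):

mul!(@view(Qhat1[:, :, i+1]), fft_plan, ensemble_space .* window ./ (sqrt(n_snapshots) * window_mean))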

Adjoint (') neither mutates nor copies, it’s a lazy operation. From the docs:

Lazy wrapper type for an adjoint view of the underlying linear algebra object, usually an AbstractVector / AbstractMatrix , but also some Factorization , for instance. Usually, the Adjoint constructor should not be called directly, use adjoint instead. To materialize the view use copy .

So the adjoint remembers to switch the order of the indices, but often it is simply used for dispatching to the correct BLAS method. You can consider it both efficient and safe.
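For example:

A = rand(2, 2)
B = A'               # Adjoint wrapper: no copy, and A is not mutated
B[1, 2] == A[2, 1]   # true; indexing is just remapped
C = copy(A')         # materializes a plain Matrix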

BTW, a tip:

The @ sign is used to indicate macros in Julia, but on discourse, @ is what you use to tag/mention other users. So if you write @DNF, I get a notification. If there’s someone on discourse with the user name “view” they will get a notification. So it’s good practice to wrap macro names in backticks to avoid that: @view.

Here’s a typo. It’s either @view(inp[:, indx_1:indx_2]) or view(inp, :, indx_1:indx_2), not view(inp[:, indx_1:indx_2]). The latter will first create a slice (i.e. a copy), and then pass that whole array into the view function, with no index specifications, which doesn’t make sense.

There seem to be quite a few opportunities for performance improvements in your code. I don't have time right now, though.

Ah, of course, I didn't think of the space being written to by mul! as a copy of the original space. That brings the speed down to match the SciPy FFT. The eigendecomposition is still a bit slower, but I think I have learned enough from this thread to make some more improvements.

Yep, I do remember reading that now that you mention it; I'll keep it in mind for the future. Really appreciate the pointers, it's hard to get out of the habit of thinking about things Pythonically after so many years.

Interesting (do you mean for floats/scalars, or especially for matrices?). I think I see why, since ^ is more general, but given that Julia already has some tricks for integer powers, couldn't (and shouldn't) Julia be made to do this transformation for you for a power of 0.5 (and e.g. 0.25)? And maybe for 1/3 (cube root) too, though it's unclear it should, since 1/3 isn't exact; is that power less accurate than cbrt?

It seems to me the transformation (generic code) could/would work for all types, including matrices.

I noticed it's really slow on first use (should it be in the sysimage?):

julia> @time [2.0 3.0; 4.0 5.0]^0.5
 12.733727 seconds (2.89 M allocations: 143.074 MiB, 0.49% gc time, 100.00% compilation time)

This is also slow, actually 4 sec slower on first use, and strangely it reports the wrong time, i.e. the "compilation time" is missing:

julia> @time sqrt([2.0 3.0; 4.0 5.0])
  0.000216 seconds (17 allocations: 1.750 KiB)

Both are fast on second use, sqrt slightly faster.

The degenerate matrix doesn’t work (maybe intentionally):

julia> sqrt([2])
ERROR: MethodError: no method matching sqrt(::Vector{Int64})

About:

Qhat .*= sqrt(kappa)

I was really confused at first how changing to sqrt would do that, overlooking that you meant the change to .*= was the main speedup.

I meant specifically for scalars, but it may be the same for matrices.

Not a Matrix :wink:

julia> sqrt([2;;])
1×1 Matrix{Float64}:
 1.4142135623730951

I knew that; in e.g. MATLAB, and in maths, [2] is a matrix. While it's the Vector type in Julia, vectors are also matrices mathematically, and sqrt or a matrix power requires them to be square, so it's not a problem, except in the degenerate case… which is why I mentioned "intentionally".

I’m not too worried about the degenerate case, but (after first use) sqrt is 3.9 times slower (using @time) or 7.2 times (timed with @btime) because of allocations:

julia> @btime sqrt([2.0;;]);
  1.091 μs (15 allocations: 1.16 KiB)

julia> @btime [2.0;;]^0.5;
  151.145 ns (2 allocations: 128 bytes)

Your mileage may vary; likely this reverses with larger matrices.

Note that sqrt([2.0;;]) goes through the algorithm for square roots of general matrices (intended for much larger ones), which is of course complete overkill here. This is probably the cause of the long runtime, not the allocations.

I don’t think I agree with that. A vector isn’t a degenerate matrix, they are fundamentally different things. A matrix is a linear operator, while a vector is an inhabitant of R^n. It’s not the same at all.

(Yeah, yeah, it doesn’t have to be R^n, but any n-dimensional vector space. I guess this is complicated, I should stop writing before things get too confused. But this is one of the reasons that it’s great that Julia distinguishes vectors and matrices in the type domain, because they are very different things in the mathematical sense.)

I know, I was just investigating sqrt vs. ^0.5, and sqrt is actually faster for 2x2 and 3x3, so don't go avoiding it; allocations stay at 15 for sqrt but go up to 22 for ^0.5.