I have written a Spectral Proper Orthogonal Decomposition (doi:10.1017/jfm.2018.283) algorithm in both Python and Julia, and I’ve found my Python code to be about 5x faster than the Julia code. I expect the performance to be comparable, since the Python code makes heavy usage of NumPy and SciPy, but I was still expecting Julia to be faster. I was hoping to get some advice on where I could make some performance improvements in Julia.

I will try to give a brief explanation of the major steps as they appear in the code, and section 3 in the paper linked gives an excellent description of the algorithm. The input is an array with the first dimension being temporal snapshots of some field and the following dimensions are the spatial dimensions. I’ve included my TimerOutputs sectioning as well.

```
using FFTW, Statistics, LinearAlgebra, TimerOutputs
function do_SPOD(inp, nBlocks, nOverlap, dt)
to = TimerOutput()
# Reshape the input into a matrix of dims (time, npts) to make things simpler,
# but remember the inputs dims.
# It should be given in shape (time, spatial dimensions).
dims = size(inp)
inp = reshape(inp, (dims[1], :))
# I've cut out some preamble here that addresses some configuration options,
# but the preamble consumes a very small fraction of the runtime.
# Assuming the input is real cuts half of the frequencies we need to consider.
# Essentially a series of literals for the test case.
window = 1; window_mean = 1; weights = 1; is_real = true;
n_snapshots = fld((dims[1] - nOverlap), nBlocks) + nOverlap
nDFT = fld(n_snapshots, 2) + 1
kappa = dt / nBlocks
freq = fftfreq(n_snapshots, 1 / dt)
# We want to perform a series of FFTs, use a FFT plan to speed up the process
@timeit to "FFT_Plan" fft_plan = plan_rfft(inp[1:n_snapshots, :], 1)
# Now we seperate our input data into nBlocks overlapping ensembles,
# each overlapping by nOverlap. Perform the FFT on each ensemble.
# The overlap means we can't operate on inp in-place. Initialise an array to store the data in.
@timeit to "Assign Qhat" Qhat = Array{ComplexF64, 3}(undef, nBlocks, nDFT, size(inp, 2))
# This for loop takes ~15% of the runtime according to TimerOutputs
# For a test run with input data of shape (5000, 6875), nBlocks = 38, nOverlap = 128
# Julia takes 1.1s, Python takes 0.5s on my machine
@timeit to "Perform FFTs" for i = 0:nBlocks-1
# Separate operations here for a bit of clarity
# The indices which denote the start and end of the ensemble
indx_1 = i * fld((size(inp, 1) - nOverlap), nBlocks) + 1
indx_2 = (i + 1) * fld((size(inp, 1) - nOverlap), nBlocks) + nOverlap
Qhat[i+1, :, :] = fft_plan * (inp[indx_1:indx_2, :] .* window ./ (n_snapshots ^ 0.5 * window_mean))
end
# Scale Qhat by κ and then do the permutation from (ensembles, frequencies, space) to (frequencies, ensembles, space)
Qhat *= kappa ^ 0.5
Qhat = permutedims(Qhat, (2, 1, 3))
# Now perform the eigendecomposition and resolve the spatial representations of each mode
# Expensive part: ~75% of the runtime
# ~6s in Julia, ~1.5s in Python
@timeit to "Eigendecomposition" eig_vals, spatial_rep = Do_EigenDecomposition.(eachslice(Qhat, dims = 1), weights, nBlocks)
show(to)
return freq, eig_vals, spatial_rep
end
function Do_EigenDecomposition(Q_f_k, weights, nBlocks::Int)
M_f_k = conj(Q_f_k) * transpose(Q_f_k .* weights) / nBlocks
# Solve the eigendecomposition
eigvals, eigfuns = eigen(M_f_k)
# Arrange eigenvalues and functions in descending order of eigenvalue
perm = sortperm(eigvals, rev = true)
eig_vals = eigvals[perm]
eigfuns = eigfuns[perm, :]
# Resolve the spatial representations of the eigenmodes
spatial_rep = permutedims(transpose(Q_f_k) * (eigfuns * Diagonal(eig_vals) ^ -0.5) / nBlocks ^ 0.5, (2, 1))
return eig_vals, spatial_rep
end
function main()
# Generate some dummy data, same shape as the real data I used
p = (100 * rand(Float64, (4992, 6825)) .- 200);
dt = 0.01;
freq, eigvals, eigfuns = do_SPOD(p[1:4992, :, :], 38, 128, dt);
end
main()
```

I used TimerOutputs to determine where the expensive section were, primarily the FFT and the Eigendecomposition which is hardly surprising. But the fact the SciPy versions of these appear noticeably faster than the Julia versions surprised me. Am I missing anything with the way I’m setting up these operations? I have significant allocations in the FFT loop, and I was hoping I could perhaps avoid some intermediate allocations with the @. macro, but I couldn’t get my around how to massage the slices of the input matrix into a shape amenable to that.

Any advice is appreciated. Cheers.