# About In-line-fft

Hi Experts ,I’m just studing Julia But I’m in truble. Berrow 2D turblence simulation code is that I translated from Matlab is use too much memory. At my Simulation environment it use approx. 32 GB. I found some info about In-Line fft for saving memory. But It is not working well, I mean not saving memory. I would like to borrow your help. Sorry for my bad English

``````using FFTW
using Printf
using LinearAlgebra
using MAT
# set parameters
N               = 128                      ;    # N*N grid,
dt              = 0.02                     ;    # timestep
simutimeSeconds = 10                       ;

HowOftenSave    = 10                       ;    # save velocity field per 'HowOftenSave' timesteps
HowOftenVisu    = 10                       ;    # show illustration per 'HowOftenVisu' timesteps
simutimeSteps   = round(simutimeSeconds/dt);

nu              = 1/1000                   ;    # kinematic viscosity [nu] = m^2/s
L               = 2*pi                     ;    # domain side length
dx              = L/N                      ;    # grid spacing assumed dx=dy

Umax            = 1                        ;
F_0             = 0.03                     ;

# create fields
x = (L-dx)*(0:(N-1))/(N-1)                 ;    # to be precise, dx is subtracted here
y = (L-dx)*(0:(N-1))/(N-1)                 ;    # otherwise the derivative of e.g. sin(x) wouldnt be differentiable on this periodic grid
X=zeros(N,N);
Y=zeros(N,N);
for jx=1:N
for ix=1:N
X[ix,jx]=x[jx]
Y[ix,jx]=y[ix]
end
end

# wavenumbers
kx1 = mod.(1/2 .+ (0:(N-1))/N, 1) .- 1/2;
ky1 = mod.(1/2 .+ (0:(N-1))/N, 1) .- 1/2;
kx  = kx1*(2*pi/dx); # wavenumbers
ky  = ky1*(2*pi/dx); # wavenumbers

KX=zeros(N,N);
KY=zeros(N,N);
for jkx=1:N
for ikx=1:N
KX[ikx,jkx]=kx[jkx]
KY[ikx,jkx]=ky[ikx]
end
end

# Anti-aliasing filter based on the 2/3*kNyq rule
AA = (abs.(KX) .< (2/3) * maximum(kx)) .* (abs.(KY) .< (2/3) * maximum(ky))

# Runge-Kutta 4 coefficients
a=[1/6 1/3 1/3 1/6]; b=[1/2 1/2 1 1];

#--  Taylor-Green --#
kmax = 4;
U = -Umax .* cos.(kmax .* X) .* sin.(kmax .* Y)
V =  Umax .* sin.(kmax .* X) .* cos.(kmax .* Y)
U = U + randn(N,N)*F_0;
V = V + randn(N,N)*F_0;

Uhat = fft(U)
Vhat = fft(V)

Uhat = Uhat - (KX .* Uhat + KY .* Vhat) .* KX ./ (KX .^ 2 + KY .^ 2)
Uhat[isnan.(Uhat)] .= 0

Vhat = Vhat - (KX .* Uhat + KY .* Vhat) .* KY ./ (KX .^ 2 + KY .^ 2)
Vhat[isnan.(Vhat)] .= 0

# U = p_ifft * Uhat;
# V = p_ifft * Vhat;

matwrite("uv0000.mat", Dict("Uhat" => U, "Vhat" => V))

dUdx = zeros(N, N)
dUdy = zeros(N, N)
dVdx = zeros(N, N)
dVdy = zeros(N, N)
dUconv = complex(zeros(N, N))
dVconv = complex(zeros(N, N))
dUdiff = complex(zeros(N, N))
dVdiff = complex(zeros(N, N))
dU = complex(zeros(N, N))
dV = complex(zeros(N, N))
Uc = complex(zeros(N, N))
Vc = complex(zeros(N, N))
Uold = complex(zeros(N, N))
Vold = complex(zeros(N, N))

p_fft_U = plan_fft(U; flags=FFTW.ESTIMATE)
p_ifft_Uhat = plan_ifft(Uhat; flags=FFTW.ESTIMATE)
p_fft_V = plan_fft(V; flags=FFTW.ESTIMATE)
p_ifft_Vhat = plan_fft(Vhat; flags=FFTW.ESTIMATE)
p_ifft_dUdx = plan_ifft((1im * KX .* AA .* Uc); flags=FFTW.ESTIMATE)
p_ifft_dUdy = plan_ifft((1im * KY .* AA .* Uc); flags=FFTW.ESTIMATE)
p_ifft_dVdx = plan_ifft((1im * KX .* AA .* Vc); flags=FFTW.ESTIMATE)
p_ifft_dVdy = plan_ifft((1im * KY .* AA .* Vc); flags=FFTW.ESTIMATE)
p_fft_dUconv = plan_fft((dt*(-U .* dUdx -V .* dUdy) / 2); flags=FFTW.ESTIMATE)
p_fft_dVconv = plan_fft((dt*(-U .* dVdx -V .* dVdy) / 2); flags=FFTW.ESTIMATE)

# Navier-Stokes simulation begins
@time begin
for t=1:simutimeSteps
@printf("| Time step # %04d |\n",t)

#SolveNavierStokes2D;
Uold = Uhat;
Vold = Vhat;
Uc   = Uhat;
Vc   = Vhat;

for rk in 1:4

U = real.(p_ifft_Uhat*(AA .* Uc))
V = real.(p_ifft_Vhat*(AA .* Vc))

dUdx .= real.(p_ifft_dUdx * (1im * KX .* AA .* Uc));
dUdy .= real.(p_ifft_dUdy * (1im * KY .* AA .* Uc));
dVdx .= real.(p_ifft_dVdx * (1im * KX .* AA .* Vc));
dVdy .= real.(p_ifft_dVdy * (1im * KY .* AA .* Vc));
dUconv .= p_fft_dUconv*(dt*(-U .* dUdx -V .* dUdy) / 2);
dVconv .= p_fft_dVconv*(dt*(-U .* dVdx -V .* dVdy) / 2);

dUdiff .= nu * dt * ((-KX .* KX .* Uc) + (- KY .* KY .* Uc)) ;
dVdiff .= nu * dt * ((-KX .* KX .* Vc) + (- KY .* KY .* Vc)) ;

dU .= dUconv + dUdiff;
dV .= dVconv + dVdiff;

Uhat = Uhat-(KX.*Uhat + KY.*Vhat).*KX./(KX.^2 + KY.^2);
Uhat[isnan.(Uhat)] .= 0;
Vhat = Vhat-(KX.*Uhat + KY.*Vhat).*KY./(KX.^2 + KY.^2);
Vhat[isnan.(Vhat)] .= 0;

if rk < 4
Uhat = Uold + b[rk] * dU
Vhat = Vold + b[rk] * dV
end

Uc = Uc + a[rk] * dU;
Vc = Vc + a[rk] * dV;

end

Uhat = Uc
Vhat = Vc

Uhat = Uhat - ((KX .* Uhat + KY .* Vhat) .* KX ./ (KX .^ 2 + KY .^ 2))
Uhat[isnan.(Uhat)] .= 0

Vhat = Vhat - ((KX .* Uhat + KY .* Vhat) .* KY ./ (KX .^ 2 + KY .^ 2))
Vhat[isnan.(Vhat)] .= 0

U = real.(p_ifft_Uhat*Uhat);
V = real.(p_ifft_Vhat*Vhat);

# SaveInstantField2D
if mod(t, HowOftenSave) == 0
filename = "uv\$(@sprintf("%04d", t)).mat"
matwrite(filename, Dict("U" => U, "V" => V))
end

end
end
``````

Have you read

https://docs.julialang.org/en/v1/manual/performance-tips/

?

It seems that you are running your code in global scope, which is terrible for performance and will lead to excessive allocations. Make sure that your code is wrapped in a function when you run it.

Also I’m not a differential equation expert but the references to “Runge-Kutta coefficients” in your code make me suspect that you are rolling your own rk4 implementation here? If that’s the case I would concur with Oscar’s advice in a thread a few weeks ago:

2 Likes

Thanks for the advice.
I have tried to enclose all code in functions.
However, nothing has changed.
Specifically, I wrapped everything from “Setparameter” to “SaveInstantFields2D” in functions.
Then call function for simulation.

I honestory want to run rk4 from DefferentialEquattion.jl. But my Lab’s rk4 code progression is little unique so I cannot write rk4 from pakkage…

In that case I suppose you will have to start profiling your code to see where the time is spent and allocations are coming from using something like

https://docs.julialang.org/en/v1/manual/profile/

Here’s a whole recent thread on profiling allocations:

2 Likes

There are a few performance gotaches in your code that can be easily addressed:

1. The FFT plans you created allocate extra memory whenever you apply them with the `*` operator. To avoid that you should preallocate storage for those results and use the `mul!` function to evalute the FFT into a buffer. This is mentioned in the docs here (which unfortunately are separated from the docs of `FFTW.jl`) API · AbstractFFTs.jl

2. I see you spliced in many `.` do broadcast operations, however, there are a few lines where you missed one or two. Perhaps consider the `@.` macro instead that automatically applies `.` to every operation in the expression it preceeds, see Arrays · The Julia Language

3. Try following the performance tips that were linked earlier. In particular try wrapping everything (including the non-constant parameters) into a function.

4. Try using a profiler like `BenchmarkTools.jl`, `PProf.jl` or `ProfileView.jl` to get detailed infos of where your code is bottlenecked.

2 Likes

I don’t believe this code is doing what it is supposed to do or at least it is horribly inefficient in what it tries to do. I put some annotations below.

To use inplace ffts, I think it is sufficient to append a “!” to your `plan_fft` calls.

2 Likes

To apply a plan with no allocations, you have two choices:

1. Use a pre-allocated output array and use `mul!`, e.g. `mul!(output_array, p_ifft_dUdx, output_array, input_array)`
2. Create an in-place plan with `plan_fft!`

Note that you have lots of other unnecessary memory allocations.

• e.g. `1im * KX .* AA .* Uc` allocates two arrays, one for `1im * KX` and one for for the result of the other multiplications (since the `1im *` operation is not fused, see e.g. the more dots blog post). If you want to make it fully allocation-free you would also want to pre-allocate the output, e.g. `@. dUdx_fourier = 1im * KX * AA * Uc` (`@.` makes it a lot more readable), then do `dUdx .= real.(p_ifft_dUdx * dUdx_fourier)` with your in-place plan (via `plan_fft!`).
• `complex(zeros(N, N))` allocates an array of real zeros, then allocates a new array of complex zeros. To use a single allocation, do `zeros(ComplexF64, N, N)`.
• `dU .= dUconv + dUdiff` allocates an array for the right hand side, because you didn’t use `.+`
• `Uhat = Uhat-(KX.*Uhat + KY.*Vhat).*KX./(KX.^2 + KY.^2)` allocates, because you didn’t use `.+` or `.=`. (Would have been clearer just to use `@.`)
• `Uhat[isnan.(Uhat)] .= 0` allocates an array for the result of `isnan.(Uhat)`
• etc.

PS. There is no need for a semicolon (`;`) after every statement.

PPS. See the very first performance tip in the manual: performance-critical code should be inside a function, and should not use globals. Simply putting the code inside a function is not enough if your parameters are still global variables, not function arguments.

PPPS. In general, porting Matlab code line-by-line will result in suboptimal Julia code. You take full advantage of Julia by writing in a Julia style, not in a Matlab style.

PPPPS. If your FFT outputs are always real, i.e. your FFT inputs have conjugate symmetry, then you can save a factor of two in time and memory by using the `rfft` interface (where you only store the non-redundant half in the Fourier domain). This is a little trickier to work with, however, so I would put that off until you’ve gotten everything else going smoothly and have a better understanding of Julia.

6 Likes