About In-line-fft

Hi Experts ,I’m just studing Julia But I’m in truble. Berrow 2D turblence simulation code is that I translated from Matlab is use too much memory. At my Simulation environment it use approx. 32 GB. I found some info about In-Line fft for saving memory. But It is not working well, I mean not saving memory. I would like to borrow your help. Sorry for my bad English

using FFTW
using Printf
using LinearAlgebra        
using MAT
# set parameters
N               = 128                      ;    # N*N grid,  
dt              = 0.02                     ;    # timestep
simutimeSeconds = 10                       ;

HowOftenSave    = 10                       ;    # save velocity field per 'HowOftenSave' timesteps
HowOftenVisu    = 10                       ;    # show illustration per 'HowOftenVisu' timesteps
simutimeSteps   = round(simutimeSeconds/dt);

nu              = 1/1000                   ;    # kinematic viscosity [nu] = m^2/s
L               = 2*pi                     ;    # domain side length
dx              = L/N                      ;    # grid spacing assumed dx=dy

Umax            = 1                        ;
F_0             = 0.03                     ;

# create fields
x = (L-dx)*(0:(N-1))/(N-1)                 ;    # to be precise, dx is subtracted here 
y = (L-dx)*(0:(N-1))/(N-1)                 ;    # otherwise the derivative of e.g. sin(x) wouldnt be differentiable on this periodic grid  
X=zeros(N,N);
Y=zeros(N,N);
for jx=1:N
    for ix=1:N
        X[ix,jx]=x[jx]
        Y[ix,jx]=y[ix]
    end
end

# wavenumbers
kx1 = mod.(1/2 .+ (0:(N-1))/N, 1) .- 1/2;
ky1 = mod.(1/2 .+ (0:(N-1))/N, 1) .- 1/2;
kx  = kx1*(2*pi/dx); # wavenumbers
ky  = ky1*(2*pi/dx); # wavenumbers

KX=zeros(N,N);
KY=zeros(N,N);
for jkx=1:N
    for ikx=1:N
        KX[ikx,jkx]=kx[jkx]
        KY[ikx,jkx]=ky[ikx]
    end
end

# Anti-aliasing filter based on the 2/3*kNyq rule
AA = (abs.(KX) .< (2/3) * maximum(kx)) .* (abs.(KY) .< (2/3) * maximum(ky)) 

# Runge-Kutta 4 coefficients
a=[1/6 1/3 1/3 1/6]; b=[1/2 1/2 1 1];

#--  Taylor-Green --#
kmax = 4;
U = -Umax .* cos.(kmax .* X) .* sin.(kmax .* Y)
V =  Umax .* sin.(kmax .* X) .* cos.(kmax .* Y)
U = U + randn(N,N)*F_0;
V = V + randn(N,N)*F_0;

Uhat = fft(U)
Vhat = fft(V)

Uhat = Uhat - (KX .* Uhat + KY .* Vhat) .* KX ./ (KX .^ 2 + KY .^ 2)
Uhat[isnan.(Uhat)] .= 0

Vhat = Vhat - (KX .* Uhat + KY .* Vhat) .* KY ./ (KX .^ 2 + KY .^ 2)
Vhat[isnan.(Vhat)] .= 0

# U = p_ifft * Uhat;
# V = p_ifft * Vhat;

matwrite("uv0000.mat", Dict("Uhat" => U, "Vhat" => V))

dUdx = zeros(N, N)
dUdy = zeros(N, N)
dVdx = zeros(N, N)
dVdy = zeros(N, N)
dUconv = complex(zeros(N, N))
dVconv = complex(zeros(N, N))
dUdiff = complex(zeros(N, N))
dVdiff = complex(zeros(N, N))
dU = complex(zeros(N, N))
dV = complex(zeros(N, N))
Uc = complex(zeros(N, N))
Vc = complex(zeros(N, N))
Uold = complex(zeros(N, N))
Vold = complex(zeros(N, N))


p_fft_U = plan_fft(U; flags=FFTW.ESTIMATE)
p_ifft_Uhat = plan_ifft(Uhat; flags=FFTW.ESTIMATE)
p_fft_V = plan_fft(V; flags=FFTW.ESTIMATE)
p_ifft_Vhat = plan_fft(Vhat; flags=FFTW.ESTIMATE)
p_ifft_dUdx = plan_ifft((1im * KX .* AA .* Uc); flags=FFTW.ESTIMATE)
p_ifft_dUdy = plan_ifft((1im * KY .* AA .* Uc); flags=FFTW.ESTIMATE)
p_ifft_dVdx = plan_ifft((1im * KX .* AA .* Vc); flags=FFTW.ESTIMATE)
p_ifft_dVdy = plan_ifft((1im * KY .* AA .* Vc); flags=FFTW.ESTIMATE)
p_fft_dUconv = plan_fft((dt*(-U .* dUdx -V .* dUdy) / 2); flags=FFTW.ESTIMATE)
p_fft_dVconv = plan_fft((dt*(-U .* dVdx -V .* dVdy) / 2); flags=FFTW.ESTIMATE)

# Navier-Stokes simulation begins
@time begin
for t=1:simutimeSteps
    @printf("| Time step # %04d |\n",t)

    #SolveNavierStokes2D;
    Uold = Uhat; 
    Vold = Vhat;
    Uc   = Uhat; 
    Vc   = Vhat;

    for rk in 1:4

        U = real.(p_ifft_Uhat*(AA .* Uc))
        V = real.(p_ifft_Vhat*(AA .* Vc))

        
        dUdx .= real.(p_ifft_dUdx * (1im * KX .* AA .* Uc));
        dUdy .= real.(p_ifft_dUdy * (1im * KY .* AA .* Uc));
        dVdx .= real.(p_ifft_dVdx * (1im * KX .* AA .* Vc));
        dVdy .= real.(p_ifft_dVdy * (1im * KY .* AA .* Vc));
        dUconv .= p_fft_dUconv*(dt*(-U .* dUdx -V .* dUdy) / 2);
        dVconv .= p_fft_dVconv*(dt*(-U .* dVdx -V .* dVdy) / 2);
        
        dUdiff .= nu * dt * ((-KX .* KX .* Uc) + (- KY .* KY .* Uc)) ;
        dVdiff .= nu * dt * ((-KX .* KX .* Vc) + (- KY .* KY .* Vc)) ;

        dU .= dUconv + dUdiff;
        dV .= dVconv + dVdiff;

        Uhat = Uhat-(KX.*Uhat + KY.*Vhat).*KX./(KX.^2 + KY.^2); 
        Uhat[isnan.(Uhat)] .= 0;
        Vhat = Vhat-(KX.*Uhat + KY.*Vhat).*KY./(KX.^2 + KY.^2); 
        Vhat[isnan.(Vhat)] .= 0;

        if rk < 4
            Uhat = Uold + b[rk] * dU
            Vhat = Vold + b[rk] * dV
        end        

        Uc = Uc + a[rk] * dU;
        Vc = Vc + a[rk] * dV;

    end

    Uhat = Uc
    Vhat = Vc

    Uhat = Uhat - ((KX .* Uhat + KY .* Vhat) .* KX ./ (KX .^ 2 + KY .^ 2))
    Uhat[isnan.(Uhat)] .= 0

    Vhat = Vhat - ((KX .* Uhat + KY .* Vhat) .* KY ./ (KX .^ 2 + KY .^ 2))
    Vhat[isnan.(Vhat)] .= 0

    U = real.(p_ifft_Uhat*Uhat);  
    V = real.(p_ifft_Vhat*Vhat);

    # SaveInstantField2D
    if mod(t, HowOftenSave) == 0
    filename = "uv$(@sprintf("%04d", t)).mat"
    matwrite(filename, Dict("U" => U, "V" => V))
    end

end 
end

Have you read

https://docs.julialang.org/en/v1/manual/performance-tips/

?

It seems that you are running your code in global scope, which is terrible for performance and will lead to excessive allocations. Make sure that your code is wrapped in a function when you run it.

Also I’m not a differential equation expert but the references to “Runge-Kutta coefficients” in your code make me suspect that you are rolling your own rk4 implementation here? If that’s the case I would concur with Oscar’s advice in a thread a few weeks ago:

2 Likes

Thanks for the advice.
I have tried to enclose all code in functions.
However, nothing has changed.
Specifically, I wrapped everything from “Setparameter” to “SaveInstantFields2D” in functions.
Then call function for simulation.

I honestory want to run rk4 from DefferentialEquattion.jl. But my Lab’s rk4 code progression is little unique so I cannot write rk4 from pakkage…

In that case I suppose you will have to start profiling your code to see where the time is spent and allocations are coming from using something like

https://docs.julialang.org/en/v1/manual/profile/

Here’s a whole recent thread on profiling allocations:

2 Likes

There are a few performance gotaches in your code that can be easily addressed:

  1. The FFT plans you created allocate extra memory whenever you apply them with the * operator. To avoid that you should preallocate storage for those results and use the mul! function to evalute the FFT into a buffer. This is mentioned in the docs here (which unfortunately are separated from the docs of FFTW.jl) API · AbstractFFTs.jl

  2. I see you spliced in many . do broadcast operations, however, there are a few lines where you missed one or two. Perhaps consider the @. macro instead that automatically applies . to every operation in the expression it preceeds, see Arrays · The Julia Language

  3. Try following the performance tips that were linked earlier. In particular try wrapping everything (including the non-constant parameters) into a function.

  4. Try using a profiler like BenchmarkTools.jl, PProf.jl or ProfileView.jl to get detailed infos of where your code is bottlenecked.

2 Likes

I don’t believe this code is doing what it is supposed to do or at least it is horribly inefficient in what it tries to do. I put some annotations below.

To use inplace ffts, I think it is sufficient to append a “!” to your plan_fft calls.

2 Likes

To apply a plan with no allocations, you have two choices:

  1. Use a pre-allocated output array and use mul!, e.g. mul!(output_array, p_ifft_dUdx, output_array, input_array)
  2. Create an in-place plan with plan_fft!

Note that you have lots of other unnecessary memory allocations.

  • e.g. 1im * KX .* AA .* Uc allocates two arrays, one for 1im * KX and one for for the result of the other multiplications (since the 1im * operation is not fused, see e.g. the more dots blog post). If you want to make it fully allocation-free you would also want to pre-allocate the output, e.g. @. dUdx_fourier = 1im * KX * AA * Uc (@. makes it a lot more readable), then do dUdx .= real.(p_ifft_dUdx * dUdx_fourier) with your in-place plan (via plan_fft!).
  • complex(zeros(N, N)) allocates an array of real zeros, then allocates a new array of complex zeros. To use a single allocation, do zeros(ComplexF64, N, N).
  • dU .= dUconv + dUdiff allocates an array for the right hand side, because you didn’t use .+
  • Uhat = Uhat-(KX.*Uhat + KY.*Vhat).*KX./(KX.^2 + KY.^2) allocates, because you didn’t use .+ or .=. (Would have been clearer just to use @.)
  • Uhat[isnan.(Uhat)] .= 0 allocates an array for the result of isnan.(Uhat)
  • etc.

PS. There is no need for a semicolon (;) after every statement.

PPS. See the very first performance tip in the manual: performance-critical code should be inside a function, and should not use globals. Simply putting the code inside a function is not enough if your parameters are still global variables, not function arguments.

PPPS. In general, porting Matlab code line-by-line will result in suboptimal Julia code. You take full advantage of Julia by writing in a Julia style, not in a Matlab style.

PPPPS. If your FFT outputs are always real, i.e. your FFT inputs have conjugate symmetry, then you can save a factor of two in time and memory by using the rfft interface (where you only store the non-redundant half in the Fourier domain). This is a little trickier to work with, however, so I would put that off until you’ve gotten everything else going smoothly and have a better understanding of Julia.

6 Likes