About In-line-fft

Take1234 · October 24, 2023, 7:41am

Hi Experts ,I’m just studing Julia But I’m in truble. Berrow 2D turblence simulation code is that I translated from Matlab is use too much memory. At my Simulation environment it use approx. 32 GB. I found some info about In-Line fft for saving memory. But It is not working well, I mean not saving memory. I would like to borrow your help. Sorry for my bad English

using FFTW
using Printf
using LinearAlgebra        
using MAT
# set parameters
N               = 128                      ;    # N*N grid,  
dt              = 0.02                     ;    # timestep
simutimeSeconds = 10                       ;

HowOftenSave    = 10                       ;    # save velocity field per 'HowOftenSave' timesteps
HowOftenVisu    = 10                       ;    # show illustration per 'HowOftenVisu' timesteps
simutimeSteps   = round(simutimeSeconds/dt);

nu              = 1/1000                   ;    # kinematic viscosity [nu] = m^2/s
L               = 2*pi                     ;    # domain side length
dx              = L/N                      ;    # grid spacing assumed dx=dy

Umax            = 1                        ;
F_0             = 0.03                     ;

# create fields
x = (L-dx)*(0:(N-1))/(N-1)                 ;    # to be precise, dx is subtracted here 
y = (L-dx)*(0:(N-1))/(N-1)                 ;    # otherwise the derivative of e.g. sin(x) wouldnt be differentiable on this periodic grid  
X=zeros(N,N);
Y=zeros(N,N);
for jx=1:N
    for ix=1:N
        X[ix,jx]=x[jx]
        Y[ix,jx]=y[ix]
    end
end

# wavenumbers
kx1 = mod.(1/2 .+ (0:(N-1))/N, 1) .- 1/2;
ky1 = mod.(1/2 .+ (0:(N-1))/N, 1) .- 1/2;
kx  = kx1*(2*pi/dx); # wavenumbers
ky  = ky1*(2*pi/dx); # wavenumbers

KX=zeros(N,N);
KY=zeros(N,N);
for jkx=1:N
    for ikx=1:N
        KX[ikx,jkx]=kx[jkx]
        KY[ikx,jkx]=ky[ikx]
    end
end

# Anti-aliasing filter based on the 2/3*kNyq rule
AA = (abs.(KX) .< (2/3) * maximum(kx)) .* (abs.(KY) .< (2/3) * maximum(ky)) 

# Runge-Kutta 4 coefficients
a=[1/6 1/3 1/3 1/6]; b=[1/2 1/2 1 1];

#--  Taylor-Green --#
kmax = 4;
U = -Umax .* cos.(kmax .* X) .* sin.(kmax .* Y)
V =  Umax .* sin.(kmax .* X) .* cos.(kmax .* Y)
U = U + randn(N,N)*F_0;
V = V + randn(N,N)*F_0;

Uhat = fft(U)
Vhat = fft(V)

Uhat = Uhat - (KX .* Uhat + KY .* Vhat) .* KX ./ (KX .^ 2 + KY .^ 2)
Uhat[isnan.(Uhat)] .= 0

Vhat = Vhat - (KX .* Uhat + KY .* Vhat) .* KY ./ (KX .^ 2 + KY .^ 2)
Vhat[isnan.(Vhat)] .= 0

# U = p_ifft * Uhat;
# V = p_ifft * Vhat;

matwrite("uv0000.mat", Dict("Uhat" => U, "Vhat" => V))

dUdx = zeros(N, N)
dUdy = zeros(N, N)
dVdx = zeros(N, N)
dVdy = zeros(N, N)
dUconv = complex(zeros(N, N))
dVconv = complex(zeros(N, N))
dUdiff = complex(zeros(N, N))
dVdiff = complex(zeros(N, N))
dU = complex(zeros(N, N))
dV = complex(zeros(N, N))
Uc = complex(zeros(N, N))
Vc = complex(zeros(N, N))
Uold = complex(zeros(N, N))
Vold = complex(zeros(N, N))


p_fft_U = plan_fft(U; flags=FFTW.ESTIMATE)
p_ifft_Uhat = plan_ifft(Uhat; flags=FFTW.ESTIMATE)
p_fft_V = plan_fft(V; flags=FFTW.ESTIMATE)
p_ifft_Vhat = plan_fft(Vhat; flags=FFTW.ESTIMATE)
p_ifft_dUdx = plan_ifft((1im * KX .* AA .* Uc); flags=FFTW.ESTIMATE)
p_ifft_dUdy = plan_ifft((1im * KY .* AA .* Uc); flags=FFTW.ESTIMATE)
p_ifft_dVdx = plan_ifft((1im * KX .* AA .* Vc); flags=FFTW.ESTIMATE)
p_ifft_dVdy = plan_ifft((1im * KY .* AA .* Vc); flags=FFTW.ESTIMATE)
p_fft_dUconv = plan_fft((dt*(-U .* dUdx -V .* dUdy) / 2); flags=FFTW.ESTIMATE)
p_fft_dVconv = plan_fft((dt*(-U .* dVdx -V .* dVdy) / 2); flags=FFTW.ESTIMATE)

# Navier-Stokes simulation begins
@time begin
for t=1:simutimeSteps
    @printf("| Time step # %04d |\n",t)

    #SolveNavierStokes2D;
    Uold = Uhat; 
    Vold = Vhat;
    Uc   = Uhat; 
    Vc   = Vhat;

    for rk in 1:4

        U = real.(p_ifft_Uhat*(AA .* Uc))
        V = real.(p_ifft_Vhat*(AA .* Vc))

        
        dUdx .= real.(p_ifft_dUdx * (1im * KX .* AA .* Uc));
        dUdy .= real.(p_ifft_dUdy * (1im * KY .* AA .* Uc));
        dVdx .= real.(p_ifft_dVdx * (1im * KX .* AA .* Vc));
        dVdy .= real.(p_ifft_dVdy * (1im * KY .* AA .* Vc));
        dUconv .= p_fft_dUconv*(dt*(-U .* dUdx -V .* dUdy) / 2);
        dVconv .= p_fft_dVconv*(dt*(-U .* dVdx -V .* dVdy) / 2);
        
        dUdiff .= nu * dt * ((-KX .* KX .* Uc) + (- KY .* KY .* Uc)) ;
        dVdiff .= nu * dt * ((-KX .* KX .* Vc) + (- KY .* KY .* Vc)) ;

        dU .= dUconv + dUdiff;
        dV .= dVconv + dVdiff;

        Uhat = Uhat-(KX.*Uhat + KY.*Vhat).*KX./(KX.^2 + KY.^2); 
        Uhat[isnan.(Uhat)] .= 0;
        Vhat = Vhat-(KX.*Uhat + KY.*Vhat).*KY./(KX.^2 + KY.^2); 
        Vhat[isnan.(Vhat)] .= 0;

        if rk < 4
            Uhat = Uold + b[rk] * dU
            Vhat = Vold + b[rk] * dV
        end        

        Uc = Uc + a[rk] * dU;
        Vc = Vc + a[rk] * dV;

    end

    Uhat = Uc
    Vhat = Vc

    Uhat = Uhat - ((KX .* Uhat + KY .* Vhat) .* KX ./ (KX .^ 2 + KY .^ 2))
    Uhat[isnan.(Uhat)] .= 0

    Vhat = Vhat - ((KX .* Uhat + KY .* Vhat) .* KY ./ (KX .^ 2 + KY .^ 2))
    Vhat[isnan.(Vhat)] .= 0

    U = real.(p_ifft_Uhat*Uhat);  
    V = real.(p_ifft_Vhat*Vhat);

    # SaveInstantField2D
    if mod(t, HowOftenSave) == 0
    filename = "uv$(@sprintf("%04d", t)).mat"
    matwrite(filename, Dict("U" => U, "V" => V))
    end

end 
end

nilshg · October 24, 2023, 8:22am

Have you read

https://docs.julialang.org/en/v1/manual/performance-tips/

?

It seems that you are running your code in global scope, which is terrible for performance and will lead to excessive allocations. Make sure that your code is wrapped in a function when you run it.

Also I’m not a differential equation expert but the references to “Runge-Kutta coefficients” in your code make me suspect that you are rolling your own rk4 implementation here? If that’s the case I would concur with Oscar’s advice in a thread a few weeks ago:

Take1234 · October 24, 2023, 8:51am

Thanks for the advice.
I have tried to enclose all code in functions.
However, nothing has changed.
Specifically, I wrapped everything from “Setparameter” to “SaveInstantFields2D” in functions.
Then call function for simulation.

I honestory want to run rk4 from DefferentialEquattion.jl. But my Lab’s rk4 code progression is little unique so I cannot write rk4 from pakkage…

nilshg · October 24, 2023, 9:25am

In that case I suppose you will have to start profiling your code to see where the time is spent and allocations are coming from using something like

Here’s a whole recent thread on profiling allocations:

fatteneder · October 24, 2023, 11:41am

There are a few performance gotaches in your code that can be easily addressed:

The FFT plans you created allocate extra memory whenever you apply them with the * operator. To avoid that you should preallocate storage for those results and use the mul! function to evalute the FFT into a buffer. This is mentioned in the docs here (which unfortunately are separated from the docs of FFTW.jl) API · AbstractFFTs.jl
I see you spliced in many . do broadcast operations, however, there are a few lines where you missed one or two. Perhaps consider the @. macro instead that automatically applies . to every operation in the expression it preceeds, see Arrays · The Julia Language
Try following the performance tips that were linked earlier. In particular try wrapping everything (including the non-constant parameters) into a function.
Try using a profiler like BenchmarkTools.jl, PProf.jl or ProfileView.jl to get detailed infos of where your code is bottlenecked.

abraemer · October 24, 2023, 11:57am

I don’t believe this code is doing what it is supposed to do or at least it is horribly inefficient in what it tries to do. I put some annotations below.

Take1234:

# note that this does not copy the arrays or anything.
# you probably want to copy the matrices here and then overwrite them in the loop
Uold = Uhat; 
Vold = Vhat;
Uc   = Uhat; 
Vc   = Vhat;

for rk in 1:4

    # all of the following lines allocate intermediate arrays because not every operation is a broadcast
    # to avoid allocations you'd need some temporary workspaces
    U = real.(p_ifft_Uhat*(AA .* Uc))
    V = real.(p_ifft_Vhat*(AA .* Vc))

    # why do you multiply with 1im here? FFT is linear, so you could do "-imag(...)" instead of "real(1im *...)"
    # or should the result be purely real? maybe look into FFTW.irfft
    dUdx .= real.(p_ifft_dUdx * (1im * KX .* AA .* Uc)); 
    dUdy .= real.(p_ifft_dUdy * (1im * KY .* AA .* Uc));
    dVdx .= real.(p_ifft_dVdx * (1im * KX .* AA .* Vc));
    dVdy .= real.(p_ifft_dVdy * (1im * KY .* AA .* Vc));
    dUconv .= p_fft_dUconv*(dt*(-U .* dUdx -V .* dUdy) / 2);
    dVconv .= p_fft_dVconv*(dt*(-U .* dVdx -V .* dVdy) / 2);
    
    # this allocates because not every operation is broadcasted
    dUdiff .= nu * dt * ((-KX .* KX .* Uc) + (- KY .* KY .* Uc)) ;
    dVdiff .= nu * dt * ((-KX .* KX .* Vc) + (- KY .* KY .* Vc)) ;
    # do instead: 
    # @. dUdiff = nu * dt * ((-KX * KX * Uc) + (- KY * KY * Uc)) ;
    # "@." means that all operations are broadcasted

    # this also allocates because you did not use ".+"
    dU .= dUconv + dUdiff; # do: dU .= dUconv .+ dUdiff
    dV .= dVconv + dVdiff;

    Uhat = Uhat-(KX.*Uhat + KY.*Vhat).*KX./(KX.^2 + KY.^2);  # this allocates a new matrix
    Uhat[isnan.(Uhat)] .= 0;
    Vhat = Vhat-(KX.*Uhat + KY.*Vhat).*KY./(KX.^2 + KY.^2);  # this allocates a new matrix
    Vhat[isnan.(Vhat)] .= 0;

    if rk < 4
        # here you also allocate new matrices and just throw away what you computed above (except for when rk=4)
        Uhat = Uold + b[rk] * dU 
        Vhat = Vold + b[rk] * dV
    end        

    # here again you are not using in-place operations
    Uc = Uc + a[rk] * dU;
    Vc = Vc + a[rk] * dV;

end

To use inplace ffts, I think it is sufficient to append a “!” to your plan_fft calls.

stevengj · October 24, 2023, 12:32pm

To apply a plan with no allocations, you have two choices:

Use a pre-allocated output array and use mul!, e.g. mul!(output_array, p_ifft_dUdx, output_array, input_array)
Create an in-place plan with plan_fft!

Note that you have lots of other unnecessary memory allocations.

e.g. 1im * KX .* AA .* Uc allocates two arrays, one for 1im * KX and one for for the result of the other multiplications (since the 1im * operation is not fused, see e.g. the more dots blog post). If you want to make it fully allocation-free you would also want to pre-allocate the output, e.g. @. dUdx_fourier = 1im * KX * AA * Uc (@. makes it a lot more readable), then do dUdx .= real.(p_ifft_dUdx * dUdx_fourier) with your in-place plan (via plan_fft!).
complex(zeros(N, N)) allocates an array of real zeros, then allocates a new array of complex zeros. To use a single allocation, do zeros(ComplexF64, N, N).
dU .= dUconv + dUdiff allocates an array for the right hand side, because you didn’t use .+
Uhat = Uhat-(KX.*Uhat + KY.*Vhat).*KX./(KX.^2 + KY.^2) allocates, because you didn’t use .+ or .=. (Would have been clearer just to use @.)
Uhat[isnan.(Uhat)] .= 0 allocates an array for the result of isnan.(Uhat)
etc.

PS. There is no need for a semicolon (;) after every statement.

PPS. See the very first performance tip in the manual: performance-critical code should be inside a function, and should not use globals. Simply putting the code inside a function is not enough if your parameters are still global variables, not function arguments.

PPPS. In general, porting Matlab code line-by-line will result in suboptimal Julia code. You take full advantage of Julia by writing in a Julia style, not in a Matlab style.

PPPPS. If your FFT outputs are always real, i.e. your FFT inputs have conjugate symmetry, then you can save a factor of two in time and memory by using the rfft interface (where you only store the non-redundant half in the Fourier domain). This is a little trickier to work with, however, so I would put that off until you’ve gotten everything else going smoothly and have a better understanding of Julia.

Topic		Replies	Views
How I can change script for speed up General Usage	4	490	October 4, 2023
Repeated allocations for fft in pseudospectral code Performance fftw , fft	1	94	July 4, 2025
Scaled FFT implementation seems to only use one thread despite setting no. of threads to 8 New to Julia fftw , multithreading	2	491	October 27, 2022
Performance advice needed Performance fftw	17	1319	January 17, 2020
Optimizing a Spectral Proper Orthogonal Decomposition Method Performance	19	694	May 6, 2022

About In-line-fft

Related topics