# Persistant arrays (or parameters) to avoid re-allocation

Hello there

I have a question on how we can avoid array re-allocation (without changing the API of a function) by making variables (and especially array) persistent from one function call to another, specialized on the input type and size and that belong to the scope of a function.

A simple example of a function that computes the square root of the FFT. Note that this example is representative of what I often do, small functions called a very high number of time and where pre-allocation saves a lot.

``````    using FFTW
N = 1024
sig = randn(Complex{Float64},N)
function directSquareModulusFFT(x::Vector{Complex{T}}) where T
# Compute FFT using FFTW
z = fft(x)
# Compute square modulus
return abs2.(z)
end
directSquareModulusFFT(sig)
``````

Some modifications can be made to speed up the process and be more efficient

``````    using FFTW , LinearAlgebra
N = 1024
sig = randn(Complex{Float64},N)
function allocatedSquareModulusFFT(x::Vector{Complex{T}}) where T
nbSamples = length(x)
# --- Init all stuff and prepare FFT
abs_out	  = zeros(T,nbSamples)
planFFT   = plan_fft(copy(x);flags=FFTW.PATIENT)
internal  = zeros(Complex{T},nbSamples)
# --- Compute FFT
mul!(internal,planFFT,x) # This is an FFT
# --- Abs2
for i in 1:1:nbSamples            # Can be even faster with @avx and @inbounds but not the topic here :)
abs_out[i] = abs2(internal[i])
end
return abs_out
end
allocatedSquareModulusFFT(sig)
``````

The key aspect here is that I may call very often `allocatedSquareModulusFFT` so the pre-allocation should not be done in the function to save memory and time.

``````  using FFTW, LinearAlgebra
N = 1024
sig = randn(Complex{Float64},N)
# --- Init all stuff and prepare FFT
abs_out	   = zeros(eltype(real(sig)),length(sig));
planFFT      = plan_fft(copy(sig);flags=FFTW.PATIENT);
internal     = similar(sig);
# We can now call the function with the temp variables
function preAllocatedSquareModulusFFT(x::Vector{Complex{T}},abs_out,planFFT,internal) where T
nbSamples = length(x)
# --- Compute FFT
mul!(internal,planFFT,x) # This is a FFT
# --- Abs2
for i in 1:1:nbSamples            # Can be even faster with @avx and @inbounds but not the topic here :)
abs_out[i] = abs2(internal[i])
end
return abs_out
end
preAllocatedSquareModulusFFT(sig,abs_out,planFFT,internal)
``````

The last function is totally OK in terms of performance but require to put some code before the function call, and all the intermediate variables only depends on the type and the size of the input array.
I can of course wrap all the temp variables into a parametric structure and pass it each time I call `preAllocatedSquareModulusFFT` but I was wondering if there is (or we can create) a way to automatically pass (and hide :D) the intermediate variables from one call to another without performance penalty.
I was thinking on a macro for instance that would do the following

``````    using FFTW
N = 1024
sig = randn(Complex{Float64},N)
function magicSquareModulusFFT(x::Vector{Complex{T}}) where T
@persistent begin
# --- Init all stuff and prepare FFT
abs_out	  = zeros(T,nbSamples); #@MVector
planFFT   = plan_fft(copy(x);flags=FFTW.PATIENT);
internal  = similar(x);
end
# --- Compute FFT
mul!(internal,planFFT, x) # This is a FFT
# --- Abs2
for i in 1:1:nbSamples            # Can be even faster with @avx and @inbounds but not the topic here :)
abs_out[i] = abs2(internal[i])
end
return abs_out
end
magicSquareModulusFFT(sig)
``````

with the code in `@persistent` only evaluated at the fist function call. The other time the function is called (at least with same function signature) the variables defined in `@persistent` are in the scope.

The ultimate goal is to hide the boilerplate associated to pre-allocation such that the high level API of a function does not require a anterior `init` function call.
Having the benefits of the optimisation we have made but also the simpler API definition (one input parameter and that’s all folks).
It is also appealing in protopyting as one can just play and change what is persistant without having to restart Julia in case of structure modifications or without change the parent script (with the init function).

Alternative approaches I have though about (without good results :()

• Create a `initSquareModulusFFT(x)` function and then call it into the `magicSquareModulusFFT` using memoize. It seems that the performance penalty is huge as the `Dict` is `Any` (?)
• I have seen that another approach can be to use closure with `let` but it will not (I think ?) be adapted to the array type and size. (see feature request: local static variables · Issue #15056 · JuliaLang/julia · GitHub)
• Create a structure with all the intermediate parameters and then using the structure as a callable function (call overloading). But I don’t really how the boilerplate can be done (without init here in the parent function) (see Emulate local static variable?)

So, any remarks would be welcome here

TensorOperations does something like this, so that when you ask it for `A * B * C`, the intermediate storage for `A * B` is kept and re-used (for the next call of the same size & type). This uses LRUCache.jl to make sure it won’t eat all your memory, although perhaps if your arrays are smaller & you don’t care about threading you could just use a `Dict`.

If you only ever want say `Matrix{ComplexF64}` then types can be easy; it not then I think you may be able to wrap the business of dealing with the `Dict` in a function (which either creates or looks up) which is type-stable overall. This function may also want to skip the Dict below some threshold size.

1 Like

Any approach which tries to automatically make the data persistent will introduce a bunch of “fun” opportunities for error, like thread safety (what happens if two threads try to call this function at the same time?) and debugging (how can you test the contents of the persistent data if it’s totally hidden?).

By far my favorite way to handle this situation is:

1. Create a `Workspace` struct holding the relevant pre-allocated stuff
2. Change the function signature to `foo!(workspace::Workspace, x::Vector)
3. Create a convenience function `foo(x) = foo(Workspace(x), x)`

Then in performance-critical code, you can pre-allocate the Workspace and keep it around, calling `foo!(workspace, x)` every time. But for testing or simple interactive usage, you can call `foo(x)` which will construct a temporary `Workspace` for you.

This avoids any need for metaprogramming, makes thread-safety easy, and lets you test the contents of the `Workspace` after the fact if necessary.

12 Likes

I personally would prefer your option 3:

Create a structure with all the intermediate parameters and then using the structure as a callable function (call overloading). But I don’t really how the boilerplate can be done (without init here in the parent function) (see Emulate local static variable?)

It stores the persistent arrays in a transparent manner and leads to nice code. The boilerplate could look like this:

``````using FFTW, LinearAlgebra
struct preAllocatedSquareModulusFFT{T,P,U}
absout::Vector{T}
planFFT::P
internal::Vector{U}
end
function preAllocatedSquareModulusFFT(sig)
abs_out   = zeros(eltype(real(sig)),length(sig));
planFFT      = plan_fft(copy(sig);flags=FFTW.PATIENT);
internal     = similar(sig)
preAllocatedSquareModulusFFT(abs_out, planFFT, internal)
end

function (f::(preAllocatedSquareModulusFFT))(x)
nbSamples = length(x)
# --- Compute FFT
mul!(f.internal,f.planFFT,x) # This is a FFT
# --- Abs2
for i in 1:1:nbSamples            # Can be even faster with @avx and @inbounds but not the topic here :)
f.absout[i] = abs2(f.internal[i])
end
return f.absout
end
``````

You would then call the function as follows:

``````N = 1024
sig = randn(Complex{Float64},N)

# Create the struct once
f = preAllocatedSquareModulusFFT(sig)

# And call (multiple times)
f(sig)
``````
2 Likes

Fair point ! Note that by hidden I mean that you do not have to put into the function call but you see it in the function scope

Yes ! this is exactly how I do it right now, and in performance critical code, I always pre-allocate `Workspace` . Just wondering if I there is a fancy way to instantiate and call as it is.