Persistent arrays (or parameters) to avoid re-allocation

Hello there

I have a question about how to avoid array re-allocation (without changing a function's API) by making variables (especially arrays) persistent from one function call to the next, specialized on the input type and size, and belonging to the scope of the function.

A simple example: a function that computes the squared modulus of the FFT. Note that this example is representative of what I often do: small functions called a very high number of times, where pre-allocation saves a lot.

    using FFTW
    N = 1024
    sig = randn(Complex{Float64},N)
    function directSquareModulusFFT(x::Vector{Complex{T}}) where T
        # Compute FFT using FFTW
        z = fft(x)
        # Compute square modulus 
        return abs2.(z)
    end
    directSquareModulusFFT(sig)

Some modifications can be made to speed up the process and make it more efficient:

    using FFTW, LinearAlgebra
    N = 1024
    sig = randn(Complex{Float64},N)
    function allocatedSquareModulusFFT(x::Vector{Complex{T}}) where T
        nbSamples = length(x)
        # --- Init all stuff and prepare FFT
        abs_out   = zeros(T,nbSamples)
        planFFT   = plan_fft(copy(x);flags=FFTW.PATIENT)
        internal  = zeros(Complex{T},nbSamples)
        # --- Compute FFT
        mul!(internal,planFFT,x) # This is an FFT
        # --- Abs2
        for i in 1:nbSamples   # Can be even faster with @avx and @inbounds but not the topic here :)
            abs_out[i] = abs2(internal[i])
        end
        return abs_out
    end
    allocatedSquareModulusFFT(sig)

The key aspect here is that allocatedSquareModulusFFT may be called very often, so the pre-allocation should not be done inside the function, to save memory and time:

    using FFTW, LinearAlgebra
    N = 1024
    sig = randn(Complex{Float64},N)
    # --- Init all stuff and prepare FFT
    abs_out  = zeros(eltype(real(sig)),length(sig))
    planFFT  = plan_fft(copy(sig);flags=FFTW.PATIENT)
    internal = similar(sig)
    # We can now call the function with the temp variables
    function preAllocatedSquareModulusFFT(x::Vector{Complex{T}},abs_out,planFFT,internal) where T
        nbSamples = length(x)
        # --- Compute FFT
        mul!(internal,planFFT,x) # This is an FFT
        # --- Abs2
        for i in 1:nbSamples   # Can be even faster with @avx and @inbounds but not the topic here :)
            abs_out[i] = abs2(internal[i])
        end
        return abs_out
    end
    preAllocatedSquareModulusFFT(sig,abs_out,planFFT,internal)

The last function is totally OK in terms of performance, but it requires putting some setup code before the function call, and all the intermediate variables depend only on the type and the size of the input array.
I can of course wrap all the temp variables into a parametric structure and pass it on each call to preAllocatedSquareModulusFFT, but I was wondering whether there is (or whether we can create) a way to automatically pass (and hide :D) the intermediate variables from one call to the next without a performance penalty.
I was thinking of a macro, for instance, that would do the following:

    using FFTW, LinearAlgebra
    N = 1024
    sig = randn(Complex{Float64},N)
    function magicSquareModulusFFT(x::Vector{Complex{T}}) where T
        nbSamples = length(x)
        @persistent begin
            # --- Init all stuff and prepare FFT
            abs_out   = zeros(T,nbSamples) #@MVector
            planFFT   = plan_fft(copy(x);flags=FFTW.PATIENT)
            internal  = similar(x)
        end
        # --- Compute FFT
        mul!(internal,planFFT,x) # This is an FFT
        # --- Abs2
        for i in 1:nbSamples   # Can be even faster with @avx and @inbounds but not the topic here :)
            abs_out[i] = abs2(internal[i])
        end
        return abs_out
    end
    magicSquareModulusFFT(sig)

with the code in @persistent evaluated only at the first function call. On subsequent calls (at least with the same function signature), the variables defined in @persistent are already in scope.

The ultimate goal is to hide the boilerplate associated with pre-allocation, so that the high-level API of a function does not require a prior init function call.
We would keep the benefits of the optimisation we have made, but also the simpler API definition (one input parameter and that’s all, folks).
It is also appealing for prototyping, as one can just play with and change what is persistent without having to restart Julia after structure modifications, and without changing the parent script (with the init function).

Alternative approaches I have thought about (without good results :()

  • Create an initSquareModulusFFT(x) function and then call it inside magicSquareModulusFFT using memoize. The performance penalty seems to be huge, as the underlying Dict is typed Any (?)
  • I have seen that another approach can be to use a closure with let, but it will not (I think?) adapt to the array type and size (see feature request: local static variables · Issue #15056 · JuliaLang/julia · GitHub)
  • Create a structure with all the intermediate parameters and then use the structure as a callable function (call overloading). But I don’t really see how the boilerplate can be avoided (without an init in the parent function here) (see Emulate local static variable?)
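For reference, the let-based closure from the second bullet might be sketched like this (names are made up, and the FFT plan is omitted for brevity). It also shows the limitation mentioned above: the captured buffers are fixed to one element type and size at creation.

```julia
# A closure over pre-allocated buffers: the `let` block runs once, and the
# returned anonymous function reuses the captured arrays on every call.
# Limitation: the buffers are sized and typed for one specific input.
squareModulusFFT_1024 = let internal = zeros(ComplexF64, 1024),
                            abs_out  = zeros(Float64, 1024)
    function (x::Vector{ComplexF64})
        copyto!(internal, x)   # a real version would mul! an FFT plan here
        for i in eachindex(x)
            abs_out[i] = abs2(internal[i])
        end
        return abs_out
    end
end
```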

So, any remarks would be welcome here

TensorOperations does something like this, so that when you ask it for A * B * C, the intermediate storage for A * B is kept and re-used (for the next call of the same size & type). This uses LRUCache.jl to make sure it won’t eat all your memory, although perhaps if your arrays are smaller & you don’t care about threading you could just use a Dict.

If you only ever want, say, Matrix{ComplexF64}, then types can be easy; if not, then I think you may be able to wrap the business of dealing with the Dict in a function (which either creates or looks up) which is type-stable overall. This function may also want to skip the Dict below some threshold size.
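A minimal sketch of that idea (all names are mine, not from TensorOperations): cache scratch buffers in a Dict keyed on element type and length, and recover type stability with a type assertion acting as a function barrier. As noted above, a real version might use LRUCache.jl and a lock for thread safety.

```julia
# Cache of scratch buffers, keyed on (element type, length). The Dict
# values are untyped (Any), so the retrieved buffer is passed through a
# type assertion to recover type stability.
const BUFFER_CACHE = Dict{Tuple{DataType,Int},Any}()

function get_buffer(::Type{T}, n::Integer) where T
    get!(BUFFER_CACHE, (T, n)) do
        Vector{T}(undef, n)
    end::Vector{T}   # assert the concrete type for the compiler
end

# The kernel behind the barrier is fully type-stable.
function square_modulus!(out::Vector{T}, x::Vector{Complex{T}}) where T
    for i in eachindex(x)
        out[i] = abs2(x[i])
    end
    return out
end

square_modulus(x::Vector{Complex{T}}) where T =
    square_modulus!(get_buffer(T, length(x)), x)
```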


Any approach which tries to automatically make the data persistent will introduce a bunch of “fun” opportunities for error, like thread safety (what happens if two threads try to call this function at the same time?) and debugging (how can you test the contents of the persistent data if it’s totally hidden?).

By far my favorite way to handle this situation is:

  1. Create a Workspace struct holding the relevant pre-allocated stuff
  2. Change the function signature to `foo!(workspace::Workspace, x::Vector)`
  3. Create a convenience function `foo(x) = foo!(Workspace(x), x)`

Then in performance-critical code, you can pre-allocate the Workspace and keep it around, calling foo!(workspace, x) every time. But for testing or simple interactive usage, you can call foo(x) which will construct a temporary Workspace for you.

This avoids any need for metaprogramming, makes thread-safety easy, and lets you test the contents of the Workspace after the fact if necessary.
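A minimal sketch of that pattern applied to the FFT example above (the names Workspace and squareModulusFFT! are illustrative, not an established API):

```julia
using FFTW, LinearAlgebra

# Pre-allocated scratch space: the fields mirror the variables from the
# pre-allocated version (output buffer, FFT plan, FFT output buffer).
struct Workspace{T,P}
    abs_out::Vector{T}
    planFFT::P
    internal::Vector{Complex{T}}
end

# Build a Workspace sized and typed for a given input vector.
Workspace(x::Vector{Complex{T}}) where T =
    Workspace(zeros(T, length(x)),
              plan_fft(copy(x); flags=FFTW.PATIENT),
              similar(x))

# In-place version: all temporaries come from the Workspace.
function squareModulusFFT!(ws::Workspace, x::Vector{<:Complex})
    mul!(ws.internal, ws.planFFT, x)  # FFT into the scratch buffer
    ws.abs_out .= abs2.(ws.internal)
    return ws.abs_out
end

# Convenience method for interactive use: allocates a fresh Workspace.
squareModulusFFT(x) = squareModulusFFT!(Workspace(x), x)
```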


I personally would prefer your option 3:

Create a structure with all the intermediate parameters and then use the structure as a callable function (call overloading). But I don’t really see how the boilerplate can be avoided (without an init in the parent function here) (see Emulate local static variable?)

It stores the persistent arrays in a transparent manner and leads to nice code. The boilerplate could look like this:

    using FFTW, LinearAlgebra
    struct preAllocatedSquareModulusFFT{T,P,U}
        abs_out::Vector{T}
        planFFT::P
        internal::Vector{U}
    end
    function preAllocatedSquareModulusFFT(sig)
        abs_out  = zeros(eltype(real(sig)),length(sig))
        planFFT  = plan_fft(copy(sig);flags=FFTW.PATIENT)
        internal = similar(sig)
        preAllocatedSquareModulusFFT(abs_out, planFFT, internal)
    end

    function (f::preAllocatedSquareModulusFFT)(x)
        nbSamples = length(x)
        # --- Compute FFT
        mul!(f.internal,f.planFFT,x) # This is an FFT
        # --- Abs2
        for i in 1:nbSamples   # Can be even faster with @avx and @inbounds but not the topic here :)
            f.abs_out[i] = abs2(f.internal[i])
        end
        return f.abs_out
    end

You would then call the function as follows:

    N = 1024
    sig = randn(Complex{Float64},N)

    # Create the struct once
    f = preAllocatedSquareModulusFFT(sig)

    # And call it (multiple times)
    f(sig)

Fair point! Note that by hidden I mean that you do not have to put the variables into the function call, but you still see them in the function scope.

Yes! This is exactly how I do it right now, and in performance-critical code I always pre-allocate the Workspace. I was just wondering whether there is a fancy way to instantiate and call it in one go.