Reducing GC time in loop using FLoops.jl

I’m trying to speed up the runtime of a back-projection image formation algorithm for SAR and want to see how far things can be pushed for the naive version using parallel computing and the speed of Julia (I’m aware that there are more efficient factorized versions of this algorithm).

Here is a MWE:

module Test

using LinearAlgebra
using Interpolations
using FLoops
using BenchmarkTools

const c = 3e8

function backprojection(itp, positions, xs, ys, fcomp)
    @floop for (k,q) in enumerate(positions)
        p = zeros(3)
        subimage = zeros(Complex{Float64}, length(xs), length(ys))

        for (i,x) in enumerate(xs), (j,y) in enumerate(ys)
            # point coordinates in the image plane
            @inbounds p[1] = x
            @inbounds p[2] = y
            # range between radar and image point
            r = norm(p - q)
            # corresponding propagation delay
            td = 2*r/c
            # remove initial carrier phase and add contribution
            # of this radar pulse to image point
            @inbounds subimage[i, j] += itp(k, td) * fcomp(td)
        end

        @reduce(image .+= subimage)
    end
    image
end

function run(M, N)
    # dummy data
    data = randn(Complex{Float64}, M, 4096)
    
    # sample points for interpolation
    ax1 = 1:size(data, 1)
    ax2 = 1:size(data, 2)

    # x and y image points
    xs = range(-2.0, 2.0, length=N)
    ys = range(-2.0, 2.0, length=N)

    # dummy position vectors
    positions = [randn(3) for _ in ax1]

    # dummy compensation function
    fcomp(td) = exp(2im*pi*td)

    # interpolation object for data array
    itp = interpolate((ax1, ax2), data, (NoInterp(), Gridded(Linear())))

    # compute backprojection sum
    @benchmark backprojection($itp, $positions, $xs, $ys, $fcomp)
end

end

Benchmarking it using BenchmarkTools.jl gives the following result

$ julia -t 4
julia> using Revise
julia> include("Test.jl")
julia> Test.run(400, 256)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  913.720 ms … 951.016 ms  ┊ GC (min … max): 35.12% … 37.45%
 Time  (median):     929.735 ms               ┊ GC (median):    36.39%
 Time  (mean ± σ):   931.355 ms ±  15.284 ms  ┊ GC (mean ± σ):  36.32% ±  0.93%

  ▁            ▁                                     █
  █▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁ ▁
  914 ms           Histogram: frequency by time          944 ms <

 Memory estimate: 3.51 GiB, allocs estimate: 26216432.

It seems like a large amount of time is still spent on garbage collection. I’m afraid it might have to do with the allocation of the subimage array in the backprojection function.
There might be a way to avoid allocating subimage in every iteration of the loop but alas the API for FLoops.jl is a little confusing to me so I’m not sure how exactly that would work.

Is there a way to reduce the GC time in this particular function?

Thanks!

pull this out of for-loop and fill!(subimage, 0.0im) inside loop?

Well, the best I’ve managed to come up with for now seems to be this

function backprojection(itp, positions, xs, ys, fcomp; c=3e8)
    @floop for (k,q) in enumerate(positions)
        @init subimage = Matrix{Complex{Float64}}(undef, length(xs), length(ys))
        @init p = zeros(3)

        fill!(subimage, 0.0im)

        for (i,x) in enumerate(xs), (j,y) in enumerate(ys)
            # point coordinates in the image plane
            @inbounds p[1] = x
            @inbounds p[2] = y
            # range between radar and image point
            r = norm(p - q)
            # corresponding propagation delay
            td = 2*r/c
            # remove initial carrier phase and add contribution
            # of this radar pulse to image point
            @inbounds subimage[i, j] += itp(k, td) * fcomp(td)
        end

        @reduce(image += subimage)
    end
    image
end

A lot of the garbage collection time also seems to stem from the multi-threading as well, but the number of allocations stays the same. I have no idea how else to optimize this.

$ julia
julia> Test.run(400,256)
BenchmarkTools.Trial: 3 samples with 1 evaluation.
 Range (min … max):  1.915 s …  1.927 s  ┊ GC (min … max): 3.74% … 3.84%
 Time  (median):     1.916 s             ┊ GC (median):    3.86%
 Time  (mean ± σ):   1.919 s ± 6.565 ms  ┊ GC (mean ± σ):  3.81% ± 0.06%

  █ █                                                    █
  █▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.91 s        Histogram: frequency by time        1.93 s <

 Memory estimate: 3.13 GiB, allocs estimate: 26215209.
$ julia -t 4
julia> Test.run(400,256)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  878.112 ms … 915.199 ms  ┊ GC (min … max): 35.36% … 36.69%
 Time  (median):     900.028 ms               ┊ GC (median):    35.92%
 Time  (mean ± σ):   899.184 ms ±  12.647 ms  ┊ GC (mean ± σ):  36.06% ±  0.61%

  █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  900 ms           Histogram: frequency by time          915 ms <

 Memory estimate: 3.13 GiB, allocs estimate: 26215260.