Speeding up a nested for loop

MarcBerliner · September 22, 2023, 12:14am

I have a nested for-loop where the length of the internal loop is dependent on the outer loop. Any ideas on how to squeeze more performance out of this?

N = 500

a = rand(N)
b = rand(N)
c = rand(N)

out = zeros(N)

j_min = round(Int, 0.58N)

function foo1!(out, a, b, c, j_min, N)
    @inbounds for j in eachindex(a)[j_min:end]
        a_j = a[j]
        @inbounds for i in eachindex(out, b, c)[begin:j]
            idx_b = i
            idx_c = i - j + N

            bc = b[idx_b] + c[idx_c]
            
            out[i] += a_j * bc
        end
    end

    return nothing
end

using BenchmarkTools
@btime $foo1!($out, $a, $b, $c, $j_min, $N) # 16.750 μs (0 allocations: 0 bytes)

mikmoore · September 22, 2023, 4:12pm

Something easy is to try to use the LoopVectorization.jl package. Although this function may be memory bound already (and your access pattern doesn’t look bad) so it may or may not offer a significant speedup. If it isn’t memory bound already, then you might eek a tiny bit more out of out[i] = muladd(a_j, bc, out[i]).

In any case, once this is memory bound there won’t be a lot you can do.

However! It’s a little tough for me to figure out exactly what math is happening here, so forgive me if these suggestions are wrong. But it looks like there might be room for some algorithmic improvements.

Notice that you’re adding b[i] to out[i] many times (scaled by elements of a). You can probably handle that part with a single loop by accumulating the scalings by a as you go… something like out[i] += a_sum_so_far * b[i].
The contribution from c looks like a convolution with a. If so, it should be computable via fast convolution (i.e., using a fast Fourier transform) which would be more efficient than a double-loop.

NOTE: Your accesses to c[idx_c] aren’t provably @inbounds here. You should probably have checks for that before some bad value of N causes trouble, That said, it looks like you may not need to pass N at all (at least for this example) since you can just get it from the length of one of the vectors.

pitsianis · September 22, 2023, 4:48pm

Before improving the runtime with the use of tools, you can do much better by improving the math.

If what you show is your intended calculation and not an oversimplification, then I would suggest the following:

out[i] has a[j] as a common factor, do the scaling only once after the i loop, not within.
the unscaled out[i] is the sum of slices of b and c. Compute those partial summations by adding/subtracting to the partial sums and not starting from the beginning every time.

Fixing the math will gain you much more, and afterwards you can get fancy vectorizing etc.

MarcBerliner · September 23, 2023, 2:00pm

Thanks @pitsianis and @mikmoore – the algorithmic improvements are exactly what I was trying to do but couldn’t quite figure out. Breaking down the problem into parts helped a lot.

I made a running sum of a for the b contribution and made a fast dot product with LoopVectorization for the c contribution (maybe there are better convolution tools?). foo4! is now ~54% faster than foo1!.

function foo2!(out, a, b, c, j_min)
    N = length(out)

    ## b section
    a_sum = @views sum(a[j_min:end])  # Initialize a running sum variable

    @inbounds for i in 1:j_min
        out[i] += a_sum * b[i]
    end

    @inbounds for i in j_min+1:N
        a_sum -= a[i-1]  # Update the running sum
        
        out[i] += a_sum * b[i]
    end

    ## c section
    @inbounds for j in eachindex(a)[j_min:end]
        a_j = a[j]
        @inbounds for i in eachindex(out, c)[begin:j]
            # @show i, j, i - j + N
            out[i] += a_j * c[i - j + N]
        end
    end

    return nothing
end

using LoopVectorization
function myconv(x, x_start, y, y_end, n)
    out = 0.0
    @turbo for i in 0:n-1
        out += x[x_start+i] * y[y_end-i]
    end
    
    return out
end

function foo3!(out, a, b, c, j_min)
    N = length(out)

    ## b section
    a_sum = @views sum(a[j_min:end])  # Initialize a running sum variable

    @inbounds for i in 1:j_min
        out[i] += a_sum * b[i]
    end

    @inbounds for i in j_min+1:N
        a_sum -= a[i-1]  # Update the running sum
        
        out[i] += a_sum * b[i]
    end

    ## c section
    @inbounds for i in 1:j_min
        out[i] += myconv(a, j_min, c, N - j_min + i, N - j_min + 1)
    end

    @inbounds for i in j_min+1:N-1
        out[i] += myconv(a, i, c, N, N-i+1)
    end
    out[end] += a[end] * c[end]

    return nothing
end

function foo4!(out, a, b, c, j_min)
    N = length(out)

    ## combined b and c
    a_sum = @views sum(a[j_min:end])  # Initialize a running sum variable

    @inbounds for i in 1:j_min
        out[i] += muladd(
            a_sum,
            b[i],
            myconv(a, j_min, c, N - j_min + i, N - j_min + 1)
        )
    end

    @inbounds for i in j_min+1:N-1
        a_sum -= a[i-1]  # Update the running sum
        
        out[i] += muladd(
            a_sum,
            b[i],
            myconv(a, i, c, N, N-i+1)
        )
    end
    a_sum -= a[end-1]
    out[end] += a_sum * b[end] + a[end] * c[end]

    return nothing
end

out1 = zeros(N);
out2 = zeros(N);
out3 = zeros(N);
out4 = zeros(N);

foo1!(out1, a, b, c, j_min, N);
foo2!(out2, a, b, c, j_min);
foo3!(out3, a, b, c, j_min);
foo4!(out4, a, b, c, j_min);

out1 ≈ out2 ≈ out3 ≈ out4 # true

using BenchmarkTools
@btime $foo1!($out1, $a, $b, $c, $j_min, $N) # 17.333 μs (0 allocations: 0 bytes) (different computer)
@btime $foo2!($out2, $a, $b, $c, $j_min) # 13.416 μs (0 allocations: 0 bytes)
@btime $foo3!($out3, $a, $b, $c, $j_min) # 11.166 μs (0 allocations: 0 bytes)
@btime $foo4!($out4, $a, $b, $c, $j_min) # 11.250 μs (0 allocations: 0 bytes)

mikmoore · September 23, 2023, 4:07pm

Check conv from DSP.jl for an implementation of fast convolution. You’ll have to subselect the output to get your result (it will compute the full length output), but it will be significantly faster than a two-loop implementation.

Dan · September 23, 2023, 10:06pm

As @mikmoore suggested, using DSP’s conv the speed can be improved for larger (-ish) Ns:

using DSP

function fooDSP!(out, a, b, c, j_min, N)
    A,B,C = reverse(a), @view(b[end:-1:1]), @view(c[end:-1:1])
    A[N-j_min+2:N] .= 0.0
    @views out[N:-1:1] .= conv(A[1:N-j_min+1],C)[1:N].+cumsum(A).*B
    return nothing
end

Trying this with N = 5000 was 5x faster than non-DSP implementation.

If the inputs and outputs can be generated reversed (and slightly zeroed) like the conv wants, a little further cleanup is possible.