# Speeding up a nested for loop

I have a nested for-loop where the length of the internal loop is dependent on the outer loop. Any ideas on how to squeeze more performance out of this?

``````
N = 500

a = rand(N)
b = rand(N)
c = rand(N)

out = zeros(N)

j_min = round(Int, 0.58N)

function foo1!(out, a, b, c, j_min, N)
    @inbounds for j in eachindex(a)[j_min:end]
        a_j = a[j]
        for i in eachindex(out, b, c)[begin:j]
            idx_b = i
            idx_c = i - j + N

            bc = b[idx_b] + c[idx_c]

            out[i] += a_j * bc
        end
    end

    return nothing
end

return nothing
end

using BenchmarkTools
@btime $foo1!($out, $a, $b, $c, $j_min, $N) # 16.750 μs (0 allocations: 0 bytes)
``````

Something easy to try is the `LoopVectorization.jl` package. This function may be memory bound already (and your access pattern doesn't look bad), so it may or may not offer a significant speedup. If it isn't memory bound already, you might eke a tiny bit more out of `out[i] = muladd(a_j, bc, out[i])`.
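
For reference, `muladd(x, y, z)` computes `x*y + z` while permitting the compiler to emit a single fused multiply-add instruction:

``````
muladd(2.0, 3.0, 1.0)  # 7.0, i.e. 2.0*3.0 + 1.0, possibly fused into one FMA
``````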

In any case, once this is memory bound there won’t be a lot you can do.

However! It’s a little tough for me to figure out exactly what math is happening here, so forgive me if these suggestions are wrong. But it looks like there might be room for some algorithmic improvements.

• Notice that you’re adding `b[i]` to `out[i]` many times (scaled by elements of `a`). You can probably handle that part with a single loop by accumulating the scalings by `a` as you go… something like `out[i] += a_sum_so_far * b[i]`.
• The contribution from `c` looks like a convolution with `a`. If so, it should be computable via fast convolution (i.e., using a fast Fourier transform) which would be more efficient than a double-loop.

NOTE: Your accesses to `c[idx_c]` aren't provably `@inbounds` here. You should probably have checks for that before some bad value of `N` causes trouble. That said, it looks like you may not need to pass `N` at all (at least for this example) since you can just get it from the `length` of one of the vectors.
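
A sketch of that last point (the name `foo1_checked!` is mine, not from the original code): derive `N` from the inputs and validate the lengths up front, so the `@inbounds` annotation is actually safe:

``````
# A sketch: derive N from the inputs and check lengths before trusting @inbounds
function foo1_checked!(out, a, b, c, j_min)
    N = length(out)
    length(a) == length(b) == length(c) == N ||
        throw(DimensionMismatch("out, a, b, and c must have equal length"))
    1 <= j_min <= N || throw(ArgumentError("j_min must lie in 1:$N"))
    # with these checks, idx_c = i - j + N stays within 1:N for 1 <= i <= j <= N
    @inbounds for j in j_min:N
        a_j = a[j]
        for i in 1:j
            out[i] += a_j * (b[i] + c[i - j + N])
        end
    end
    return nothing
end
``````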


Before improving the runtime with the use of tools, you can do much better by improving the math.

If what you show is your intended calculation and not an oversimplification, then I would suggest the following:

• `out[i]` has `a[j]` as a common factor across the inner loop: do the scaling only once after the `i` loop, not inside it.

• the unscaled `out[i]` is a sum over slices of `b` and `c`. Compute those partial sums incrementally, adding/subtracting single elements as you go rather than starting from the beginning every time.

Fixing the math will gain you much more, and afterwards you can get fancy vectorizing etc.
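
A tiny illustration of the second point (just the sliding-sum idea, not the full kernel): moving the edge of a slice updates the partial sum with one subtraction instead of a fresh `sum`:

``````
a = collect(1.0:10.0)
s = sum(@view a[3:end])   # partial sum over a[3:end]
s -= a[3]                 # slide the left edge from 3 to 4
s == sum(@view a[4:end])  # true
``````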


Thanks @pitsianis and @mikmoore – the algorithmic improvements are exactly what I was trying to do but couldn’t quite figure out. Breaking down the problem into parts helped a lot.

I made a running sum of `a` for the `b` contribution and made a fast dot product with `LoopVectorization` for the `c` contribution (maybe there are better convolution tools?). `foo4!` is now ~54% faster than `foo1!`.

``````
function foo2!(out, a, b, c, j_min)
    N = length(out)

    ## b section
    a_sum = @views sum(a[j_min:end])  # Initialize a running sum variable

    @inbounds for i in 1:j_min
        out[i] += a_sum * b[i]
    end

    @inbounds for i in j_min+1:N
        a_sum -= a[i-1]  # Update the running sum

        out[i] += a_sum * b[i]
    end

    ## c section
    @inbounds for j in eachindex(a)[j_min:end]
        a_j = a[j]
        for i in eachindex(out, c)[begin:j]
            out[i] += a_j * c[i - j + N]
        end
    end

    return nothing
end

using LoopVectorization
function myconv(x, x_start, y, y_end, n)
    out = 0.0
    @turbo for i in 0:n-1
        out += x[x_start+i] * y[y_end-i]
    end

    return out
end

function foo3!(out, a, b, c, j_min)
    N = length(out)

    ## b section
    a_sum = @views sum(a[j_min:end])  # Initialize a running sum variable

    @inbounds for i in 1:j_min
        out[i] += a_sum * b[i]
    end

    @inbounds for i in j_min+1:N
        a_sum -= a[i-1]  # Update the running sum

        out[i] += a_sum * b[i]
    end

    ## c section
    @inbounds for i in 1:j_min
        out[i] += myconv(a, j_min, c, N - j_min + i, N - j_min + 1)
    end

    @inbounds for i in j_min+1:N-1
        out[i] += myconv(a, i, c, N, N-i+1)
    end
    out[end] += a[end] * c[end]

    return nothing
end

function foo4!(out, a, b, c, j_min)
    N = length(out)

    ## combined b and c
    a_sum = @views sum(a[j_min:end])  # Initialize a running sum variable

    @inbounds for i in 1:j_min
        out[i] += muladd(
            a_sum,
            b[i],
            myconv(a, j_min, c, N - j_min + i, N - j_min + 1)
        )
    end

    @inbounds for i in j_min+1:N-1
        a_sum -= a[i-1]  # Update the running sum

        out[i] += muladd(
            a_sum,
            b[i],
            myconv(a, i, c, N, N-i+1)
        )
    end
    a_sum -= a[end-1]
    out[end] += a_sum * b[end] + a[end] * c[end]

    return nothing
end

out1 = zeros(N);
out2 = zeros(N);
out3 = zeros(N);
out4 = zeros(N);

foo1!(out1, a, b, c, j_min, N);
foo2!(out2, a, b, c, j_min);
foo3!(out3, a, b, c, j_min);
foo4!(out4, a, b, c, j_min);

out1 ≈ out2 ≈ out3 ≈ out4 # true
``````
``````
using BenchmarkTools
@btime $foo1!($out1, $a, $b, $c, $j_min, $N) # 17.333 μs (0 allocations: 0 bytes) (different computer)
@btime $foo2!($out2, $a, $b, $c, $j_min) # 13.416 μs (0 allocations: 0 bytes)
@btime $foo3!($out3, $a, $b, $c, $j_min) # 11.166 μs (0 allocations: 0 bytes)
@btime $foo4!($out4, $a, $b, $c, $j_min) # 11.250 μs (0 allocations: 0 bytes)
``````

Check `conv` from `DSP.jl` for an implementation of fast convolution. You’ll have to subselect the output to get your result (it will compute the full length output), but it will be significantly faster than a two-loop implementation.
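
For clarity on what to subselect: `conv(x, y)` returns the full convolution, of length `length(x) + length(y) - 1`. The helper below is only an illustrative stand-in for that output convention (it is not DSP.jl's implementation, which uses FFTs for large inputs):

``````
# Naive stand-in with the same output convention as DSP.conv
function fullconv(x, y)
    n, m = length(x), length(y)
    out = zeros(n + m - 1)
    for i in 1:n, j in 1:m
        out[i + j - 1] += x[i] * y[j]
    end
    return out
end

fullconv([1.0, 2.0, 3.0], [4.0, 5.0])  # [4.0, 13.0, 22.0, 15.0]
``````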


As @mikmoore suggested, using DSP's `conv` the speed can be improved for larger(-ish) `N`:

``````
using DSP

function fooDSP!(out, a, b, c, j_min, N)
    A, B, C = reverse(a), @view(b[end:-1:1]), @view(c[end:-1:1])
    A[N-j_min+2:N] .= 0.0
    @views out[N:-1:1] .= conv(A[1:N-j_min+1], C)[1:N] .+ cumsum(A) .* B
    return nothing
end
``````

Trying this with `N = 5000`, it was about 5x faster than the non-DSP implementation.

If the inputs and outputs can be generated already reversed (and with the tail zeroed) in the form `conv` expects, a little further cleanup is possible.
