Nested for loops vs Iterators.product performance

Assume I have some vectors a,b,c… and I want to compute
\sum_{ijk..} a_i b_j c_k ...

For 3 lists, I can do it using nested for loops :

function sumprod1(a, b, c)
    s = 0
    for i in 1:length(a)
        for j in 1:length(b)
            for k in 1:length(c)
                s += a[i]*b[j]*c[k]
            end
        end
    end
    return s
end

Or wite a general function that make use of Iterators.product :

function sumprod2(lists::Vararg{Vector{Float64}})
    s = 0
    p = [length(list) for list in lists]
    ranges = [1:i for i in p]
    for inds in Iterators.product(ranges...)
        c = [lists[i][inds[i]] for i in 1:length(inds)]
        s += reduce(*, c)
    end
    return s
end

The problem is that the general solution is much slower than the nested for loops:

N = 2^8
a = rand(N)
b = rand(N)
c = rand(N)
@time println(sumprod1(a,b,c))
@time println(sumprod2(a,b,c))

returns

  0.040710 seconds (9.22 k allocations: 719.312 KiB, 62.68% compilation time)
  8.615370 seconds (151.25 M allocations: 7.267 GiB, 2.29% gc time, 1.29% compilation time)

How can I understand this ? Is there a way to speedup the general solution ?

You have no hope of coming close to the for loops here, because length(ranges) in your code is not known until runtime, so the Iterators.product is type-unstable.

If you want to have decent performance with Iterators.product, you need to use containers whose lengths (i.e. the “dimensionality” of your loop) are known at compile time, like tuples.

I see that most of the time is spent in lists[i][inds[i]] . Could it be that the problem comes from not knowing the order in which the vectors are iterated over ? Could there be a way to specify they will be iterated over in sequential order ?
Also the second option allocates way more memory than the first one, maybe suggesting that its not using the cache ? Can I force cache usage ?

The length of lists is known from type-level information, but this expression “loses” that quality, the length of p perhaps isn’t known to the compiler. Also it allocates unnecessarily. Same with ranges and c.

Try something like this instead:

function sumprod2(lists::Vararg{Vector{Float64}})
    p = map(length, lists)
    to_range = n -> Base.OneTo(n)
    ranges = map(to_range, p)
    s = 0
    for inds in Iterators.product(ranges...)
        c = (lists[i][inds[i]] for i in 1:length(inds))
        s += reduce(*, c)
    end
    return s
end

Everything required for a fast version is known at compile time, but with varargs (the list... argument) you sometimes need to force specialization using the type annotation Vararg{T,N} where {N}. Also, your implementation had issues with excessive allocation of intermediate arrays, as well as initializing s to a zero of the wrong type (0 is an integer zero). Here’s a fast version:

function sumprod3(lists::Vararg{Any,N}) where {N}
    s = prod(zero ∘ eltype, lists)
    for inds in Iterators.product(eachindex.(lists)...)
        s += prod(lists[i][inds[i]] for i in eachindex(lists, inds))
    end
    return s
end

Benchmarked against a version of sumprod1 with the initialization issue fixed:

julia> using BenchmarkTools

julia> @btime sumprod1fixed($a, $b, $c)
  35.544 ms (0 allocations: 0 bytes)
2.067942582493859e6

julia> @btime sumprod3($a, $b, $c)
  36.265 ms (0 allocations: 0 bytes)
2.0679425824940344e6
Definition of sumprod1fixed
function sumprod1fixed(a, b, c)
    s = zero(eltype(a)) * zero(eltype(b)) * zero(eltype(c))
    for i in eachindex(a), j in eachindex(b), k in eachindex(c)
        s += a[i] * b[j] * c[k]
    end
    return s
end

Avoiding the use of an iterator here could give possible further perf improvement. Something like this:

function sumprod4(lists::Vararg{Any,N}) where {N}
    s = prod(zero ∘ eltype, lists)
    for inds in Iterators.product(eachindex.(lists)...)
        s += mapreduce(getindex, *, lists, inds)
    end
    s
end

That version, verbatim, was my first attempt, but it’s a hair slower, which is why I went with the generator. Julia’s mapreduce appears not to be fully optimized in general.

julia> @btime sumprod4($a, $b, $c)
  40.655 ms (0 allocations: 0 bytes)
2.0679425824940344e6

Awesome, thank you

FWIW, on the laptop in my lap, with nightly Julia, both versions seem equally fast (13.4 ms):

julia> using BenchmarkTools

julia> (@benchmark sumprod3(a, b, c) setup=(N = 2^8; a = rand(N); b = rand(N); c = rand(N)));

julia> median(ans).time
1.33713225e7

julia> (@benchmark sumprod4(a, b, c) setup=(N = 2^8; a = rand(N); b = rand(N); c = rand(N)));

julia> median(ans).time
1.336118e7

Interesting. The difference (~36 ms. vs ~40 ms) is reproducible on my laptop, also when sampling different a, b, c per sample like you did (and using battery saver mode to avoid thermal throttling).

FTR providing an init keyword argument to the mapreduce in the loop body in sumprod4 achieves a further, small, speedup. Or maybe it was just noise, lol.

FWIW:

julia> versioninfo()
Julia Version 1.12.0-DEV.1205
Commit 346f38bceab (2024-09-14 19:40 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

Maybe an improvement between 1.10.5 and 1.12 then, I’m not quite as bleeding edge as you (also I have an Intel i7, but I don’t suppose Intel vs. AMD would matter for the ability to optimize mapreduce relative to generators)

Have you considered re-distributing into sumprodX(vecs...) = prod((sum(vec) for vec in vecs))?

julia> @btime sumprod1($a,$b,$c)
  22.297 ms (0 allocations: 0 bytes)
2.2209763892047782e6

julia> @btime sumprodX($a,$b,$c)
  104.608 ns (0 allocations: 0 bytes)
2.220976389204965e6

Even simpler: prod(sum, vecs)

To clarify a little: The product-of-sums vs sum-of-product formulation is not so much about coding style or type inference, it’s about complexity class (O(n*k) instead of O(n^k), for k vectors of n elements each).

It’s not clear from your question whether that is relevant to your real setting – might as well be that you have a different problem, and you just chose an ill-considered stand-in / model problem for discourse (ill-considered because it admits structure and optimizations that are not present in your real problem). Choosing the right simplified toy model is hard!

Yes exaclty, it is not relevant to my problem. I have a different problem and this sum-of-product was a simplified toy model.